I Built a Browser Skill for AI Because Screenshots Are Too Slow
Last month I asked Claude Code to check my Google Search Console numbers. It opened the browser, took a screenshot, analyzed it, figured out where to click, took another screenshot, analyzed that one, clicked, screenshot, analyze, repeat.
It took nearly 3 minutes to pull data I could get in 20 seconds.
That's the problem with screenshot-based browser automation. It works. But it's so slow that you're better off just doing it yourself.
So I built something different. A single Python script — 1,200 lines — that talks directly to Chrome through its built-in remote control. No screenshots. No pixel analysis. Just text commands in, text data out.
The same Search Console task now takes about 15 seconds.
Why AI browser automation matters
Most of your work happens inside a browser. Not on the open web — behind logins. Your analytics dashboard. Your CRM. Your project management tool. Your accounting portal.
AI assistants can google things. ChatGPT, Gemini, Perplexity — they search the public web well. But they can't log into your tools. They can't check your specific numbers.
Browser automation bridges that gap. It lets AI use your browser — the one where you're already logged in — to do the clicking and reading for you.
The question isn't whether this is useful. It's whether it's fast enough to bother with.
The three approaches
AI web search
What ChatGPT, Gemini, and Perplexity do by default. Search the public web and summarize.
Great for public information. Useless for anything behind a login.
Screenshot-based control
What Claude's computer use and similar tools do. The AI takes a screenshot, processes the image, decides where to click, clicks, takes another screenshot.
The problem is tokens. Every screenshot costs roughly 1,500 tokens. A 10-step task can burn through 15,000+ tokens just on images.
browser-use, the most popular open-source AI browser agent (78,000+ GitHub stars), hit this exact wall. They dropped Playwright entirely in favor of raw CDP. The abstraction layer added too much latency across thousands of calls per session.
A January 2026 benchmark makes the gap stark: 114,000 tokens for an MCP-based approach vs. 7,000 tokens for a CDP-based agent on the same 10-step task.
Direct browser control (CDP)
Chrome has a built-in remote control called CDP — Chrome DevTools Protocol. Same technology that powers Chrome's developer tools.
Instead of taking a picture and guessing what's on screen, CDP lets you read the actual text, click exact elements, and extract data directly.
Screenshot-based is like describing a photo of a spreadsheet over the phone. CDP is like handing someone the actual file.
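Under the hood, CDP is just JSON-RPC over a WebSocket. As a rough sketch (not cdp.py's actual code), framing a command looks like this; `Page.navigate` is a real CDP method, while the helper functions are illustrative:

```python
import itertools
import json

# Each CDP command is a JSON object with a unique id, a method name like
# "Page.navigate", and optional params. A real client sends these frames
# over a websocket to Chrome launched with --remote-debugging-port=9222.
_ids = itertools.count(1)

def cdp_command(method, **params):
    """Build one CDP command frame, ready to send as a websocket text message."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

def cdp_result(raw, expected_id):
    """Match a raw response to a command id and unwrap its result."""
    msg = json.loads(raw)
    if msg.get("id") != expected_id:
        return None  # an event, or a reply to a different command
    if "error" in msg:
        raise RuntimeError(msg["error"]["message"])
    return msg["result"]

frame = cdp_command("Page.navigate", url="https://example.com")
```

Everything the script does, from navigation to cookie handling, is some variation of this request/response exchange.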
How the script works
A Python script called cdp.py sits between Claude Code and Chrome:
```shell
# 1. Launch Chrome + block images/CSS (60-80% faster loads)
cdp.py ensure
cdp.py block

# 2. Restore yesterday's login session
cdp.py cookies_load ~/.cookies/google.json

# 3. Navigate (smart wait — no fixed sleep)
cdp.py navigate "https://search.google.com/search-console"

# 4. What's on this page? (~100 tokens, not 1,500)
cdp.py axtree
# → [navigation] [button "7d"] [button "28d"] [heading "Performance"]

# 5. Click and read
cdp.py click "28d"
cdp.py readable
# → "Performance: 4,820 clicks | 72,100 impressions | 6.7% CTR"

# 6. Save session for next time
cdp.py cookies_save ~/.cookies/google.json
```
Eight commands. Clean text data. No images processed. No re-login needed next time.
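I'm not publishing cdp.py's source here, but its command surface is easy to picture as an argparse subcommand dispatcher. An illustrative sketch, not the actual script:

```python
import argparse

def build_parser():
    """Sketch of cdp.py's command-line surface as argparse subcommands."""
    parser = argparse.ArgumentParser(prog="cdp.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("ensure")  # launch Chrome if it isn't already running
    sub.add_parser("block")   # block images/CSS/fonts/video
    nav = sub.add_parser("navigate")
    nav.add_argument("url")
    for name in ("axtree", "readable", "content", "links", "forms"):
        sub.add_parser(name)  # read-only page views at different precisions
    click = sub.add_parser("click")
    click.add_argument("target")
    for name in ("cookies_save", "cookies_load"):
        p = sub.add_parser(name)
        p.add_argument("path")
    return parser

args = build_parser().parse_args(["navigate", "https://example.com"])
```

Each subcommand then maps to one or two CDP calls against the running Chrome instance.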
Don't read the whole page. Read only what you need.
The real trick isn't just "text instead of screenshots." It's asking for exactly the data you need:
| Command | What it returns | Tokens |
|---|---|---|
| `screenshot` | Every pixel, layout, decorations | ~1,500 |
| `content` | All visible text on the page | ~800 |
| `readable` | Just the main content, no nav/footer | ~300 |
| `links` | Only clickable elements | ~150 |
| `axtree` | Accessibility tree — semantic structure, roles | ~100 |
| `forms` | Only input fields and labels | ~80 |
Same page. Six levels of precision. The accessibility tree is the most compact — it's the semantic structure screen readers use. On a complex page like GitHub, axtree returns ~100 tokens where content returns ~3,200. That's 32x fewer tokens for the same actionable information.
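To make that concrete, here's a hypothetical version of the flattening step behind axtree, fed with the shape of nodes that CDP's `Accessibility.getFullAXTree` returns (the skip rules are my own guess, not cdp.py's exact logic):

```python
def flatten_axtree(nodes):
    """Render Accessibility.getFullAXTree-style nodes as one compact line.

    Keeps only role + accessible name, skipping ignored and purely
    structural nodes -- this filtering is what makes the output so
    token-cheap compared to full page text.
    """
    parts = []
    for node in nodes:
        if node.get("ignored"):
            continue
        role = node.get("role", {}).get("value", "")
        name = node.get("name", {}).get("value", "")
        if role in ("", "generic", "none"):  # skip structural wrappers
            continue
        parts.append(f'[{role} "{name}"]' if name else f"[{role}]")
    return " ".join(parts)

sample = [
    {"role": {"value": "navigation"}, "name": {"value": ""}},
    {"role": {"value": "button"}, "name": {"value": "28d"}},
    {"role": {"value": "generic"}, "name": {"value": ""}},
    {"role": {"value": "heading"}, "name": {"value": "Performance"}},
]
print(flatten_axtree(sample))
# → [navigation] [button "28d"] [heading "Performance"]
```

Four raw nodes collapse into one short line the model can act on directly.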
Three more things that make it fast
Resource blocking. Run block before navigating and the browser skips images, CSS, fonts, and video. Pages load 60-80% faster because AI agents don't need visual resources.
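A plausible sketch of what block sends: CDP's `Network.setBlockedURLs` takes URL wildcard patterns (after `Network.enable`). The pattern list here is my own illustration, not necessarily cdp.py's actual list:

```python
import json

# Wildcard patterns for resources an AI agent never needs to "see".
BLOCKED_PATTERNS = [
    "*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp", "*.svg",  # images
    "*.css", "*.woff", "*.woff2", "*.ttf",                   # styles, fonts
    "*.mp4", "*.webm",                                        # video
]

def block_command(cmd_id=1):
    """Frame the CDP command that tells Chrome to skip these URL patterns."""
    return json.dumps({
        "id": cmd_id,
        "method": "Network.setBlockedURLs",
        "params": {"urls": BLOCKED_PATTERNS},
    })
```

The blocked requests simply fail fast, so the page's text and DOM still arrive intact.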
Session persistence. Run cookies_save after logging in. Next time, cookies_load restores the session — no re-authentication, no password handling, no OAuth flows. The script remembers your logins.
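Session persistence can be as simple as round-tripping the cookie list CDP gives you. A minimal sketch with a made-up sample cookie; in a real session you'd fetch the list via `Network.getAllCookies` and replay it with `Network.setCookies`:

```python
import json
import pathlib

def save_cookies(cookies, path):
    """Persist the list of cookie dicts returned by Network.getAllCookies."""
    pathlib.Path(path).write_text(json.dumps(cookies, indent=2))

def load_cookies(path):
    """Read cookies back; each dict can be replayed via Network.setCookies."""
    return json.loads(pathlib.Path(path).read_text())

# Hypothetical sample cookie, shaped like CDP's cookie objects.
sample = [{"name": "SID", "value": "abc123", "domain": ".google.com",
           "path": "/", "secure": True}]
```

The script never sees your password; it only carries the session Chrome already holds.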
Smart waits. Navigation uses Chrome's Performance metrics instead of fixed sleeps. The script detects when a page is actually ready — fast pages proceed immediately instead of waiting a full second.
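The readiness check might look like this: `Performance.getMetrics` reports `NavigationStart` and `DomContentLoaded` timestamps, and once the latter passes the former, the document has parsed. An illustrative version, not cdp.py's exact logic:

```python
def page_ready(metrics):
    """Decide readiness from a Performance.getMetrics response.

    Chrome reports NavigationStart and DomContentLoaded as timestamps;
    once DomContentLoaded moves past NavigationStart, the document has
    parsed and we can proceed -- no fixed sleep needed.
    """
    values = {m["name"]: m["value"] for m in metrics}
    nav = values.get("NavigationStart", 0)
    dcl = values.get("DomContentLoaded", 0)
    return dcl >= nav > 0

# Fake metric snapshots: mid-navigation vs. after DOMContentLoaded fired.
loading = [{"name": "NavigationStart", "value": 1000.5},
           {"name": "DomContentLoaded", "value": 0}]
done = [{"name": "NavigationStart", "value": 1000.5},
        {"name": "DomContentLoaded", "value": 1001.2}]
```

Poll this in a tight loop and a fast page costs you tens of milliseconds instead of a fixed one-second sleep.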
The bigger point
Google's Chrome team shipped an official MCP server for Chrome DevTools Protocol. Chrome 146 is adding a native settings toggle for AI agent access. The platform vendor itself is building toward structured programmatic access, not pixel interpretation.
But the real lesson isn't about CDP vs screenshots.
It's about the speed threshold.
If an AI assistant takes 3 minutes to do what you can do in 30 seconds, you'll just do it yourself. Every time. Doesn't matter how smart the AI is.
Six months of daily AI work taught me this: the bottleneck was never intelligence. It was always speed.
Build for speed first. Intelligence follows.
See the full visual breakdown comparing all three approaches.
Want to see how AI browser automation works with real tools? Book a call — I'll show you the actual workflow.