Browser Use Agent and Script Generation

This document describes the Browser Use Agent system, which generates structured JSON action plans for browser automation. The system accepts a natural language goal and optional DOM context, then uses an LLM to produce a validated, executable browser automation script. These scripts are consumed by the Browser Extension (see Browser Extension) to perform automated web interactions.

For conversational AI agent capabilities with dynamic tool selection, see React Agent Architecture. For the browser extension that executes these generated scripts, see Browser Extension.

Purpose and Architecture

The Browser Use Agent is a stateless script generation service that translates user goals into structured action sequences. Unlike the React Agent, which engages in multi-turn reasoning with tool calls, the Browser Use Agent generates a complete automation plan in a single shot.

Sources: routers/browser_use.py1-51 services/browser_use_service.py1-96 prompts/browser_use.py1-138 utils/agent_sanitizer.py1-119

Request Flow and API Contract

The Browser Use Agent exposes a single endpoint at /generate-script that accepts a GenerateScriptRequest and returns a GenerateScriptResponse.

Request Model

Field	Type	Required	Description
`goal`	string	Yes	Natural language description of the automation task
`target_url`	string	No	Starting URL for the automation (default: "")
`dom_structure`	dict	No	Parsed DOM information from the current page
`constraints`	dict	No	Additional constraints or parameters

The dom_structure dictionary, when provided, contains:

url: Current page URL
title: Page title
interactive: Array of interactive elements with attributes (tag, id, class, type, placeholder, name, ariaLabel, text)
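
As an illustration of the request shape described above, here is a minimal sketch using a standard-library dataclass. The class name `GenerateScriptRequestSketch` is hypothetical, the empty-dict defaults for `dom_structure` and `constraints` are assumptions (the table gives a default only for `target_url`), and the sample values are invented. The actual model in models/requests/agent.py may be defined differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GenerateScriptRequestSketch:
    """Hypothetical stand-in for GenerateScriptRequest; field names follow the table."""

    goal: str                      # required
    target_url: str = ""           # documented default
    dom_structure: dict[str, Any] = field(default_factory=dict)  # assumed default
    constraints: dict[str, Any] = field(default_factory=dict)    # assumed default


# Example payload; URLs and element values are made up for illustration.
request = GenerateScriptRequestSketch(
    goal="Log in with my email address",
    target_url="https://example.com/login",
    dom_structure={
        "url": "https://example.com/login",
        "title": "Sign in",
        "interactive": [
            {"tag": "input", "id": "email", "type": "email", "placeholder": "Enter email"},
            {"tag": "button", "class": "submit-btn", "type": "submit", "text": "Submit"},
        ],
    },
)
```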

Sources: models/requests/agent.py1-10

Response Model

Field	Type	Description
`ok`	bool	Whether generation succeeded
`action_plan`	dict	Structured JSON action plan (if successful)
`error`	string	Error message (if failed)
`problems`	list[string]	Validation problems (if validation failed)
`raw_response`	string	Truncated LLM response for debugging (if validation failed)

Sources: models/response/agent.py1-11

Endpoint Implementation

The router at routers/browser_use.py16-51 performs initial validation, then delegates to AgentService. The service returns a dictionary that the router transforms into a GenerateScriptResponse. The router distinguishes between validation failures (problems present) and general errors.
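
The transformation from the service's dictionary to the response fields might look roughly like the sketch below. The function name `to_response_fields` is hypothetical, and the branching is inferred from the description above rather than taken from the router's source.

```python
from typing import Any


def to_response_fields(result: dict[str, Any]) -> dict[str, Any]:
    """Map a service result dict onto GenerateScriptResponse fields (sketch)."""
    if result.get("ok"):
        return {"ok": True, "action_plan": result.get("action_plan")}

    if result.get("problems"):
        # Validation failure: keep the problems and the truncated raw response.
        return {
            "ok": False,
            "error": result.get("error"),
            "problems": result["problems"],
            "raw_response": result.get("raw_response"),
        }

    # General error: only the error message is available.
    return {"ok": False, "error": result.get("error")}
```

Keeping the two failure branches separate mirrors the distinction the router draws between validation failures and general errors.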

Sources: routers/browser_use.py16-51

Service Layer: AgentService

The AgentService class in services/browser_use_service.py implements the core script generation logic in its generate_script method.

DOM Structure Formatting

The service formats the dom_structure dictionary into a human-readable text block for the LLM prompt:

=== PAGE INFORMATION ===
URL: [url]
Title: [title]

=== INTERACTIVE ELEMENTS (N found) ===

1. input id="email" type="email" placeholder="Enter email"
   Text: [text content]

2. button class="submit-btn" type="submit"
   Text: Submit

The service limits interactive elements to 30 to avoid exceeding token limits (services/browser_use_service.py34-51). Each element displays its relevant attributes: tag, id, class, type, placeholder, name, ariaLabel, and truncated text content.
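
A minimal sketch of this formatting step follows. The function name `format_dom`, the constant names, and the 80-character text truncation are assumptions; only the 30-element limit and the output layout come from the description above. The sketch also assumes that N in the header is the number of elements actually listed.

```python
from typing import Any

MAX_ELEMENTS = 30    # limit stated in the text
MAX_TEXT_CHARS = 80  # assumed truncation length; the source does not specify one
ATTRS = ("id", "class", "type", "placeholder", "name", "ariaLabel")


def format_dom(dom: dict[str, Any]) -> str:
    """Render a dom_structure dict as a text block for the LLM prompt (sketch)."""
    elements = dom.get("interactive", [])[:MAX_ELEMENTS]
    lines = [
        "=== PAGE INFORMATION ===",
        f"URL: {dom.get('url', '')}",
        f"Title: {dom.get('title', '')}",
        "",
        f"=== INTERACTIVE ELEMENTS ({len(elements)} found) ===",
    ]
    for index, element in enumerate(elements, start=1):
        # Only attributes that are present and non-empty are shown.
        attrs = " ".join(f'{name}="{element[name]}"' for name in ATTRS if element.get(name))
        lines.append("")
        lines.append(f"{index}. {element.get('tag', 'unknown')} {attrs}".rstrip())
        text = str(element.get("text", "")).strip()
        if text:
            lines.append(f"   Text: {text[:MAX_TEXT_CHARS]}")
    return "\n".join(lines)
```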

Sources: services/browser_use_service.py22-52

Prompt Construction

The service constructs a detailed user prompt that includes:

Goal and Context: The user's goal, target URL, and constraints
DOM Information: Formatted interactive elements (if provided)
Action Type Guidance: Instructions to analyze whether the goal requires DOM actions, tab control actions, or both
Search-Specific Instructions: Critical guidance for handling search queries using direct URL construction

The prompt explicitly warns against opening chrome://newtab or about:blank and then attempting DOM actions, as these pages do not support scripting (services/browser_use_service.py53-69).
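
The four parts listed above could be assembled along these lines. This is only a sketch: `build_user_prompt` is a hypothetical name, and the instruction wording is a paraphrase of the description, not the text used in services/browser_use_service.py.

```python
import json
from typing import Any, Optional


def build_user_prompt(
    goal: str,
    target_url: str = "",
    dom_text: Optional[str] = None,
    constraints: Optional[dict[str, Any]] = None,
) -> str:
    """Assemble a user prompt from goal, context, DOM text, and guidance (sketch)."""
    sections = [
        f"Goal: {goal}",
        f"Target URL: {target_url or '(none)'}",
        f"Constraints: {json.dumps(constraints or {})}",
    ]
    if dom_text:
        # DOM information is included only when the caller provided it.
        sections.append(dom_text)
    sections.append(
        "Analyze whether the goal needs DOM actions, tab control actions, or both."
    )
    sections.append(
        "For search queries, put the full search URL in an OPEN_TAB action. "
        "Do not open chrome://newtab or about:blank and then attempt DOM actions; "
        "those pages do not support scripting."
    )
    return "\n\n".join(sections)
```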

Sources: services/browser_use_service.py53-69

LLM Invocation

The service uses a LangChain chain composition pattern:

chain = SCRIPT_PROMPT | llm
response = await chain.ainvoke({"input": user_prompt})

The SCRIPT_PROMPT is a ChatPromptTemplate that combines system instructions with the user prompt. The llm object is the global LLM instance from core.llm (services/browser_use_service.py74-77). The chain invokes the LLM asynchronously, and the service then extracts the content from the response.

Sources: services/browser_use_service.py72-77

Prompt Engineering

The SCRIPT_PROMPT in prompts/browser_use.py is a comprehensive ChatPromptTemplate that provides detailed instructions for the LLM. The prompt is structured in multiple sections:

Action Categories

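The prompt defines two distinct action categories:

DOM actions (CLICK, TYPE, SCROLL, WAIT, SELECT, EXECUTE_SCRIPT): actions that require page context
Tab control actions (OPEN_TAB, CLOSE_TAB, SWITCH_TAB, NAVIGATE, RELOAD_TAB, DUPLICATE_TAB): browser-level actions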

Sources: prompts/browser_use.py14-27 utils/agent_sanitizer.py4-17

JSON Format Examples

The prompt provides concrete examples for common scenarios:

DOM Action Example: Typing into a textarea using a specific selector
Tab Control Example: Opening a new tab with a search URL
Search Example (Preferred): Direct URL construction for search queries
Combined Example: Opening a real website followed by DOM interactions

Each example demonstrates proper JSON structure with required fields (type, selector, value, url, etc.) and optional description fields (prompts/browser_use.py28-88).
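
To make the shape of these examples concrete, the sketch below builds two action plans as Python dictionaries and serializes one to JSON. The selectors, URLs, and values are invented for illustration and are not the examples from prompts/browser_use.py; the top-level `actions` key follows the success response documented later on this page.

```python
import json

# DOM action example: type into a textarea using a specific selector.
dom_action_plan = {
    "actions": [
        {
            "type": "TYPE",
            "selector": "textarea#message",
            "value": "Hello from the agent",
            "description": "Type a message into the textarea",
        }
    ]
}

# Combined example: open a real website, then interact with its DOM.
combined_plan = {
    "actions": [
        {"type": "OPEN_TAB", "url": "https://example.com/login", "description": "Open the login page"},
        {"type": "TYPE", "selector": "#email", "value": "user@example.com", "description": "Enter the email"},
        {"type": "CLICK", "selector": "button.submit-btn", "description": "Submit the form"},
    ]
}

print(json.dumps(combined_plan, indent=2))
```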

Sources: prompts/browser_use.py28-88

Critical Rules

The prompt defines critical rules in four categories:

Rule Category	Key Points
Intent Analysis	Distinguish between tab control needs vs. DOM interaction needs
DOM Action Rules	Study DOM structure carefully; prefer IDs > data attributes > classes; never use DOM actions on `chrome://` URLs
Tab Control Rules	Specify required/optional fields for each tab control action type
Search Handling	Critical: Construct full search URL in `OPEN_TAB` action; never open blank tab then type (fails on `chrome://` pages)

The search handling rules are emphasized as critical because attempting DOM actions on chrome://newtab will fail (prompts/browser_use.py89-116).
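
The preferred search pattern, building the full search URL up front instead of typing into a blank tab, can be sketched as follows. The helper name `search_open_tab_action` and the choice of Google as the default search engine are assumptions made for the example.

```python
from urllib.parse import quote_plus


def search_open_tab_action(
    query: str,
    base: str = "https://www.google.com/search?q=",  # assumed search engine
) -> dict[str, str]:
    """Return a single OPEN_TAB action whose URL already contains the query."""
    return {
        "type": "OPEN_TAB",
        "url": base + quote_plus(query),  # spaces and special characters are URL-encoded
        "description": f"Search for {query}",
    }


action = search_open_tab_action("python asyncio tutorial")
print(action["url"])
```

Because the query is carried in the URL, the plan never needs a DOM action on a page that does not support scripting.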

Sources: prompts/browser_use.py89-119

Selector Strategy

The prompt instructs the LLM to use the provided DOM structure to craft precise selectors, with a preference hierarchy:

IDs > data attributes > specific classes > tag+type combinations

It also recommends using placeholder, name, or aria-label attributes when available for more robust selection (prompts/browser_use.py95-99).
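
The preference hierarchy is an instruction to the LLM, not code in the service, but it can be illustrated with a small sketch. The function `pick_selector` is hypothetical. It assumes that data attributes appear in the element dictionary under keys starting with `data-`, and it places name, placeholder, and aria-label after classes, an ordering the source does not specify. Attribute values are not escaped, which a real implementation would need to handle.

```python
from typing import Any, Optional


def pick_selector(element: dict[str, Any]) -> Optional[str]:
    """Choose a CSS selector for an element following the preference order (sketch)."""
    tag = element.get("tag", "")

    # 1. IDs are preferred.
    if element.get("id"):
        return f"#{element['id']}"

    # 2. Data attributes (assumed to be stored under "data-*" keys).
    for key, value in element.items():
        if key.startswith("data-") and value:
            return f'{tag}[{key}="{value}"]'

    # 3. Specific classes.
    classes = str(element.get("class", "")).split()
    if classes:
        return tag + "." + ".".join(classes)

    # 4. name, placeholder, or aria-label when available (assumed position).
    for key, css_name in (("name", "name"), ("placeholder", "placeholder"), ("ariaLabel", "aria-label")):
        if element.get(key):
            return f'{tag}[{css_name}="{element[key]}"]'

    # 5. Tag + type combination as the last resort.
    if tag and element.get("type"):
        return f'{tag}[type="{element["type"]}"]'
    return None
```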

Sources: prompts/browser_use.py95-99

Action Validation and Sanitization

The sanitize_json_actions function in utils/agent_sanitizer.py performs comprehensive validation of the LLM-generated action plan.

Validation Process

Sources: utils/agent_sanitizer.py20-96

Validation Rules

The validator enforces different rules based on action type:

Action Type	Required Fields	Additional Validation
`CLICK`, `TYPE`, `SELECT`	`selector`	`TYPE` also requires `value`
`EXECUTE_SCRIPT`	`script`	Checks for dangerous patterns: `eval(`, `new Function`, `innerHTML =`, `outerHTML =`
`OPEN_TAB`, `NAVIGATE`	`url`	-
`SWITCH_TAB`	`tabId` OR `direction`	-
`CLOSE_TAB`, `RELOAD_TAB`, `DUPLICATE_TAB`	-	Fields are optional

The validator maintains two constant lists defining valid action types:

DOM_ACTIONS (utils/agent_sanitizer.py5): Actions requiring page context
TAB_CONTROL_ACTIONS (utils/agent_sanitizer.py8-15): Browser-level actions
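
The rules in the table above can be sketched as a single validation pass. This is an illustration, not the code of sanitize_json_actions: the function name `validate_actions` is hypothetical, the list contents are taken from the action names mentioned elsewhere on this page, and the problem messages imitate the format of the documented validation failure response.

```python
from typing import Any

# Assumed contents, based on the action names listed on this page.
DOM_ACTIONS = ["CLICK", "TYPE", "SCROLL", "WAIT", "SELECT", "EXECUTE_SCRIPT"]
TAB_CONTROL_ACTIONS = [
    "OPEN_TAB", "CLOSE_TAB", "SWITCH_TAB", "NAVIGATE", "RELOAD_TAB", "DUPLICATE_TAB",
]


def validate_actions(actions: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the plan passed (sketch)."""
    problems: list[str] = []
    for i, action in enumerate(actions):
        kind = action.get("type")
        if kind not in DOM_ACTIONS and kind not in TAB_CONTROL_ACTIONS:
            problems.append(f"Action {i}: invalid type '{kind}'")
            continue
        if kind in ("CLICK", "TYPE", "SELECT") and not action.get("selector"):
            problems.append(f"Action {i}: missing 'selector' field")
        if kind == "TYPE" and "value" not in action:
            problems.append(f"Action {i}: missing 'value' field")
        if kind == "EXECUTE_SCRIPT" and not action.get("script"):
            problems.append(f"Action {i}: missing 'script' field")
        if kind in ("OPEN_TAB", "NAVIGATE") and not action.get("url"):
            problems.append(f"Action {i}: missing 'url' field")
        if kind == "SWITCH_TAB" and action.get("tabId") is None and not action.get("direction"):
            problems.append(f"Action {i}: missing 'tabId' or 'direction' field")
        # CLOSE_TAB, RELOAD_TAB, and DUPLICATE_TAB have no required fields.
    return problems
```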

Sources: utils/agent_sanitizer.py4-90

Security Checks

For EXECUTE_SCRIPT actions, the validator performs basic security checks for dangerous patterns (utils/agent_sanitizer.py63-74):

dangerous = ["eval(", "new Function", "innerHTML =", "outerHTML ="]

If any of these patterns is detected in the script, a problem is added to the validation results. This provides a basic layer of protection against code injection, though the primary security boundary is the browser extension's execution context.
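
Assuming the check is a plain substring match (the source lists the patterns but not the matching logic), it could be sketched as below. The function name `find_dangerous_patterns` is hypothetical.

```python
DANGEROUS_PATTERNS = ["eval(", "new Function", "innerHTML =", "outerHTML ="]


def find_dangerous_patterns(script: str) -> list[str]:
    """Return the dangerous patterns that occur verbatim in the script (sketch)."""
    return [pattern for pattern in DANGEROUS_PATTERNS if pattern in script]


print(find_dangerous_patterns("el.innerHTML = payload; eval(code)"))
print(find_dangerous_patterns("el.innerHTML=payload"))
```

A substring match of this kind is sensitive to whitespace: `innerHTML=payload` without a space does not match `innerHTML =`, which is one reason such a check can only serve as a basic layer of protection.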

Sources: utils/agent_sanitizer.py63-74

Response Generation and Error Handling

The service returns structured responses that the router transforms into GenerateScriptResponse objects.

Success Response

When validation succeeds, the service returns:

{
    "ok": True,
    "action_plan": {
        "actions": [
            {"type": "CLICK", "selector": "...", "description": "..."},
            # ... more actions
        ]
    }
}

Sources: services/browser_use_service.py91

Validation Failure Response

When the action plan fails validation, the service returns:

{
    "ok": False,
    "error": "Action plan failed validation.",
    "problems": [
        "Action 0: missing 'selector' field",
        "Action 2: invalid type 'INVALID_ACTION'"
    ],
    "raw_response": "[first 1000 chars of LLM response]"
}

The raw_response is truncated to 1000 characters for debugging purposes (services/browser_use_service.py83-89).
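
A sketch of how such a failure dictionary could be built, including the truncation, is shown here. The helper name `validation_failure` and its `limit` parameter are hypothetical; the field names, error message, and 1000-character limit come from the response documented above.

```python
from typing import Any


def validation_failure(problems: list[str], raw_response: str, limit: int = 1000) -> dict[str, Any]:
    """Build the validation-failure result with a truncated raw response (sketch)."""
    return {
        "ok": False,
        "error": "Action plan failed validation.",
        "problems": problems,
        "raw_response": raw_response[:limit],  # keep only the first `limit` characters
    }


result = validation_failure(["Action 0: missing 'selector' field"], "x" * 5000)
print(len(result["raw_response"]))
```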

Sources: services/browser_use_service.py82-89

Exception Response

When an exception occurs during generation, the service returns:

{
    "ok": False,
    "error": "[exception message]"
}

The service logs the full exception with traceback using logger.exception() (services/browser_use_service.py93-95).

Sources: services/browser_use_service.py93-95

Router Error Handling

The router distinguishes between different error types and returns appropriate HTTP status codes. For validation errors, however, the current implementation at routers/browser_use.py32-44 returns the GenerateScriptResponse with the error fields populated rather than raising an HTTP exception. This allows clients to programmatically access the problems array.

Sources: routers/browser_use.py20-50

Integration with Browser Extension

The generated action plans are consumed by the Browser Extension's background script, which executes the actions sequentially. The extension interprets the JSON action plan and dispatches each action to the appropriate handler.

For DOM actions (CLICK, TYPE, SCROLL, WAIT, SELECT, EXECUTE_SCRIPT), the extension uses browser.scripting.executeScript to inject and execute code in the page context. For tab control actions (OPEN_TAB, CLOSE_TAB, SWITCH_TAB, NAVIGATE, RELOAD_TAB, DUPLICATE_TAB), the extension uses browser.tabs API methods.

See Browser Extension for details on how the extension executes these generated scripts.

Sources: Based on high-level architecture diagrams; specific extension implementation details are in the Browser Extension section

LLM Provider Flexibility

The Browser Use Agent uses the global llm instance from core.llm (services/browser_use_service.py4), which is a LargeLanguageModel instance that abstracts over multiple providers. The system can use any configured provider (Google Gemini, OpenAI, Anthropic, Ollama, Deepseek, OpenRouter) without changes to the Browser Use Agent code.

See LLM Integration Layer for details on the multi-provider abstraction.

Sources: services/browser_use_service.py4 high-level architecture diagrams
