# LLM Web Scraper

### Overview

The LLM Web Scraper API scrapes a web page and passes the scraped content to a language model, which extracts and structures the data according to your prompt. The response contains both the scraped content and the structured output.

### Base URL

```
POST https://api.yetanotherapi.com/web-scrapper/
```

### Authentication

All requests require an API key passed in the `x-api-key` header.

### Request Headers

| Header       | Required | Description                 |
| ------------ | -------- | --------------------------- |
| Content-Type | Yes      | Must be `application/json`  |
| x-api-key    | Yes      | Your API authentication key |

### Request Body

```json
{
    "url": "https://example.com",
    "output_type": "plaintext", //optional
    "use_llm": true,
    "prompt": "Extract product details including name, price, and specifications",
    "openai_key_id": "752724", //optional but recommended
    "use_cache": false, //optional
    "webhook": "https://your-webhook-url.com" //optional
}
```

#### Request Parameters

| Parameter       | Type    | Required | Default   | Description                                                  |
| --------------- | ------- | -------- | --------- | ------------------------------------------------------------ |
| url             | string  | Yes      | -         | The URL of the website to scrape                             |
| output\_type    | string  | No       | plaintext | Either "plaintext" or "markdown"                             |
| use\_llm        | boolean | Yes      | -         | Must be set to true for LLM processing                       |
| prompt          | string  | Yes      | -         | Instructions for the LLM about what to extract               |
| openai\_key\_id | string  | No       | null      | Optional ID of your registered OpenAI key                    |
| use\_cache      | boolean | No       | false     | If true, returns cached result if available                  |
| webhook         | string  | No       | null      | URL to receive webhook notification when processing completes |
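
As a minimal sketch, the request described above could be assembled with Python's standard library. The helper name `build_scrape_request` and the `YOUR_API_KEY` placeholder are illustrative assumptions, not part of the API; only the endpoint, headers, and body fields come from this page.

```python
import json
import urllib.request

# Endpoint as listed under "Base URL".
API_URL = "https://api.yetanotherapi.com/web-scrapper/"


def build_scrape_request(api_key, url, prompt, output_type="plaintext",
                         openai_key_id=None, webhook=None):
    """Build (but do not send) a POST request for the scraper endpoint.

    Hypothetical helper for illustration only.
    """
    payload = {
        "url": url,
        "use_llm": True,  # documented as required for LLM processing
        "prompt": prompt,
        "output_type": output_type,
    }
    # Optional fields are included only when the caller supplies them.
    if openai_key_id is not None:
        payload["openai_key_id"] = openai_key_id
    if webhook is not None:
        payload["webhook"] = webhook

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Usage (sending requires a valid key and network access):
#   req = build_scrape_request("YOUR_API_KEY", "https://example.com",
#                              "Extract product details")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect before any call is made.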

#### Important Parameter Notes

1. **Cache Behavior**
   * When `use_cache` is `true`, all parameters other than `url` are ignored
   * Returns the most recent cached result for the URL
   * Returns a 404 error if no cached result exists
2. **OpenAI Key ID**
   * Optional parameter
   * If provided, uses the specified OpenAI key from your account
   * If not provided, uses your most recently added OpenAI key
   * Manage multiple keys through your [yetanotherapi dashboard](https://app.yetanotherapi.com/integration)
3. **Webhook**
   * Optional callback URL for asynchronous processing
   * Receives the full results when processing completes
   * Must be a publicly accessible HTTPS endpoint
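
The cache behavior above could be handled along these lines. This is a sketch under assumptions: `fetch_cached` is a hypothetical helper, the `opener` argument exists only so the call can be stubbed, and the body sends just `url` and `use_cache` because the notes say other parameters are ignored. If the API still validates the required fields on cached requests, `use_llm` and `prompt` would need to be added.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://api.yetanotherapi.com/web-scrapper/"


def fetch_cached(api_key, url, opener=urllib.request.urlopen):
    """Return the cached result for `url`, or None if the API answers 404.

    Hypothetical helper; `opener` defaults to urllib's urlopen.
    """
    # Assumption: only `url` matters when use_cache is true.
    body = json.dumps({"url": url, "use_cache": True}).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    try:
        with opener(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # documented: no cached result for this URL
            return None
        raise  # other errors are left to the caller
```

Returning `None` on a 404 lets the caller fall back to a fresh scrape without treating a cache miss as a failure.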

### Responses

#### Immediate Success Response (HTTP 200)

When processing completes within 20 seconds:

```json
{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "url": "https://example.com",
    "status": "completed",
    "timestamp": 1635545600,
    "content": {
        "text": "Extracted text content...",
        "meta": {
            "title": "Page Title",
            "description": "Meta description..."
        },
        "links": [
            {
                "text": "Link text",
                "url": "https://example.com/link",
                "type": "internal"
            }
        ],
        "images": [
            {
                "url": "https://example.com/image.jpg",
                "alt": "Image description"
            }
        ]
    },
    "llm_output": {
        // Structured JSON based on prompt
    }
}
```

#### Processing Response (HTTP 202)

When processing takes longer than 20 seconds:

```json
{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "url": "https://example.com",
    "status": "processing",
    "message": "Processing your request. Please check status later."
}
```
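
A client has to distinguish the 200 and 202 cases before reading `llm_output`. The following sketch shows one way to do that; `interpret_response` and the `state` key are illustrative names, not part of the API. This page does not describe a status endpoint, so for a 202 the sketch only records the `request_id` and message.

```python
def interpret_response(status_code, body):
    """Classify a scraper response as 'completed', 'processing', or 'error'.

    `body` is the already-decoded JSON object. Hypothetical helper.
    """
    if status_code == 200:
        # Finished within the synchronous window: results are inline.
        return {
            "state": "completed",
            "request_id": body.get("request_id"),
            "content": body.get("content"),
            "llm_output": body.get("llm_output"),
        }
    if status_code == 202:
        # Still running: keep the request_id to match later results.
        return {
            "state": "processing",
            "request_id": body.get("request_id"),
            "message": body.get("message"),
        }
    # Any other status is treated as an error response here.
    return {"state": "error", "error": body.get("error")}
```

Keeping the `request_id` from a 202 response allows a later webhook delivery to be matched to the original request.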

#### Error Response (HTTP 4XX/5XX)

```json
{
    "error": "ERROR_CODE: Error message"
}
```

### Error Codes

| Code | Description               | HTTP Status |
| ---- | ------------------------- | ----------- |
| E001 | Invalid request format    | 400         |
| E003 | Invalid URL format        | 400         |
| E004 | Authentication error      | 401         |
| E008 | Content processing failed | 500         |
| E009 | Validation error          | 400         |

### Webhook Integration

If you provide a webhook URL, the API sends a POST request containing the complete results to that URL when processing finishes:

```json
{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://example.com",
    "timestamp": 1635545600,
    "content": {
        // Scraped content
    },
    "llm_output": {
        // LLM processed data
    }
}
```
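
A receiver for this callback could look like the sketch below, built on Python's standard `http.server`. The names `parse_webhook` and `WebhookHandler`, the 204 and 400 replies, and the local port are all assumptions for illustration; this page does not specify what response the API expects from the webhook endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook(raw_body):
    """Decode a webhook body and pick out the documented top-level fields."""
    payload = json.loads(raw_body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("webhook payload must be a JSON object")
    return {
        "request_id": payload.get("request_id"),
        "status": payload.get("status"),
        "url": payload.get("url"),
        "content": payload.get("content"),
        "llm_output": payload.get("llm_output"),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: accepts the POST and logs the request ID."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_webhook(self.rfile.read(length))
        except ValueError:  # also covers JSON and UTF-8 decoding errors
            self.send_response(400)
            self.end_headers()
            return
        print("received", result["request_id"], result["status"])
        self.send_response(204)  # assumed acknowledgement status
        self.end_headers()


# Local testing only; a real endpoint must be publicly reachable over HTTPS:
#   HTTPServer(("127.0.0.1", 8000), WebhookHandler).serve_forever()
```

The parsing logic is kept in a plain function so it can be tested without starting a server.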

