LLM Web Scraper

Overview

The LLM Web Scraper API scrapes a web page and passes the scraped content through a language model, which extracts and structures the data according to a prompt you supply. The response contains both the raw scraped content and the model's structured output.

Base URL

POST https://api.yetanotherapi.com/web-scrapper/

Authentication

All requests require an API key passed in the x-api-key header.

Request Headers

| Header | Required | Description |
| --- | --- | --- |
| Content-Type | Yes | Must be application/json |
| x-api-key | Yes | Your API authentication key |

Request Body

{
    "url": "https://example.com",
    "output_type": "plaintext", // optional
    "use_llm": true,
    "prompt": "Extract product details including name, price, and specifications",
    "openai_key_id": "752724", // optional but recommended
    "use_cache": false, // optional
    "webhook": "https://your-webhook-url.com" // optional
}
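As an illustration of the request described above, here is a minimal Python sketch that uses only the standard library. The helper names `build_request` and `send` are hypothetical and not part of the API; the endpoint, header names, and body fields are taken from this page. It assumes optional fields can simply be omitted when not needed.

```python
import json
import urllib.request

API_URL = "https://api.yetanotherapi.com/web-scrapper/"


def build_request(api_key, url, prompt, output_type="plaintext",
                  openai_key_id=None, use_cache=False, webhook=None):
    """Return a urllib Request for the scraper endpoint (nothing is sent yet)."""
    payload = {
        "url": url,
        "use_llm": True,  # documented as required and set to true
        "prompt": prompt,
        "output_type": output_type,
        "use_cache": use_cache,
    }
    # Assumption: optional fields may be left out when they are not supplied.
    if openai_key_id is not None:
        payload["openai_key_id"] = openai_key_id
    if webhook is not None:
        payload["webhook"] = webhook

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return (status_code, parsed_json_body).

    urllib raises urllib.error.HTTPError for 4XX/5XX responses, so callers
    that want the error body need to catch that exception.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))
```

A call would look like `send(build_request("YOUR_API_KEY", "https://example.com", "Extract product details"))`. Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.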

Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | The URL of the website to scrape |
| output_type | string | No | plaintext | Either "plaintext" or "markdown" |
| use_llm | boolean | Yes | - | Must be set to true for LLM processing |
| prompt | string | Yes | - | Instructions for the LLM about what to extract |
| openai_key_id | string | No | null | ID of your registered OpenAI key |
| use_cache | boolean | No | false | If true, returns the cached result if one is available |
| webhook | string | No | null | URL that receives a webhook notification when processing completes |

Important Parameter Notes

Cache Behavior
- When use_cache is true, all other parameters except url are ignored
- Returns the most recent cached result for the URL
- Returns a 404 error if no cached result exists

OpenAI Key ID
- Optional parameter
- If provided, uses the specified OpenAI key from your account
- If not provided, uses your most recently added OpenAI key
- Manage multiple keys through your yetanotherapi dashboard

Webhook
- Optional callback URL for asynchronous processing
- Receives the full results when processing completes
- Must be a publicly accessible HTTPS endpoint
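
The HTTPS requirement for webhooks can be checked on the client before a request is sent. The sketch below is only a local sanity check under that assumption; `looks_like_valid_webhook` is a hypothetical helper, and it cannot confirm that the endpoint is publicly reachable.

```python
from urllib.parse import urlparse


def looks_like_valid_webhook(url):
    """Check the parts of the webhook requirement that can be verified locally."""
    parsed = urlparse(url)
    # The endpoint must use HTTPS and must name a host. Whether that host is
    # publicly reachable can only be confirmed by an actual request.
    return parsed.scheme == "https" and bool(parsed.hostname)
```

For example, `looks_like_valid_webhook("http://your-webhook-url.com")` is rejected because the scheme is not HTTPS.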

Responses

Immediate Success Response (HTTP 200)

When processing completes within 20 seconds:

{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "url": "https://example.com",
    "status": "completed",
    "timestamp": 1635545600,
    "content": {
        "text": "Extracted text content...",
        "meta": {
            "title": "Page Title",
            "description": "Meta description..."
        },
        "links": [
            {
                "text": "Link text",
                "url": "https://example.com/link",
                "type": "internal"
            }
        ],
        "images": [
            {
                "url": "https://example.com/image.jpg",
                "alt": "Image description"
            }
        ]
    },
    "llm_output": {
        // Structured JSON based on prompt
    }
}

Processing Response (HTTP 202)

When processing takes longer than 20 seconds:

{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "url": "https://example.com",
    "status": "processing",
    "message": "Processing your request. Please check status later."
}
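
A client has to distinguish the immediate result from the deferred one. The following sketch shows one way to branch on the two documented success responses; `classify_response` is a hypothetical helper, and it assumes the body has already been parsed into a dictionary.

```python
def classify_response(status_code, body):
    """Map a parsed success response to a (state, value) pair.

    Returns ("completed", llm_output) for an HTTP 200 result,
    ("processing", request_id) for an HTTP 202 result, and
    ("unexpected", body) for anything else.
    """
    if status_code == 200 and body.get("status") == "completed":
        return ("completed", body.get("llm_output"))
    if status_code == 202 and body.get("status") == "processing":
        # Keep the request_id so the result can be matched up later,
        # for example with the payload delivered to a webhook.
        return ("processing", body["request_id"])
    return ("unexpected", body)
```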

Error Response (HTTP 4XX/5XX)

{
    "error": "ERROR_CODE: Error message"
}

Error Codes

| Code | Description | HTTP Status |
| --- | --- | --- |
| E001 | Invalid request format | 400 |
| E003 | Invalid URL format | 400 |
| E004 | Authentication error | 401 |
| E008 | Content processing failed | 500 |
| E009 | Validation error | 400 |
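
Because the error body packs the code and the message into a single string, a client may want to split them apart. This is a minimal sketch, assuming the string follows the "ERROR_CODE: Error message" pattern shown above with a colon as the separator; `parse_error` and `KNOWN_ERROR_CODES` are hypothetical names.

```python
# Codes and descriptions copied from the error code table on this page.
KNOWN_ERROR_CODES = {
    "E001": "Invalid request format",
    "E003": "Invalid URL format",
    "E004": "Authentication error",
    "E008": "Content processing failed",
    "E009": "Validation error",
}


def parse_error(body):
    """Split an error body into (code, message).

    Returns (None, raw_string) when the string does not start with one of
    the documented codes, so undocumented formats are passed through intact.
    """
    raw = body.get("error", "")
    code, separator, message = raw.partition(":")
    code = code.strip()
    if not separator or code not in KNOWN_ERROR_CODES:
        return (None, raw)
    return (code, message.strip())
```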

Webhook Integration

When you provide a webhook URL, the API sends a POST request containing the complete results to that URL once processing completes:

{
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://example.com",
    "timestamp": 1635545600,
    "content": {
        // Scraped content
    },
    "llm_output": {
        // LLM processed data
    }
}
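
To illustrate the receiving side, here is a minimal webhook receiver built on Python's standard `http.server`. It is a sketch under several assumptions: the payload is a JSON object shaped like the example above, a plain 200 reply is an acceptable acknowledgement (this page does not say what the API expects), and TLS is terminated by a proxy in front of this server, since the webhook must be an HTTPS endpoint. The names `extract_result`, `WebhookHandler`, and `serve` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_result(raw_body):
    """Parse a webhook payload and return (request_id, status, llm_output)."""
    payload = json.loads(raw_body)
    return payload["request_id"], payload["status"], payload.get("llm_output")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            request_id, status, _llm_output = extract_result(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a payload missing the expected fields.
            self.send_response(400)
            self.end_headers()
            return
        print(f"{request_id}: {status}")
        # Assumption: an empty 200 reply is enough to acknowledge delivery.
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Run the receiver until interrupted (plain HTTP; put TLS in front of it)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver. Parsing lives in its own function so it can be exercised without opening a socket.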
