
2 posts tagged with "AI Tools"


Lightfeed Extract Now Open Source

· 4 min read
Lightfeed Team

We're excited to announce that we've open-sourced lightfeed-extract, our LLM-powered web data extraction library, which has processed over 10 million records in production.

Along with this milestone, we are introducing several new features on the Lightfeed platform that let you monitor value changes, extract with greater accuracy, and receive timely notifications.

Open Sourcing Lightfeed Extract

While working with LLMs for structured web data extraction, we encountered challenges with invalid JSON and broken links in the output. This led us to build a robust library focused on reliable extraction and enrichment with features like:

  • Clean HTML conversion: transforms HTML into LLM-friendly markdown with main content extraction
  • LLM structured output: leverages the latest LLMs, like Gemini 2.5 Flash, to balance accuracy and cost
  • JSON sanitization: recovers and fixes imperfect LLM outputs to match your schema
  • URL validation: automatically handles relative URLs, removes invalid ones, and repairs markdown-escaped links
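As a rough illustration of the URL validation bullet, the sketch below resolves relative links against the page URL, undoes markdown escaping, and drops anything that is not a usable web link. It is a minimal sketch under stated assumptions, not the library's implementation: `repairUrl` and `repairUrls` are hypothetical names, and the set of escaped characters handled is a guess.

```typescript
// Hypothetical helpers for illustration; not part of the lightfeed-extract API.

// Returns an absolute http(s) URL, or null when the input cannot be repaired.
function repairUrl(raw: string, sourceUrl: string): string | null {
  // Undo markdown escaping such as "\_" or "\(" that can leak into LLM output.
  const unescaped = raw.trim().replace(/\\([_*()\[\]~#&-])/g, "$1");
  if (unescaped === "") return null;
  try {
    // The URL constructor resolves relative paths against the base URL.
    const resolved = new URL(unescaped, sourceUrl);
    // Keep only web links; drop schemes such as mailto: or javascript:.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }
    return resolved.href;
  } catch {
    // Not parseable even relative to the base URL.
    return null;
  }
}

// Repairs every URL in a list and removes the ones that stay invalid.
function repairUrls(urls: string[], sourceUrl: string): string[] {
  return urls
    .map((u) => repairUrl(u, sourceUrl))
    .filter((u): u is string => u !== null);
}
```

Resolving against the source URL first means a link like "/about" survives as an absolute URL instead of being rejected by a strict URL check.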

GitHub: github.com/lightfeed/lightfeed-extract

// Example usage
import { extract, ContentFormat } from "lightfeed-extract";
import { z } from "zod";

// Define your schema. We will run one more sanitization process to
// recover imperfect, failed, or partial LLM outputs into this schema
const schema = z.object({
  title: z.string(),
  author: z.string().optional(),
  tags: z.array(z.string()),
  // URLs get validated automatically
  links: z.array(z.string().url()),
  summary: z
    .string()
    .describe("A brief summary of the article content within 500 characters"),
});

// Run the extraction.
// htmlString is the raw HTML of the page, fetched beforehand.
const result = await extract({
  content: htmlString,
  format: ContentFormat.HTML,
  schema,
  sourceUrl: "https://example.com/article",
  googleApiKey: "your-google-gemini-api-key",
});

console.log(result.data);
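To give a feel for what recovering imperfect LLM output can involve, here is a minimal sketch that pulls a JSON value out of a chatty response and tolerates trailing commas. It is an assumption-laden illustration, not how lightfeed-extract sanitizes output: `parseLenientJson` is a hypothetical name, and the trailing-comma cleanup is naive (it would also alter a comma followed by a closing bracket inside a string value).

```typescript
// Hypothetical helper for illustration; not part of the lightfeed-extract API.
function parseLenientJson(output: string): unknown {
  // Keep only the span from the first "{" or "[" to the last "}" or "]".
  // This discards any explanatory text or code fences around the JSON.
  const start = output.search(/[{\[]/);
  const end = Math.max(output.lastIndexOf("}"), output.lastIndexOf("]"));
  if (start === -1 || end < start) return null;
  let text = output.slice(start, end + 1);

  // Remove trailing commas before a closing brace or bracket (naive).
  text = text.replace(/,\s*([}\]])/g, "$1");

  try {
    return JSON.parse(text);
  } catch {
    // Still not valid JSON; the caller decides whether to retry or give up.
    return null;
  }
}
```

A fuller pipeline would then validate the parsed value against the schema and drop or repair individual fields that fail, rather than rejecting the whole record.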

Value History Tracking

Our new Value History Tracking feature makes it easy to spot changes in your data at a glance. When viewing extraction results in tables or lists, any field that's been updated is automatically highlighted with visual indicators showing both the current value and what it changed from.

This is particularly useful for monitoring price fluctuations, inventory changes, or any time-sensitive data.

Value History Tracking showing highlighted changes between extractions
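Conceptually, change highlighting comes down to comparing each field of a record against its value from the previous extraction. The sketch below shows one way to compute that comparison. It is illustrative only: the `Row` and `FieldChange` types and the `diffFields` function are hypothetical and say nothing about how the Lightfeed platform stores or compares values.

```typescript
// Hypothetical types and helper for illustration; not the Lightfeed implementation.
type Value = string | number | boolean | null;
type Row = Record<string, Value>;

interface FieldChange {
  field: string;
  previous: Value | undefined; // undefined: field was absent in the earlier run
  current: Value | undefined;  // undefined: field is absent in the latest run
}

// Lists every field whose value differs between two extractions of the same record.
function diffFields(previous: Row, current: Row): FieldChange[] {
  // Union of field names from both runs, without duplicates.
  const fields = Object.keys(previous).concat(
    Object.keys(current).filter((k) => !(k in previous))
  );
  return fields
    .filter((f) => previous[f] !== current[f])
    .map((f) => ({ field: f, previous: previous[f], current: current[f] }));
}

// Usage: a price drop between two runs yields a single change entry for "price".
const changes = diffFields(
  { name: "Widget", price: 19.99, inStock: true },
  { name: "Widget", price: 17.49, inStock: true }
);
console.log(changes);
```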

List Mode vs Detail Mode

When adding a website to Lightfeed, you can now specify whether you're extracting a list of items or a single detailed item. This simple choice guides our AI to produce more accurate results by providing clear context to the LLM about the extraction task.

List Mode optimizes extraction for pages containing multiple similar items (like product listings, search results, or directories), helping maintain consistent structure across all entries and preventing the LLM from getting distracted by unrelated page elements.

Detail Mode is designed for pages focused on a single item (such as product pages, company profiles, or articles), instructing the LLM to capture comprehensive information including nested details and relationships that might be missed otherwise.

By selecting the appropriate mode for your target page, you'll achieve more accurate extractions without writing custom logic, as the LLM receives precise guidance about what to look for and how to structure the output.

Learn more about List Mode and Detail Mode in our documentation →

Email Notifications

You can now choose to receive extraction results directly in your inbox with three flexible options:

  1. Receive only new items that appear for the first time
  2. Get both new and changed items
  3. Receive the complete dataset of each extraction run

This makes it simple to monitor exactly what matters to you — whether you're tracking new listings, want to know when prices change, or need a complete record of each extraction.
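The three options can be thought of as filters over the latest extraction run. The sketch below is a hypothetical model of that selection, assuming every item carries a stable `id` that links it to the same item in the previous run; the `NotifyMode` names, the `Item` shape, and `selectItemsToEmail` are all invented for illustration and are not Lightfeed's API.

```typescript
// Hypothetical model for illustration; names and shapes are assumptions.
type NotifyMode = "new_only" | "new_and_changed" | "all";

interface Item {
  id: string; // assumed stable identifier across extraction runs
  [field: string]: string | number | boolean | null;
}

function selectItemsToEmail(
  previousRun: Item[],
  currentRun: Item[],
  mode: NotifyMode
): Item[] {
  // Option 3: the complete dataset of the run.
  if (mode === "all") return currentRun;

  // Index the previous run by id for quick lookup.
  const previousById: { [id: string]: Item } = {};
  for (const item of previousRun) previousById[item.id] = item;

  return currentRun.filter((item) => {
    const before = previousById[item.id];
    // Options 1 and 2: items seen for the first time are always included.
    if (before === undefined) return true;
    if (mode === "new_only") return false;
    // Option 2: also include items where any field was changed, added, or removed.
    return (
      Object.keys(item).some((k) => item[k] !== before[k]) ||
      Object.keys(before).some((k) => !(k in item))
    );
  });
}
```

Given a previous run with items a and b, and a current run where b has a new price and c is new, "new_only" selects c, "new_and_changed" selects b and c, and "all" selects every item.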

Learn more about Email Notifications in our documentation →

Join Our Community

We're excited to see how you'll use Lightfeed in your projects, so don't hesitate to reach out!

Introducing Lightfeed Extract

· 3 min read
Lightfeed Team

We're thrilled to launch Lightfeed Extract — a powerful, business-grade web data extraction tool that turns any website into clean, structured, and up-to-date data — all from a simple prompt.


Say goodbye to custom scrapers, brittle workflows, and writing code. Lightfeed handles the heavy lifting, and even better — we keep your data fresh in a continuously maintained, queryable database.

The Web Data Challenge

If you need clean, structured data from websites, whether for tracking competitors, monitoring pricing trends, extracting business intelligence, training AI models, or powering applications, you're probably familiar with the limitations of existing tools:

Common Extraction Pain Points

  • Manual Scraping and Maintenance: Traditional scrapers require custom code for each website and break when layouts change, forcing teams to constantly rewrite and fix code instead of focusing on business goals.

  • Limited Extraction Depth: Most tools only extract data from specified URLs, missing critical information buried in subpages and linked content.

  • No Integrated Database: Most scrapers don't provide a persistent database. Every data request triggers a slow, repeated crawl instead of a fast query, which makes it impossible to track changes, search historic data, or quickly find relevant information.

  • Data Quality Issues: Raw extracted data requires significant post-processing to clean, normalize, and deduplicate, which adds engineering complexity and introduces potential errors.

  • Anti-Scraping Measures: Modern websites implement various protection mechanisms, including CAPTCHAs, request throttling, and automated bot detection, making reliable data collection increasingly challenging.

The Lightfeed Solution

Lightfeed transforms how organizations extract and maintain clean, structured and up-to-date web data at scale. Our platform leverages Large Language Models (LLMs) and AI agents that can read, understand and interact with website content, making data extraction reliable and fully automated.

Key Benefits

Adaptive AI Extraction

Extract data from any website using simple natural language instructions, without writing code. Extraction adapts automatically when a website changes.

Deep Content Discovery and Enrichment

Automatically extract data from linked pages and subpages, while enriching information from multiple sources and third-party websites to create comprehensive datasets.

Fast Database Access

Access consistently up-to-date structured data through instant queries instead of slow crawling, with built-in AI search capabilities to track changes and find the most relevant information.

Automated Data Processing

Get clean, normalized data with automatic deduplication and formatting.

Reliable Scraping

Extract data consistently even from the hardest websites—solving CAPTCHAs automatically and using premium proxies to bypass anti-bot measures.

Getting Started with Lightfeed Extract

Ready to transform how you extract structured data from the web?