Parsing News Data from HTML Using Groq: A Step-by-Step Guide

In the era of big data, extracting structured information from unstructured web content is a crucial skill. Large Language Models (LLMs), such as those provided by Groq, have revolutionized the way we handle and process textual data. These models can understand context, generate human-like text, and extract specific information with impressive accuracy. In this blog, we’ll explore how to leverage Groq’s capabilities alongside the LXML library to parse news articles and product details from HTML pages.

Why Use LLMs and LXML for Data Extraction?

Groq serves robust text generation and completion models capable of understanding and manipulating language with a high degree of accuracy. These models can be tuned at inference time with parameters such as temperature, seed, and top-p to control the randomness and diversity of the generated text. The LXML library, renowned for its efficient HTML and XML parsing capabilities, complements these models by providing the tools to clean and structure raw HTML content.

Key Parameters for Controlling LLM Output

  • Temperature: Controls the randomness of the text generation. Lower values (closer to 0) make the output more deterministic, while higher values (closer to 1) increase randomness.
  • Seed: Ensures reproducibility of the generated text by setting a specific starting point for the random number generator.
  • Top-p: Controls the diversity of the generated text by setting a probability threshold for token selection. Higher values result in more diverse outputs.

These parameters let us tune the model’s output to meet specific needs, ensuring accurate and consistent results; the sketch below shows the values this guide uses.
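
For reference, here is how those settings might be grouped before a call to Groq’s chat completions endpoint. This is a minimal sketch; the numbers are the defaults used later in the parse_anything function:

# Sampling settings reused later in this guide; values are illustrative defaults.
generation_params = {
    "temperature": 0.2,  # low randomness: favor deterministic extraction
    "seed": 10,          # fixed seed so repeated runs produce the same text
    "top_p": 0.8,        # sample only from the smallest token set covering 80% of probability mass
}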

Setting Up a Groq Account and Generating an API Key

Before diving into the code, you need to set up a Groq account and generate an API key. Here’s how you can do it:

🔸Sign Up for Groq:

  • Visit the Groq website at console.groq.com.
  • Click on the “Sign Up” button and fill in the required details to create an account.

🔸Generate an API Key:

  • Once logged in, navigate to the API Keys section in your account dashboard.
  • Click on “Generate New API Key”.
  • Copy the generated API key and store it securely; you will use this key to authenticate your API requests (see the snippet below for one way to load it).
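
A common way to keep the key out of your source code is to load it from an environment variable. Here’s a minimal sketch, assuming you’ve exported it as GROQ_API_KEY (which is also the variable the Groq Python client checks by default):

import os

# Assumes you exported the key first, e.g.: export GROQ_API_KEY="your_groq_api_key"
api_key = os.environ["GROQ_API_KEY"]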

Setting Up the Environment

First, make sure you have the necessary libraries installed. You can install them using pip:

pip install groq lxml scrapy requests

Data Points for Extraction

We define specific data points to extract from news articles and product listings. Here are the dictionaries outlining our target data points:

news_data_points = {
    "news_title": "Title of the News",
    "news_date": "Date of the News format(mm/dd/yyyy)",
    "news_content": "Extract all text located below the news title",
    "news_section": "Category of the News",
    "news_tags": "Tags of the news, provided as an array",
}

product_data_points = {
    "product_name": "Name of the Product",
    "price": "Price of the Product",
    "description": "Description of the Product",
    "category": "Category of the Product",
    "image_urls": ["URL1", "URL2", "..."],
}

These dictionaries serve as blueprints for the data we aim to extract.
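
For example, for a news page the model would ideally respond with JSON shaped like this (hypothetical values, shown only to illustrate the target structure):

{
    "news_title": "City Council Approves New Park",
    "news_date": "05/21/2024",
    "news_content": "The council voted unanimously to fund the project...",
    "news_section": "Local News",
    "news_tags": ["city council", "parks"]
}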

Removing HTML Tags

To ensure we work with clean text, we first strip out scripts, styles, and comments. Here’s a function that uses LXML’s Cleaner class for this purpose (note: newer lxml releases ship the clean module as the separate lxml_html_clean package, so install that as well if the import below fails):

from lxml.html.clean import Cleaner

def remove_html_tags(text: str) -> str:
    """
    Strips scripts, styles, comments, and unknown tags from the given HTML.

    Args:
        text (str): The input text containing HTML tags.

    Returns:
        str: HTML with scripts, styles, and comments removed.
    """
    cleaner = Cleaner()
    cleaner.javascript = True          # Remove JavaScript behaviour (e.g. onclick attributes)
    cleaner.scripts = True             # Remove <script> elements
    cleaner.style = True               # Remove styles and stylesheets
    cleaner.inline_style = True        # Remove style attributes
    cleaner.comments = True            # Remove HTML comments
    cleaner.remove_unknown_tags = True
    return cleaner.clean_html(text)
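
A quick sanity check on a hand-written snippet (hypothetical input, purely for illustration):

sample_html = "<html><body><h1>Hello</h1><script>alert('x')</script><!-- note --></body></html>"
print(remove_html_tags(sample_html))
# The <script> element and the comment are stripped; the structural tags remain.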

Generating Prompts for Groq

Next, we need a function to generate prompts for Groq’s text completion models. This involves crafting a prompt that guides the model to extract the desired data points from the given text.

from scrapy.selector import Selector

def prompt_generation(response_text: str, flag: str) -> str:
    """
    Generates a prompt for text completion based on the given response text.

    Args:
        response_text (str): The input text to generate completion from.
        flag (str): A flag indicating the type of data ("news" or "product").

    Returns:
        str: The generated prompt for text completion.
    """
    response_text = remove_html_tags(response_text)
    response_selector = Selector(text=response_text)
    # Flatten the cleaned HTML to plain text by taking the string value of <body>.
    response_str = response_selector.xpath("string(//body)").get().strip()
    if flag == "product":
        data_points = product_data_points
    elif flag == "news":
        data_points = news_data_points
    else:
        raise ValueError(f"Unknown flag: {flag!r}; expected 'news' or 'product'.")
    prompt = (
        f"Given response must be in JSON format with the following data points: {data_points}. "
        f"Extract the following details from the provided text: {response_str}. "
        f"If any of the specified fields are not present in the text, return 'None' for that field. "
        f"If there are spaces in the key names, replace them with underscores."
    )
    return prompt
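
To see exactly what the model will receive, you can print the prompt for a small hand-written page (hypothetical HTML, for illustration only):

sample_page = """
<html><body>
  <h1>City Council Approves New Park</h1>
  <p>Published 05/21/2024 in Local News.</p>
  <p>The council voted unanimously to fund the project.</p>
</body></html>
"""
print(prompt_generation(sample_page, flag="news"))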


Parsing Data with Groq

Finally, let’s put everything together in a function that uses Groq’s API to generate text completions based on our prompts.

from groq import Groq
import groq

def parse_anything(
    api_key: str,
    response_text: str,
    flag: str = "news",
    temperature: float = 0.2,
    seed: int = 10,
    model: str = "mixtral-8x7b-32768"
) -> str:
    """
    Generates text completion based on the given response text using the specified model.

    This function provides access to various models for processing text data. By default, it utilizes the 'mixtral-8x7b-32768' model.

    Available models:
        - 'gemma-7b-it'
        - 'llama2-70b-4096'
        - 'llama3-70b-8192'
        - 'llama3-8b-8192'
        - 'mixtral-8x7b-32768'

    Args:
        api_key (str): The API key for authentication (required).
        response_text (str): The input text to generate completion from.
        flag (str): The flag indicating the type of data (default is "news").
        temperature (float): Controls the randomness of the generation process (default is 0.2).
        seed (int): Seed value to ensure reproducibility of generated text (default is 10).
        model (str): The model to use for text completion (default is "mixtral-8x7b-32768").

    Returns:
        str: The assistant message serialized as a JSON string.
    """
    prompt = prompt_generation(response_text=response_text, flag=flag)

    try:
        client = Groq(api_key=api_key)
        chat_completion = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            temperature=temperature,
            seed=seed,
            top_p=0.8,  # Probability threshold for token selection; controls diversity.
        )
        return chat_completion.choices[0].message.model_dump_json()
    except groq.APIConnectionError as e:
        print("The server could not be reached")
        print(e.__cause__)
    except groq.RateLimitError:
        print("A 429 status code was received; we should back off a bit.")
    except groq.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
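
Because parse_anything returns the assistant message serialized as a JSON string rather than a plain dictionary, you typically decode twice: once for the message wrapper and once for the JSON the model produced. A minimal sketch, assuming the model complied with the JSON-only instruction:

import json

# 'html_content' is assumed to hold the raw HTML of a news page (fetched as in the next section).
raw = parse_anything(api_key="your_groq_api_key", response_text=html_content, flag="news")
message = json.loads(raw)                   # the assistant message: role + content
extracted = json.loads(message["content"])  # the extracted data points as a dict
print(extracted.get("news_title"))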

Running the Function with a Sample News URL

To demonstrate how to use the parse_anything function, we’ll create another script that fetches HTML content from a news URL and passes it to our parsing function.

The requests library was already installed earlier alongside the other dependencies. Create a new Python file, for example, run_parser.py:

import requests
from your_script_name import parse_anything

# Replace with your actual API key
API_KEY = 'your_groq_api_key'

def fetch_html(url: str) -> str:
    """
    Fetches HTML content from the given URL.

    Args:
        url (str): The URL of the webpage to fetch.

    Returns:
        str: The HTML content of the webpage.
    """
    response = requests.get(url)
    response.raise_for_status()  # Raise an HTTPError for any non-2xx status
    return response.text

def main():
    # Sample news URL
    url = "https://example-news-website.com/sample-a"

    # Fetch HTML content
    html_content = fetch_html(url)

    # Parse the content
    parsed_data = parse_anything(
        api_key=API_KEY,
        response_text=html_content,
        flag="news"
    )

    print(parsed_data)

if __name__ == "__main__":
    main()

Conclusion

This guide showed how to extract structured data from HTML by pairing Groq’s language models with LXML’s parsing. Cleaning the markup, generating a targeted prompt, and requesting a JSON completion turns raw news articles and product pages into structured records, while parameters like temperature, seed, and top-p keep the output deterministic and reproducible. With the account setup steps and working examples above, you can automate this kind of data extraction in your own projects.
