Extracting structured information from unstructured web content is a core skill for anyone working with data at scale. Large Language Models (LLMs), such as the open models served through Groq’s API, can read text in context and pull out specific fields without hand-written rules for every page layout. In this blog, we’ll explore how to combine Groq’s API with the LXML library to parse news articles and product details from HTML pages.
Groq serves text generation and chat completion models that handle language with a high degree of accuracy. Their output can be adjusted at request time with sampling parameters: temperature controls how random the generated text is, top-p restricts sampling to the most probable tokens, and seed makes repeated runs more reproducible. The LXML library, known for its efficient HTML and XML parsing, complements these models by providing the tools to clean and structure raw HTML content.
Choosing conservative values for these parameters (a low temperature and a fixed seed) helps keep the extracted output accurate and consistent across runs.
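To make the effect of temperature and top-p concrete, here is a small, library-free sketch over a toy distribution. The logits and the helper names `apply_temperature` and `top_p_filter` are made up for illustration; this shows the general idea of these sampling controls, not how Groq implements them internally.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens sum to 1
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
cold = apply_temperature(toy_logits, 0.2)  # nearly all mass on the top token
warm = apply_temperature(toy_logits, 1.0)  # mass spread across several tokens

print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
print(sorted(top_p_filter(warm, 0.8)))  # indices of tokens that survive top-p
```

With a low temperature the first token dominates, which is why extraction tasks usually favor small values.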
Before diving into the code, you need to set up a Groq account and generate an API key. Here’s how you can do it:
🔸Sign Up for Groq:
🔸Generate an API Key:
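First, make sure you have the necessary libraries installed. You can install them using pip. From lxml 5.2 onward, the Cleaner class used below lives in the separate lxml_html_clean package, so it is included here:
pip install groq lxml lxml_html_clean scrapy requests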
We define specific data points to extract from news articles and product listings. Here are the dictionaries outlining our target data points:
news_data_points = {
    "news_title": "Title of the news",
    "news_date": "Date of the news in mm/dd/yyyy format",
    "news_content": "Extract all text located below the news title",
    "news_section": "Category of the news",
    "news_tags": "Tags of the news, provided as an array",
}

product_data_points = {
    "product_name": "Name of the Product",
    "price": "Price of the Product",
    "description": "Description of the Product",
    "category": "Category of the Product",
    "image_urls": ["URL1", "URL2", "..."],
}
These dictionaries serve as blueprints for the data we aim to extract.
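Because the dictionaries define the expected keys, they can also be used to sanity-check whatever the model returns. The sketch below is one possible approach; `check_fields`, `example_data_points`, and `model_output` are hypothetical names and sample values, not part of the original script.

```python
def check_fields(parsed: dict, data_points: dict) -> dict:
    """Compare the keys of a parsed model reply against the blueprint dictionary."""
    expected = set(data_points)
    received = set(parsed)
    return {
        "missing": sorted(expected - received),      # keys the model left out
        "unexpected": sorted(received - expected),   # keys the model invented
    }


# Shortened blueprint and a made-up model reply, for illustration only
example_data_points = {
    "news_title": "Title of the news",
    "news_date": "Date of the news in mm/dd/yyyy format",
}
model_output = {"news_title": "Sample headline", "author": "Jane Doe"}

report = check_fields(model_output, example_data_points)
print(report)
```

A non-empty "missing" list is a reasonable signal to retry the request or flag the page for review.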
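To ensure we work with clean text, we first strip scripts, styles, and comments from the page. Here’s a function that uses LXML’s Cleaner class for this purpose. Note that the result is still HTML; the remaining tags are removed in the next step, when we extract the body text:
from lxml.html.clean import Cleaner  # on lxml >= 5.2 this needs the lxml_html_clean package


def remove_html_tags(text: str) -> str:
    """
    Removes scripts, styles, and comments from the given HTML.

    The result is still HTML; the remaining tags are stripped later,
    when the body text is extracted.

    Args:
        text (str): The raw HTML.

    Returns:
        str: The cleaned HTML.
    """
    cleaner = Cleaner()
    cleaner.javascript = True  # Remove JavaScript
    cleaner.style = True  # Remove styles and stylesheets
    cleaner.remove_unknown_tags = True
    cleaner.scripts = True  # Remove <script> tags
    cleaner.comments = True  # Remove HTML comments
    cleaner.inline_style = True  # Remove style="..." attributes
    return cleaner.clean_html(text)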
Next, we need a function to generate prompts for Groq’s text completion models. This involves crafting a prompt that guides the model to extract the desired data points from the given text.
from scrapy.selector import Selector


def prompt_generation(response_text: str, flag: str) -> str:
    """
    Generates a prompt for text completion based on the given response text.

    Args:
        response_text (str): The raw HTML to extract data from.
        flag (str): A flag indicating the type of data ("news" or "product").

    Returns:
        str: The generated prompt for text completion.

    Raises:
        ValueError: If flag is neither "news" nor "product".
    """
    if flag == "product":
        data_points = product_data_points
    elif flag == "news":
        data_points = news_data_points
    else:
        raise ValueError(f"Unsupported flag: {flag!r} (expected 'news' or 'product')")

    # Clean the HTML, then flatten the <body> into plain text
    response_text = remove_html_tags(response_text)
    response_selector = Selector(text=response_text)
    response_str = response_selector.xpath("string(//body)").get().strip()

    return (
        f"Given response must be in JSON format with the following data points: {data_points}. "
        f"Extract the following details from the provided text: {response_str}. "
        "If any of the specified fields are not present in the text, return 'None' for that field. "
        "If there are spaces in the key names, replace them with underscores."
    )
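The prompt embeds the full page text, so a very long page can exceed the model’s context window. A rough guard is to cap the text before building the prompt. The sketch below is an assumption-heavy illustration: `truncate_for_context` is a hypothetical helper, the 32768-token default is taken from the model name used later in this post, and four characters per token is only a rule of thumb, not an exact tokenizer count.

```python
def truncate_for_context(
    text: str,
    max_tokens: int = 32768,      # assumed context window of the model
    reserved_tokens: int = 2000,  # assumed room for instructions and the reply
    chars_per_token: int = 4,     # rough heuristic, not a real tokenizer
) -> str:
    """Cap text at an approximate character budget, cutting at a word boundary."""
    budget = max(0, (max_tokens - reserved_tokens) * chars_per_token)
    if len(text) <= budget:
        return text
    cut = text[:budget]
    last_space = cut.rfind(" ")
    # Back up to the last space so a word is not split in half
    return cut[:last_space] if last_space > 0 else cut


short_text = "A short article body."
print(truncate_for_context(short_text) == short_text)
print(truncate_for_context("alpha beta gamma", max_tokens=3, reserved_tokens=0))
```

For precise limits you would count tokens with the tokenizer of the model you actually call.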
Finally, let’s put everything together in a function that uses Groq’s API to generate text completions based on our prompts.
from typing import Optional

import groq
from groq import Groq


def parse_anything(
    api_key: str,
    response_text: str,
    flag: str = "news",
    temperature: float = 0.2,
    seed: int = 10,
    model: str = "mixtral-8x7b-32768",
) -> Optional[str]:
    """
    Sends the extraction prompt to Groq and returns the model's reply.

    By default, it uses the 'mixtral-8x7b-32768' model. Models available when this
    was written (availability changes over time, so check Groq's current model list):
    - 'gemma-7b-it'
    - 'llama2-70b-4096'
    - 'llama3-70b-8192'
    - 'llama3-8b-8192'
    - 'mixtral-8x7b-32768'

    Args:
        api_key (str): The API key for authentication (Required).
        response_text (str): The raw HTML to extract data from.
        flag (str): The type of data, "news" or "product" (default is "news").
        temperature (float): Controls the randomness of the generation process
            (default is 0.2).
        seed (int): Seed value to make the generated text more reproducible (default is 10).
        model (str): The model to use for text completion (default is
            "mixtral-8x7b-32768").

    Returns:
        Optional[str]: The assistant message serialized as a JSON string,
            or None if the API call failed.
    """
    prompt = prompt_generation(response_text=response_text, flag=flag)
    try:
        client = Groq(api_key=api_key)
        chat_completion = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            temperature=temperature,
            seed=seed,
            top_p=0.8,  # Nucleus sampling: only the most probable tokens (80% of the mass) are considered
        )
        return chat_completion.choices[0].message.model_dump_json()
    except groq.APIConnectionError as e:
        print("The server could not be reached")
        print(e.__cause__)
    except groq.RateLimitError:
        print("A 429 status code was received; we should back off a bit.")
    except groq.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
    return None  # Reached only when one of the errors above was handled
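The function returns the assistant message as a JSON string, and the extracted fields sit inside that message’s content, sometimes surrounded by extra prose. A minimal sketch for recovering them follows; `extract_json` and `sample_message` are hypothetical, and the sample only imitates the shape of a reply rather than reproducing real model output.

```python
import json
import re


def extract_json(content: str) -> dict:
    """Pull the first JSON object out of a model reply and map 'None' strings to None."""
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model reply")
    data = json.loads(match.group(0))
    # The prompt asks for the string 'None' when a field is absent
    return {key: (None if value == "None" else value) for key, value in data.items()}


# Made-up message JSON shaped like the string returned by parse_anything
sample_message = json.dumps({
    "role": "assistant",
    "content": 'Here is the result: {"news_title": "Sample", "news_date": "None"} Hope this helps.',
})

reply_text = json.loads(sample_message)["content"]
fields = extract_json(reply_text)
print(fields)
```

The regular expression is deliberately simple: it spans from the first opening brace to the last closing brace, so a reply containing several separate JSON objects would need stricter handling.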
To demonstrate how to use the parse_anything function, we’ll create another script that fetches HTML content from a news URL and passes it to our parsing function.
The requests library was already installed with the pip command above.
Next, create a new Python file, for example, run_parser.py:
import requests
from your_script_name import parse_anything

# Replace with your actual API key
API_KEY = 'your_groq_api_key'


def fetch_html(url: str) -> str:
    """
    Fetches HTML content from the given URL.

    Args:
        url (str): The URL of the webpage to fetch.

    Returns:
        str: The HTML content of the webpage.

    Raises:
        requests.HTTPError: If the server responds with a 4xx or 5xx status.
    """
    response = requests.get(url, timeout=30)  # Avoid hanging forever on a slow server
    response.raise_for_status()
    return response.text


def main():
    # Sample news URL
    url = "https://example-news-website.com/sample-a"
    # Fetch HTML content
    html_content = fetch_html(url)
    # Parse the content
    parsed_data = parse_anything(
        api_key=API_KEY,
        response_text=html_content,
        flag="news"
    )
    print(parsed_data)


if __name__ == "__main__":
    main()
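Hard-coding the API key in a script makes it easy to leak through version control. One common alternative is to read it from an environment variable. The helper below is a sketch; `load_api_key` is a hypothetical name, and `GROQ_API_KEY` is assumed here as the variable name.

```python
import os


def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the parser")
    return key


# Demonstration with a throwaway variable name so no real key is touched
os.environ["EXAMPLE_GROQ_KEY"] = "  demo-key  "
print(load_api_key("EXAMPLE_GROQ_KEY"))
```

In run_parser.py, the placeholder assignment could then be replaced with a call to this helper.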
This guide showed how to extract structured data from HTML using Groq and LXML. LXML cleans the page, a prompt lists the fields to extract, and a Groq-hosted model returns them as JSON, which lets you parse both news articles and product details with the same code. A low temperature and a fixed seed keep the output consistent between runs. Once you have a Groq account and an API key, you can adapt the data-point dictionaries to your own pages and automate the extraction.