Beautiful Soup Tutorial: Simplifying Web Scraping with Python

January 09, 2024 · 2 min read

Web scraping can seem daunting, but with Python and its libraries like Beautiful Soup, it becomes much more approachable. This guide focuses on using Beautiful Soup to parse HTML and XML documents, essential for web scraping. We'll cover the basics, from installation to parsing a sample HTML file, and even delve into handling dynamic pages with Selenium.

What is Data Parsing?

Data parsing transforms raw data (like HTML) into a more readable and analyzable format. It's a key step in web scraping, turning the complex web page structure into something we can understand and work with.
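As a minimal sketch of what parsing means in practice, the snippet below (using an illustrative HTML fragment, not from the tutorial) turns raw markup into plain, readable text:

```python
from bs4 import BeautifulSoup

# Raw data: a small HTML fragment as a string.
raw = "<p>Price: <b>42</b> USD</p>"

# Parsing: Beautiful Soup builds a structured tree from the markup...
soup = BeautifulSoup(raw, "html.parser")

# ...from which we can pull out just the human-readable content.
print(soup.get_text())  # Price: 42 USD
```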

Understanding Parsers and Beautiful Soup

A parser sifts through HTML, picking out relevant information based on set rules. Beautiful Soup, a Python library, excels at this. It creates a parse tree from web pages, making it easier to extract, navigate, and modify HTML data. Suitable for Python 3.6 and up, it's a time-saver for data collection and parsing.
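To make the idea of a parse tree concrete, here is a small sketch (with a made-up document) showing how Beautiful Soup lets you navigate by tag name and by relationships between elements:

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Body text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate directly by tag name...
print(soup.h1.text)         # Title

# ...or walk the tree: every element knows its parent, children, and siblings.
print(soup.h1.parent.name)  # body
```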

1. Installing Beautiful Soup

Before starting, ensure Python is installed on your system. Open your terminal or command prompt and install Beautiful Soup using pip:

pip install beautifulsoup4

2. Understanding Your HTML Target

Consider a basic HTML document for practice:

<!DOCTYPE html>
<html>
<head>
    <title>Exploring Web Scraping</title>
    <meta charset="utf-8">
</head>
<body>
    <h2>Web Scraping Techniques</h2>
    <p>Web scraping involves extracting data from websites...</p>
    <ul id="techniques">
        <li>HTML Parsing</li>
        <li>API Requests</li>
        <li>Dynamic Content Handling</li>
    </ul>
</body>
</html>

Save this in a file named example.html.

3. Extracting HTML Tags

Use Beautiful Soup to list HTML tags:

from bs4 import BeautifulSoup

with open('example.html', 'r') as file:
    html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")

    for tag in soup.descendants:
        if tag.name:
            print(tag.name)

4. Content Extraction from Tags

To display the content within specific HTML tags:

print(soup.h2)
print(soup.p)
print(soup.find('li'))

Use .text to get only the text part of the tags.
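The difference is easy to see side by side. This self-contained sketch rebuilds a piece of the sample document inline so it runs on its own:

```python
from bs4 import BeautifulSoup

html = "<h2>Web Scraping Techniques</h2>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h2)       # <h2>Web Scraping Techniques</h2>  (the whole tag)
print(soup.h2.text)  # Web Scraping Techniques           (just the text)
```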

5. Finding Elements by ID

To find an element by ID:

print(soup.find('ul', id='techniques'))

6. Extracting All Instances of a Tag

To list all <li> tags:

for li in soup.find_all('li'):
    print(li.text)

7. Using CSS Selectors

Use CSS selectors to target elements:

print(soup.select('head > title'))
print(soup.select_one('body > ul#techniques > li:first-of-type'))
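Note the difference between the two methods: select() returns a list of every match, while select_one() returns only the first match (or None). A small self-contained sketch, using an inline fragment of the sample document:

```python
from bs4 import BeautifulSoup

html = "<ul id='techniques'><li>HTML Parsing</li><li>API Requests</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all matching elements.
items = soup.select("ul#techniques > li")
print([li.text for li in items])  # ['HTML Parsing', 'API Requests']

# select_one() returns the first match, or None if nothing matches.
print(soup.select_one("li:first-of-type").text)  # HTML Parsing
```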

8. Parsing Dynamic Elements with Selenium

For dynamic content, use Selenium alongside Beautiful Soup:

Note: Selenium needs a browser driver. With Selenium 4.6+, the matching ChromeDriver is downloaded automatically; on older versions, make sure ChromeDriver is on your PATH or in the same directory as your script.

from selenium import webdriver

# Launch Chrome, load the page, and capture the fully rendered HTML
driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/NIFTY_50")
dynamic_content = driver.page_source

with open('content.html', 'w') as file:
    file.write(dynamic_content)

# Close the browser once the page source is saved
driver.quit()

9. Exporting Data to CSV

Export parsed data using pandas:

import pandas as pd
from bs4 import BeautifulSoup

with open('content.html', 'r') as file:
    dynamic_content = file.read()

soup = BeautifulSoup(dynamic_content, "html.parser")

# Find the <li> tag with id 'toc-Annual_returns'
dynamic_element = soup.find('li', id='toc-Annual_returns')
print(dynamic_element.text)

# Collect the text of every <li> tag and export it to CSV
techniques = [li.text for li in soup.find_all('li')]
df = pd.DataFrame({'Web Scraping Techniques': techniques})
df.to_csv('web_scraping_techniques.csv', index=False)

This tutorial has walked through using Beautiful Soup for web scraping, from installation and basic tag extraction to handling dynamic pages with Selenium and exporting parsed data to a CSV file.

sunil s

Quant Developer & Mentor