Selenium is a collection of open-source projects for automating web browsers. It offers bindings for most major programming languages, with Python being one of the most popular choices.
Through its API, Selenium uses the WebDriver protocol to drive browsers such as Chrome, Firefox, or Safari. It can control both browsers installed locally and browsers running remotely over a network.
Initially created about two decades ago, Selenium's primary use was for cross-browser, end-to-end testing, including acceptance tests. Over time, it has evolved into a versatile tool for browser automation, widely used for tasks like web crawling and scraping. A real browser is often the most effective tool for interacting with websites.
Selenium offers many ways to interact with a website (a short sketch follows this list), including:
Clicking on buttons
Filling out forms
Scrolling through pages
Capturing screenshots
Handling login and authentication processes
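As a quick sketch of a few of these interactions (the URL is a placeholder and the commented-out locators are hypothetical, so adapt them to your target page):
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://example.com')  # placeholder URL

# fill a form field and click a button (hypothetical locators)
# browser.find_element(By.NAME, 'q').send_keys('selenium')
# browser.find_element(By.ID, 'submit').click()

# scroll to the bottom of the page
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')

# capture a screenshot
browser.save_screenshot('page.png')

browser.quit()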
Installation Guide
For our example, we'll use the Chrome browser. Ensure the following are installed:
Chrome (downloadable from the Chrome download page)
ChromeDriver (compatible with your Chrome version)
Selenium Python bindings package
To install Selenium, it's advised to use a virtual environment (e.g., using venv) and run:
pip install selenium
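For example, to create and activate a virtual environment before installing (Linux/macOS shown; on Windows, run venv\Scripts\activate instead):
python -m venv venv
source venv/bin/activate
pip install selenium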
Getting Started
After installing Chrome, ChromeDriver, and the Selenium package, you can launch the browser with:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

CHROMEDRIVER_PATH = '/your/chromedriver/path'
browser = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
browser.get('https://google.com')
This code opens a regular Chrome window controlled by Selenium. (With Selenium 4.6 and later, Selenium Manager can download a matching driver automatically, so webdriver.Chrome() with no path also works.)
Headless Chrome Mode
For development, observing the browser's behavior in a visible window is helpful. In production, however, it's better to use headless mode, where Chrome runs in the background without a GUI. This saves resources, especially on servers where a GUI serves no purpose. To enable headless mode:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options)
browser.get('https://www.google.com/')
This code runs Chrome in the background.
Exploring WebDriver Page Properties
Continuing in headless mode, let's visit the Google homepage:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options)
browser.get('https://www.google.com/')
# get the HTML of the current page
print(browser.page_source)
# get the URL of the current page
print(browser.current_url)
# get the title of the current page
print(browser.title)
This script prints the HTML of https://www.google.com/. WebDriver also exposes properties such as browser.title for the page's title and browser.current_url for the current URL, the latter being particularly useful on sites that redirect. More properties are detailed in the WebDriver documentation.
Web scraping fundamentally involves pinpointing the exact location of the data you aim to gather. Identifying elements on a webpage is a critical aspect of this process. Selenium, a popular tool for web scraping, is equipped with features to facilitate this. It is essential for testing scenarios as well, ensuring the presence or absence of particular elements on a webpage.
Methods to Identify Web Elements
There are several approaches to locating a specific element on a webpage:
By Tag Name: Selecting elements using their tag name.
By HTML Class or ID: Filtering elements through their designated HTML class or ID.
Using CSS Selectors or XPath: Employing CSS selectors or XPath for more precise targeting.
For XPath in particular, it is worth gaining a solid understanding of how it can be used to navigate the DOM tree. There are numerous beginner-friendly resources for learning XPath and its applications.
Practical Approach to Element Location
A practical way to locate an element is with the Chrome Developer Tools. Hover over the desired element and press Ctrl + Shift + C (Cmd + Shift + C on macOS); this is faster than the usual right-click-and-inspect approach.
Once the element is identified in the DOM tree, decide the best strategy to programmatically interact with it. For example, you can copy its absolute XPath or CSS selector from the inspector by right-clicking the element.
The find_element Methods in WebDriver
WebDriver offers two primary methods to locate elements:
find_element: retrieves a single (the first matching) element.
find_elements: returns a list of all matching elements.
Both methods support eight locator strategies, provided by the By class:
By.ID: locates elements by their HTML ID.
By.NAME: finds elements by their name attribute.
By.XPATH: uses XPath expressions to locate elements.
By.LINK_TEXT: finds link elements by their exact text content.
By.PARTIAL_LINK_TEXT: locates link elements by a substring of their text content.
By.TAG_NAME: searches for elements by their tag name.
By.CLASS_NAME: identifies elements by their HTML classes.
By.CSS_SELECTOR: locates elements using CSS selectors.
Example of find_element Usage
Consider the following HTML snippet:
<html>
<head>
<!-- some content -->
</head>
<body>
<h1 class="someclass" id="greatID">Super title</h1>
</body>
</html>
To select the <h1> element, any of these five find_element calls would be effective:
h1 = driver.find_element(By.TAG_NAME, 'h1')
h1 = driver.find_element(By.CLASS_NAME, 'someclass')
h1 = driver.find_element(By.XPATH, '//h1')
h1 = driver.find_element(By.XPATH, '/html/body/h1')
h1 = driver.find_element(By.ID, 'greatID')
For selecting all anchor (<a>) tags on a page, use find_elements:
all_links = driver.find_elements(By.TAG_NAME, 'a')
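From there, you can iterate over the returned list; for instance, to print each link's href attribute:
for link in all_links:
    print(link.get_attribute('href'))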
Choosing XPath for Complex Scenarios
When elements lack a unique ID or a simple class, or when multiple elements share the same class or ID, XPath is a versatile tool. It can locate elements anywhere on a page, either by their absolute position in the DOM or relative to other elements, and that flexibility makes it particularly powerful for complex web scraping tasks.
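As an illustration, suppose a page contains a row like <tr><td>Product A</td><td class="price">9.99</td></tr> (hypothetical markup, not from a real site). A relative XPath can anchor on the product name to find the neighboring price cell:
from selenium.webdriver.common.by import By

# find the price cell relative to the cell containing the product name
price = driver.find_element(
    By.XPATH, "//td[text()='Product A']/following-sibling::td[@class='price']"
)
print(price.text)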
In Selenium, a WebElement object represents an HTML element within a webpage. This object allows various interactions with HTML elements. Key operations include:
Retrieving the text of an element with element.text.
Clicking an element with element.click().
Fetching an element's attribute with element.get_attribute('class').
Typing text into a text field with element.send_keys('exampleText').
Additionally, methods like is_displayed() are available. This method checks whether an element is visible to users, which is particularly useful for detecting honeypots, a tactic websites use to identify bot activity. For instance, consider an input element defined as <input type="hidden" id="exampleId" name="exampleName" value="">. A genuine user would never fill this hidden field, since it is not visible in the browser. Bots, however, might fill it and thereby reveal themselves.
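A minimal sketch of guarding against such traps, assuming a form that may contain hidden decoy inputs (the field name checked here is illustrative):
from selenium.webdriver.common.by import By

# only fill inputs that are actually visible to a human user
for field in driver.find_elements(By.TAG_NAME, 'input'):
    if not field.is_displayed():
        continue  # skip hidden honeypot fields
    if field.get_attribute('name') == 'username':  # illustrative field name
        field.send_keys('standard_user')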
Practical Example with Selenium
Consider a task where we need to log into a website; here we'll use the Sauce Demo test site (https://www.saucedemo.com/):
Visit the login page with driver.get().
Identify and fill in the username field using driver.find_element and element.send_keys().
Do the same for the password field.
Locate and click the login button using find_element and element.click().
The process is straightforward. The code snippet for this would be:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options)

url = 'https://www.saucedemo.com/'
browser.get(url)
browser.find_element(By.XPATH, '//*[@id="user-name"]').send_keys('standard_user')
browser.find_element(By.XPATH, '//*[@id="password"]').send_keys('secret_sauce')
browser.find_element(By.XPATH, '//*[@id="login-button"]').click()
time.sleep(5)
browser.quit()
Handling Dynamic Web Content with Selenium
Modern websites frequently use JavaScript-heavy front-ends, utilizing frameworks such as Angular, React, or Vue.js. These frameworks complicate web scraping tasks as they dynamically alter the DOM and perform asynchronous operations, including AJAX requests. This dynamic nature necessitates a different approach to scraping, as the content of interest might not be immediately available upon page load.
Instead of a simple HTTP request, a more nuanced approach is needed to ensure that the necessary JavaScript has executed and the desired content is available. Two common strategies are:
Implementing a fixed pause with time.sleep().
Using Selenium's explicit waits (WebDriverWait), which poll the page until a condition is met.
Using time.sleep() is straightforward but less efficient. You must guess an appropriate wait time, risking either unnecessary delay or insufficient waiting, and the right value varies with network speed and server response times. Here is the login example again, this time pausing before checking whether the login succeeded:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options)

url = 'https://www.saucedemo.com/'
browser.get(url)
browser.find_element(By.XPATH, '//*[@id="user-name"]').send_keys('standard_user')
browser.find_element(By.XPATH, '//*[@id="password"]').send_keys('secret_sauce')
browser.find_element(By.XPATH, '//*[@id="login-button"]').click()

# fixed pause: hope the post-login page has rendered by now
time.sleep(5)

html_code = browser.page_source
if 'Logout' in html_code:
    print('Successfully logged in')
else:
    print('Login failed')
browser.quit()
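The explicit-wait alternative is more robust: WebDriverWait polls until a condition holds, up to a timeout. A minimal sketch for the same Sauce Demo flow, assuming the post-login page contains an element with class inventory_list (check the site's markup before relying on it):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.saucedemo.com/')
browser.find_element(By.XPATH, '//*[@id="user-name"]').send_keys('standard_user')
browser.find_element(By.XPATH, '//*[@id="password"]').send_keys('secret_sauce')
browser.find_element(By.XPATH, '//*[@id="login-button"]').click()

# wait up to 10 seconds for the post-login page, polling instead of sleeping
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'inventory_list'))
)
print('Successfully logged in')
browser.quit()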