Web Scraping Using Requests-HTML in Python

Requests-HTML is a Python library that makes parsing HTML and scraping the web easy.

I used to use BeautifulSoup for scraping, but I find Requests-HTML much easier to use.
It is especially handy when I need to parse JavaScript-rendered pages!

This post lists the basic usage for those who have never used it or are not yet used to it.


Basic Usage

Most of the time, you start from an HTMLSession.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/?page=1')
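
As a minimal sketch of the step above, the helper below wraps session.get() with a status check, so an error page never reaches the parsing code. The name fetch_page and the timeout default are assumptions for illustration, not part of Requests-HTML; session is expected to be an HTMLSession.

```python
def fetch_page(session, url, timeout=10):
    """Fetch url with the given session and fail early on HTTP errors.

    fetch_page is a hypothetical helper. session is assumed to be an
    HTMLSession, whose get() takes the same arguments as requests.Session.get().
    """
    r = session.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    return r


# Usage (needs network access):
# from requests_html import HTMLSession
# r = fetch_page(HTMLSession(), 'https://example.com/?page=1')
# print(r.html.links)
```

Passing the session in as an argument makes it easy to reuse one session across many pages.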

Render JavaScript

When you want to parse the rendered HTML, call render() on r.html after assigning the result of session.get() to a variable ('r' here).

*By the way, if you're using WSL (Windows Subsystem for Linux), you must start an X server such as GWSL or VcXsrv in advance, because the library uses Chromium for rendering.
I don't cover that setup in this post, so please search for it if you need it.

r.html.render()
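# Raise the page-load timeout (default: 8 seconds) to avoid a timeout error
r.html.render(timeout=20)
# Wait 20 seconds after the initial render so that slow scripts can finish
r.html.render(sleep=20)
# Scroll down: scrolldown is the number of page-downs, sleep is the pause after each one
r.html.render(scrolldown=2000, sleep=0)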
# Run a script in the page; render() returns the script's return value
script = '''
() => {
return document.title;
}
'''

title = r.html.render(script=script)
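
To tie the fetch and render steps together, here is a small sketch of a helper that returns the HTML after JavaScript has run. The name fetch_rendered and its default values are assumptions for illustration; timeout and sleep are passed straight through to render().

```python
def fetch_rendered(session, url, timeout=20, sleep=1):
    """Fetch url and return its HTML after JavaScript has run.

    fetch_rendered is a hypothetical helper; session is assumed to be an
    HTMLSession.
    """
    r = session.get(url)
    r.raise_for_status()
    # timeout: seconds allowed for the page load inside Chromium
    # sleep: seconds to wait after the initial render
    r.html.render(timeout=timeout, sleep=sleep)
    return r.html


# Usage (needs network access and a working Chromium):
# from requests_html import HTMLSession
# html = fetch_rendered(HTMLSession(), 'https://example.com/?page=1')
# print(html.find('a'))
```

Note that the first call to render() downloads Chromium, so it can take a while.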

Find Elements

It is very easy to extract the elements you want.
Just use find() with a CSS selector.

# List all elements
a_tags = r.html.find('a')
# Get first element
a_tag = r.html.find('a', first=True)
# Specify id name
a_id_example = r.html.find('a#example')
# Specify class name
a_class_example = r.html.find('a.example')

Attributes

You can read attributes such as 'href' and 'class' through attrs.

a_tag.attrs
a_tag.attrs['class']
a_tag.attrs['href']
a_tag.attrs['id']

Inner Text

Get the text content of an element.

a_tag.text