Web Scraping with Requests-HTML in Python

2022-06-14

Requests-HTML is a Python library that makes parsing HTML and scraping the web easy.

I used to scrape with BeautifulSoup, but I find Requests-HTML much easier to use.
It is especially handy when I need to parse JavaScript-rendered pages!

This post lists the basic usage for those who have never used it or are not yet used to it.


Basic Usage

Most scripts start by creating an HTMLSession and fetching a page with it.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/?page=1')
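
HTMLSession behaves like a regular requests session, so the query string can be passed as a dict and the response can be checked before parsing. The following is only a minimal sketch: the helper name fetch_page and its page and timeout arguments are my own for illustration, not part of the library.

```python
def fetch_page(session, url, page=1, timeout=10):
    """Fetch one page of a paginated URL and fail loudly on HTTP errors.

    `session` is expected to be a requests_html.HTMLSession (any
    requests-style session works). The helper itself is hypothetical.
    """
    # Same request as 'https://example.com/?page=1', but the query
    # string is built from a dict instead of being written by hand.
    r = session.get(url, params={"page": page}, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    r.raise_for_status()
    return r
```

With the session created above, r = fetch_page(session, 'https://example.com/', page=1) should give the same kind of response object used in the rest of this post.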

Render JavaScript

To parse the rendered HTML, call the render() method on r.html after assigning the result of session.get() to a variable (‘r’ here).

*By the way, if you’re using WSL (Windows Subsystem for Linux), you must start an X server such as GWSL or VcXsrv beforehand, because the library uses Chromium for rendering.
I don’t cover that setup in this post, so please search for it if you’re not familiar with it.

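r.html.render()

# Wait 20 seconds after the initial render so slow scripts can finish
# (if the page load itself times out, raise timeout=, which defaults to 8 seconds)
r.html.render(sleep=20)

# Press Page Down 2000 times to reach the bottom of the page
r.html.render(scrolldown=2000, sleep=0)

# Run a JavaScript function in the rendered page
script = '''
() => {
    alert("Render!");
}
'''
r.html.render(script=script)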
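
A script passed to render() can also return a value, which render() hands back to Python. Here is a small sketch of that idea. The helper name page_info and the fields the script collects are assumptions for illustration. Only render() and its script and timeout keyword arguments come from the library.

```python
# JavaScript evaluated inside the rendered page. The object it returns
# comes back to Python as a dict.
PAGE_INFO_SCRIPT = """
() => {
    return {
        title: document.title,
        linkCount: document.querySelectorAll('a').length,
    }
}
"""


def page_info(html, timeout=20):
    """Render `html` (for example r.html) and return what the script reports.

    The helper name is hypothetical; it only forwards to html.render().
    """
    # timeout is the page-load limit in seconds (the library default is 8).
    return html.render(script=PAGE_INFO_SCRIPT, timeout=timeout)
```

Calling page_info(r.html) on the response from earlier should return a dict with the keys 'title' and 'linkCount'.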

Find Elements

Extracting the elements you want is easy.
Just call find() with a CSS selector.

# Get a list of all matching elements
a_tags = r.html.find('a')

# Get only the first matching element
a_tag = r.html.find('a', first=True)

# Select by id (still returns a list)
a_id_example = r.html.find('a#example')

# Select by class name
a_class_example = r.html.find('a.example')
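
One detail worth knowing: without first=True, find() returns a list even for an id selector, and with first=True it returns None when nothing matches. A minimal sketch of handling the no-match case follows. The helper name first_text and its default argument are hypothetical.

```python
def first_text(html, selector, default=""):
    """Return the text of the first element matching `selector`.

    `html` is expected to be r.html (anything with a compatible find()
    method works). Falls back to `default` when nothing matches.
    """
    element = html.find(selector, first=True)
    # find(..., first=True) gives None instead of raising on no match.
    if element is None:
        return default
    return element.text
```

For example, first_text(r.html, 'a#example', default='(no link)') avoids an AttributeError on pages that lack the element.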

Attributes

You can read attributes such as ‘href’ and ‘class’ from the element’s attrs dictionary.

a_tag.attrs           # all attributes as a dict
a_tag.attrs['class']  # tuple of class names
a_tag.attrs['href']
a_tag.attrs['id']
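
Because attrs is a plain dict, indexing a missing attribute raises KeyError, and not every link has an id or class. The sketch below reads attributes defensively with get(). The helper name describe_link and the shape of its result are my own choices for illustration.

```python
def describe_link(element):
    """Summarise an element's attributes without raising KeyError.

    `element` is expected to be an Element returned by find().
    """
    attrs = element.attrs
    return {
        # get() returns None when the attribute is absent.
        "href": attrs.get("href"),
        "id": attrs.get("id"),
        # 'class' is stored as a tuple of names. Default to an empty one.
        "classes": list(attrs.get("class", ())),
    }
```

describe_link(a_tag) then works the same way for a link with or without an id.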

Inner Text

Get the text content of an element.

a_tag.text
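
Putting find(), attrs, and text together, a common task is listing every link on a page with its label. This is only a sketch under my own assumptions: the helper name link_table and the base_url argument are hypothetical, and relative links are resolved with the standard library's urljoin.

```python
from urllib.parse import urljoin


def link_table(html, base_url):
    """Return (text, absolute_url) pairs for every <a> that has an href.

    `html` is expected to be r.html. `base_url` is the address the page
    was fetched from and is used to resolve relative links.
    """
    rows = []
    for a in html.find("a"):
        href = a.attrs.get("href")
        if not href:
            # Skip anchors without a target, such as <a name="top">.
            continue
        rows.append((a.text.strip(), urljoin(base_url, href)))
    return rows
```

For the page fetched earlier, link_table(r.html, 'https://example.com/') yields one tuple per link, ready to print or write to a CSV file.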