Web Scraping Using Requests-HTML in Python
By Hideki Ishiguro
Requests-HTML is a Python library that makes parsing HTML and scraping the web easy.
I used to use BeautifulSoup for scraping, but I found Requests-HTML much easier to work with.
It is especially handy when I need to parse JavaScript-rendered pages!
In this post I list the basic usage for anyone who has never used it, or isn't used to it yet.
Basic Usage
Most of the work goes through an HTMLSession.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://example.com/?page=1')
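Just to confirm the request worked, you can check the status code and see which links the parser picked up. This is only an illustrative sketch using the same placeholder URL as above:
# HTTP status code of the response
print(r.status_code)
# Links found on the page, as relative and absolute URLs
print(r.html.links)
print(r.html.absolute_links)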
Render JavaScript
When you want to parse JavaScript-rendered HTML, call the render() function on the response returned by session.get() (assigned to 'r' here).
*By the way, if you're using WSL (Windows Subsystem for Linux), you need to have an X server such as GWSL or VcXsrv running in advance, because the library uses Chromium for rendering.
I won't cover that in this post, so please search for the details if you're not familiar with it.
r.html.render()
# Avoid TimeoutError by sleeping after the initial render
r.html.render(sleep=20)
# Scroll down to the bottom of the page (scrolldown is the number of times to page down)
r.html.render(scrolldown=2000, sleep=0)
# Run a JavaScript snippet during rendering
script = '''
() => {
    alert("Render!");
}
'''
r.html.render(script=script)
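In the version I used, render() also accepts wait and timeout keyword arguments: wait is how many seconds to wait before loading the page, and timeout is how long Chromium may take before a TimeoutError is raised. The numbers below are just examples, not recommendations:
# Wait 2 seconds before loading and allow up to 30 seconds for rendering
r.html.render(wait=2, timeout=30)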
Find Elements
It is very easy to extract the elements you want.
Just use find().
# Get all matching elements (returns a list)
a_tags = r.html.find('a')
# Get only the first matching element
a_tag = r.html.find('a', first=True)
# Select by id
a_id_example = r.html.find('a#example')
# Select by class name
a_class_example = r.html.find('a.example')
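find() takes any CSS selector, and it also has a containing keyword for filtering elements by their text. The selector and the 'Next' text below are made up purely for illustration:
# Any CSS selector works, e.g. links inside a pagination block
pagination_links = r.html.find('div.pagination a')
# Only elements whose text contains the given string
next_link = r.html.find('a', containing='Next', first=True)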
Attributes
You can get attributes such as 'href' and 'class'.
a_tag.attrs
a_tag.attrs['class']
a_tag.attrs['href']
a_tag.attrs['id']
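attrs is just a dictionary, so reading an attribute that isn't present raises a KeyError; using .get() is safer for optional attributes. (As far as I can tell, 'class' comes back as a tuple of class names rather than a single string.) A small sketch:
# Safer access for attributes that may be missing
href = a_tag.attrs.get('href')          # None if the tag has no href
classes = a_tag.attrs.get('class', ())  # default value if there is no class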
Inner Text
Get the text content of an element.
a_tag.text
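Putting it all together, here is a minimal end-to-end sketch that fetches a page, finds every link, and prints its text and href. example.com is just a placeholder, and real pages may need render() before find():
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/?page=1')

# Uncomment if the page builds its content with JavaScript
# r.html.render()

for a_tag in r.html.find('a'):
    # Some links may not have an href, so use .get()
    print(a_tag.text, a_tag.attrs.get('href'))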