Requests-HTML is a Python library that makes parsing HTML and scraping the web easy.
I used to use BeautifulSoup for scraping, but I found Requests-HTML much easier to work with. It is especially handy when I need to parse JavaScript-rendered pages!
In this post I list the basic usage for those who have never used it or are not yet used to it.
Basic Usage
Most of the time we start from HTMLSession: create a session, then fetch a page with get().
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/?page=1')
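Once the page is fetched, the parsed document is available as `r.html`. Here is a minimal sketch of pulling a couple of things out of it. The helper name `summarize_page` is my own invention for illustration, not part of the library; it only relies on `find()` with a CSS selector and the `absolute_links` attribute.

```python
def summarize_page(html):
    """Return the title text and the sorted absolute links of a parsed page.

    `html` is expected to be the `r.html` object from the snippet above.
    `summarize_page` is a hypothetical helper, not a Requests-HTML function.
    """
    # find() takes a CSS selector; first=True returns one element or None
    title = html.find("title", first=True)
    return {
        "title": title.text if title is not None else None,
        # absolute_links is a set, so sort it for a stable order
        "links": sorted(html.absolute_links),
    }
```

With the session above you would call it as `summarize_page(r.html)`.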
Render JavaScript
When you want to parse the rendered HTML, you must call the render() function after assigning the result of session.get() to a variable (‘r’ here).
*By the way, if you’re using WSL (Windows Subsystem for Linux), you must start an X server such as GWSL or VcXsrv in advance, because the library uses Chromium for rendering. I don’t cover that in this post, so please search for it if you’re not familiar with it.
r.html.render()
# Avoid TimeoutError: sleep for 20 seconds after the initial render
r.html.render(sleep=20)
# Scroll down to the bottom of the page
r.html.render(scrolldown=2000, sleep=0)
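The options above can be combined in one call. Below is a rough sketch of a small wrapper; `render_fully` and its parameter names are hypothetical, while `scrolldown`, `sleep`, and `timeout` are keyword arguments of `render()`. As far as I understand the library, `scrolldown` is the number of page-downs, and when it is set, `sleep` is applied after every scroll, so a large `scrolldown` with a large `sleep` can take a very long time.

```python
def render_fully(html, scroll_pages=0, settle_seconds=0, timeout=8.0):
    """Render JavaScript on `html` (an `r.html` object) and return the markup.

    `render_fully` is a hypothetical helper; it only forwards to render().
    """
    html.render(
        scrolldown=scroll_pages,  # how many times to page down (0 = no scrolling)
        sleep=settle_seconds,     # seconds to pause after rendering / each scroll
        timeout=timeout,          # seconds to wait for the page before giving up
    )
    # After render(), the html attribute holds the rendered markup as a string
    return html.html
```

For example, `render_fully(r.html, scroll_pages=10, settle_seconds=1)` would scroll ten times with a one-second pause after each scroll.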