Web scraping is an automated process for extracting data from web pages, and since Python is one of the most popular programming languages, it's common to see people (like me :)) use it for web scraping tasks.
For a long time, I have been using Beautiful Soup 4 to extract data from web pages' HTML markup. It's a popular, easy, robust, and battle-tested library for navigating, searching, and modifying the DOM tree. But recently I came across Parsel, another HTML parsing library, one that supports XPath selectors, which Beautiful Soup lacks. I needed something that could extract data from HTML using XPath (other than Scrapy; funnily enough, I later learned that Scrapy uses Parsel under the hood :D), so I decided to give it a try.
Thoughts and Tricks After Usage
Parsel is so powerful! It saved me much more time than bs4 usually does, mainly because of the easy way it provides to access a DOM node's HTML, text, attributes, and other values, with a handy option to fall back to a default value if the required data doesn't exist.
Parsel uses lxml for parsing the web page, which can result in a huge performance improvement according to selectolax's benchmark (selectolax itself is an interesting library to try as well).
Parsel is easy to use: all you need to do is import the `parsel` package, then use `Selector(text=response.text)` to load an HTML string into a selector object.
```python
# code snippet from https://parsel.readthedocs.io/en/latest/usage.html
from parsel import Selector
text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
selector = Selector(text=text)
```
`.extract_first()` and `.get()` are equivalent, and so are `.extract()` and `.getall()`; the shorter names are the ones the Parsel docs recommend.
`.xpath()` and `.css()` methods can be chained!
The `::text` pseudo-element can be better than Beautiful Soup's `.text` because it won't be `None` even with empty text. The same goes for `.get()` (without a default value).
Use `*::text` to select all descendant text nodes of the current selector.
If you want to use regular expressions (regex) to extract data from a selector's string, all you need to do is use the `.re()` and `.re_first()` methods. However, unlike the `.css()` and `.xpath()` methods, the regex methods return plain strings (a list for `.re()`, a single string for `.re_first()`) rather than selectors, so they can't be chained.
The proper way to work with relative XPaths is to prefix the path with a dot (e.g. `.//p` instead of `//p`); otherwise the query runs against the whole document.
`drop()` can be used to remove elements from the parsed tree via a Selector; this is similar to `Tag.decompose()` in bs4, and it can't be undone.
When querying by class, consider using CSS instead of XPath.
You can convert CSS to XPath using Parsel's `css2xpath()` function.
Because of lxml, the HTML Parsel parses may sometimes differ from what the browser shows; here are more details about this issue.
A CSS cheatsheet with a good summary of CSS selectors.
An XPath playground to try out your XPath queries.