After work today I started playing around with web scraping in Python using requests and BeautifulSoup, following along with the tutorials in the book Web Scraping with Python by Ryan Mitchell. Specifically, I want to be able to scrape the AngelList website to create my own angel investor database for a project I’m calling AngelDB.xyz. However, I quickly ran into problems related to Cloudflare and their anti-bot protections.
This isn’t the first time Cloudflare has foiled one of my legitimate projects (they pretty much rendered my Chrome extension, the Internet Archivist’s Intrepid Extension, useless after a few months). Luckily, this time around I found a pretty sweet library to help me bypass Cloudflare and scrape on: cloudflare-scrape.
I haven’t had a chance to play with the library yet, as I just discovered it a few minutes ago and wanted to bookmark it here. However, after I get Node.js installed here on my Windows machine, I plan on taking it for a spin. Wish me luck!
UPDATE (5/4/2019 10:43PM): After playing around with cloudflare-scrape for a little bit, I could not get it to bypass Cloudflare’s bot-security measures, and I ended up receiving the same Cloudflare HTML instead of the page I actually wanted, just as before. So if anyone happens to stumble upon this blog post, I’m skeptical that cloudflare-scrape will actually work for you.
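For anyone debugging the same thing, here is a rough sketch of how you might tell whether a response is Cloudflare's interstitial page rather than the page you asked for. The function name is hypothetical, and the status codes and marker strings are assumptions based on what these pages have commonly looked like; Cloudflare can change them at any time, so treat this as a heuristic and not a reliable detector.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristically guess whether a response is a Cloudflare challenge page.

    status_code: HTTP status of the response (e.g. response.status_code)
    headers:     mapping of response headers (e.g. response.headers)
    body:        response text (e.g. response.text)
    """
    # Normalize header names so a plain dict works as well as requests' headers.
    normalized = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = "cloudflare" in normalized.get("server", "").lower()

    # Assumed markers: phrases that have appeared on challenge pages.
    markers = ("checking your browser", "just a moment", "cf-browser-verification")
    text = body.lower()
    has_marker = any(marker in text for marker in markers)

    # Assumed status codes: challenge pages have typically been 403 or 503.
    return status_code in (403, 503) and served_by_cloudflare and has_marker
```

With requests, you would call it as looks_like_cloudflare_challenge(response.status_code, response.headers, response.text) right after fetching a page.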
UPDATE (1/13/2021 92:24AM): I noticed this blog post has been getting some traffic, so I wanted to post this update. A year after originally writing this, I’ve discovered that the real answer to getting past Cloudflare is to use a proper web scraping service. This problem is so common in web scraping that there are many services available to help get past roadblocks like Cloudflare. I personally suggest Scraping Bee (https://www.scrapingbee.com), which I have been using for my latest scraping project. It’s very easy to get started with and is available at a very reasonable cost. The best part is that you only need to tweak your Python/BeautifulSoup code a little bit to get it working. Essentially, instead of scraping pages directly, you’ll ping Scraping Bee’s servers and they’ll pass the HTML/XML etc. back to you.
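To give a rough idea of how small that tweak is, here is a minimal sketch. The endpoint URL and the api_key/url parameter names are my assumptions about how the service's HTTP API is called, and the helper names are made up for illustration, so check the current Scraping Bee documentation before relying on any of it.

```python
# Assumed endpoint; verify against the service's current documentation.
SCRAPING_SERVICE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_request_params(target_url, api_key):
    """Return query parameters asking the service to fetch target_url for us."""
    # Assumed parameter names: "api_key" for your account key, "url" for the page.
    return {"api_key": api_key, "url": target_url}


def fetch_soup(target_url, api_key):
    """Fetch target_url through the scraping service and parse the HTML."""
    # Third-party imports live here so the helper above works without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        SCRAPING_SERVICE_ENDPOINT,
        params=build_request_params(target_url, api_key),
        timeout=60,
    )
    response.raise_for_status()  # fail loudly on a non-2xx reply
    return BeautifulSoup(response.text, "html.parser")
```

The only real change from scraping directly is that requests.get points at the service and the page you want goes in as a parameter; everything you do with the resulting soup object stays the same.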