After work today I started playing around with web scraping in Python using requests and BeautifulSoup, following along with the tutorials in the book Web Scraping with Python by Ryan Mitchell. Specifically, I want to scrape the AngelList website to build my own angel investor database for a project I’m calling AngelDB.xyz. However, I quickly ran into problems with Cloudflare and its anti-bot protections.
This isn’t the first time Cloudflare has foiled one of my legitimate projects (it pretty much rendered my Chrome extension, the Internet Archivist’s Intrepid Extension, useless after a few months). Luckily, this time around I found a pretty sweet library to help me bypass Cloudflare and scrape on:
cloudflare-scrape by Anorov
I haven’t had a chance to play with the library yet, as I only discovered it a few minutes ago and wanted to bookmark it here. However, once I get Node.js installed on my Windows machine, I plan on taking it for a spin. Wish me luck!
UPDATE (5/4/2019 10:43PM): After playing around with cloudflare-scrape for a little bit, I could not get it to bypass Cloudflare’s bot-security measures. I ended up receiving the same Cloudflare HTML as before instead of the page I actually wanted. So if anyone happens to stumble upon this blog post, I’m skeptical that cloudflare-scrape will actually work for you.
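For anyone hitting the same wall, here is a minimal sketch of how a scraper might notice that it got Cloudflare's challenge HTML rather than the real page. It is only a heuristic and not part of cloudflare-scrape: the function name `looks_like_cloudflare_challenge`, the marker strings, and the status codes it checks are all my own assumptions about what challenge pages tend to look like, so treat a result as a hint rather than proof.

```python
from typing import Mapping

# Assumed phrases that tend to show up in Cloudflare challenge/block pages.
# This is a hypothetical, hand-picked list, not an official one.
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "attention required",
)


def looks_like_cloudflare_challenge(
    status_code: int, headers: Mapping[str, str], body: str
) -> bool:
    """Guess whether a response is a Cloudflare challenge page (heuristic)."""
    # Find the Server header without relying on the mapping being
    # case-insensitive (a plain dict is not).
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value.lower()

    # If the response was not served by Cloudflare, it is not their challenge.
    if "cloudflare" not in server:
        return False

    # Assumption: challenge/block pages usually come back as 403, 429 or 503.
    if status_code not in (403, 429, 503):
        return False

    # Finally, look for one of the assumed marker phrases in the HTML.
    text = body.lower()
    return any(marker in text for marker in CHALLENGE_MARKERS)
```

With requests, you would call it as `looks_like_cloudflare_challenge(resp.status_code, resp.headers, resp.text)` right after `resp = requests.get(url)`, and skip parsing with BeautifulSoup whenever it returns True.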