Bypassing Cloudflare When Web Scraping with Python, requests, & BeautifulSoup

After work today I started playing around with web scraping in Python using requests and BeautifulSoup, following along with the tutorials in the book Web Scraping with Python by Ryan Mitchell. Specifically, I want to be able to scrape the AngelList website to create my own angel investor database for a project I’m calling AngelDB.xyz. However, I quickly ran into problems related to Cloudflare and their anti-bot protections.

This isn’t the first time Cloudflare has foiled one of my legitimate projects (they pretty much rendered my Chrome Extension, the Internet Archivist’s Intrepid Extension, useless after a few months). Luckily this time around, I found a pretty sweet library to help me bypass Cloudflare and scrape on:

cloudflare-scrape by Anorov

I haven’t gotten an opportunity to play with the library just yet, as I just discovered it a few minutes ago and wanted to bookmark it here. However, after I get Node.js installed here on my Windows machine I plan on taking it for a spin. Wish me luck!

UPDATE (5/4/2019 10:43PM): After playing around with cloudflare-scrape for a little bit I could not get it to bypass Cloudflare’s bot-security measures. Just as before, I ended up receiving the same Cloudflare challenge HTML instead of the page that I actually wanted. So if anyone happens to stumble upon this blog post, I’m skeptical that cloudflare-scrape will actually work for you.

UPDATE (1/13/2021 92:24AM): I noticed this blog post has been getting some traffic, so I wanted to post this update. A year after originally writing this I’ve discovered that the real answer to getting past Cloudflare is to use a proper web scraping service. This is a very common problem in web scraping, so common that there are many services available to help get past roadblocks like Cloudflare. I personally suggest Scraping Bee (https://www.scrapingbee.com). Scraping Bee is an excellent web scraping service which I have been using for my latest web scraping project. It’s very easy to get started with and is available at a very reasonable cost. The best part about it is that you will only need to tweak your Python/BeautifulSoup code a little bit to get it working. Essentially, instead of scraping pages directly, you’ll ping Scraping Bee’s servers and they’ll pass the HTML/XML etc. back to you.
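To make that last point a bit more concrete, here is a minimal sketch of what "pinging the service instead of the page" can look like. The endpoint and the api_key / url parameter names are my understanding of Scraping Bee's HTTP API and should be treated as assumptions, so check their docs before relying on them. The function names and YOUR_API_KEY are just placeholders I made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and parameter names; verify against the service's docs.
API_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_request_url(api_key, target_url):
    """Return the service URL that asks it to fetch target_url for us."""
    # urlencode percent-escapes the target URL so it survives as one parameter.
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


def fetch_html(api_key, target_url, timeout=60):
    """Fetch target_url through the service and return the page HTML as text."""
    with urlopen(build_request_url(api_key, target_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (needs a real key and network access, so it is left commented out):
# html = fetch_html("YOUR_API_KEY", "https://example.com")
# soup = BeautifulSoup(html, "html.parser")  # the rest of your code stays the same
```

The only real change from scraping directly is the URL you request; the HTML that comes back can be handed to BeautifulSoup exactly as before.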

 

Fat Camp Revisited

Sometimes I like to use this blog to post notes to myself, sort of like a journal for the 21st century. And for this post, that’s what I want to do: bookmark a few interesting links that I think might help me build this idea that I’ve had floating around for a while now, Fat Camp.

What is Fat Camp? Essentially, I want to build a fitness competition app similar to the annual October fitness challenge as seen on the Joe Rogan Podcast, but instead of famous comedians from the show competing against each other, anyone can join the competition and lose weight.

I think this would be fairly easy to do with an Apple Watch, but I really want to be able to add weight lifting into the app (and possibly a scale as well). While I suppose users could enter their reps into the app manually, I suspect that too many people would cheat for this to really work. However, I think I may have found a semi-suitable workaround in Android Things.

Prior to tonight I had never heard of Android Things, the IoT platform from Google. But apparently, Google has gone out and created an entire SDK for building IoT apps powered by a headless version of Android, and they even have a Raspberry Pi-powered prototyping kit to go with it! Pretty awesome, huh?

Right now I have a lot on my plate, so I don’t think I will realistically have time to work on this at the moment. But at some point in the near future (next 5 years?) I would like to play around with this and see how far I can get:

  1. Android Things
  2. Get Started with Kits (Android Things)
  3. Android Things Starter Kit ($199)
 

When Does the Pioneer Tournament Start? (pioneer.app)

On a road trip recently from Dallas to Austin I happened to catch an episode of the IndieHackers podcast featuring Daniel Gross of Pioneer. Since first hearing the episode I’ve become sort of obsessed with Pioneer and their month-long Pioneer Tournament. If you haven’t already heard of Pioneer, it’s essentially an online startup accelerator where founders compete in a month-long tournament to receive $7,000 in early-stage / super-pre-seed funding. Maybe not quite as nice as YC’s $150,000 standard deal, but pretty good for an online accelerator!

Now that I think about it, a better comparison would probably be YC’s Startup School: a very similar concept with a similar amount of funding awarded to the winners. But I don’t know, I think Pioneer seems like it’s going to be more fun than Startup School. Startup School is more of a MOOC, whereas Gross has really gone out of his way to gamify Pioneer to make it fun.

Okay, so when does it start?

Since the tournament is a month long I assumed that registration would begin the first week of every month. But after doing a little research, it appears that the tournament doesn’t run every single month. There tends to be a little bit of a gap between tournaments where the Pioneer team must be making tweaks and whatnot to their software.

As of May 2019 there have been four month-long Pioneer Tournaments. According to my research on the Wayback Machine, these tournaments began on:

  1. August 19th, 2018
  2. October 28th, 2018
  3. February 10th, 2019
  4. April 7th, 2019

So there you have it. If anyone else out there on the internet can’t wait for the next tournament to start, now you have some sort of idea when the next one might begin.
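For anyone who wants to turn those four dates into an actual guess, here is a quick sketch that works out the gaps between start dates. The projected date at the end is just my own naive assumption (last start plus the average gap), not anything Pioneer has announced.

```python
from datetime import date, timedelta

# Start dates of the first four tournaments, as listed above.
starts = [
    date(2018, 8, 19),
    date(2018, 10, 28),
    date(2019, 2, 10),
    date(2019, 4, 7),
]

# Days between each pair of consecutive start dates.
gaps = [(later - earlier).days for earlier, later in zip(starts, starts[1:])]
print(gaps)  # [70, 105, 56]

# Naive estimate only: assume the next gap equals the average of past gaps.
average_gap = sum(gaps) / len(gaps)
naive_next_start = starts[-1] + timedelta(days=round(average_gap))
print(average_gap, naive_next_start)  # 77.0 2019-06-23
```

With only three gaps ranging from 56 to 105 days, the average is a very rough guide at best.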

 

Tony Romo’s Two Minute Drill

Silly Python game for my code class students:

 

Product Hunt Doesn’t Work

A little bit of a salty post here today. I launched MoneyPhone on Product Hunt this morning and to my surprise I didn’t get a single signup. But apparently LinkedIn on the Blockchain got 295 upvotes? This is the third or fourth time I’ve launched something on Product Hunt and I don’t think I’ve ever actually gotten a signup or download. So, despite all of the hype and buzz around Product Hunt, I don’t think the site really works. Well, maybe it sort of works for a small minority of people, but I’m going to assume that the majority of startups that launch on Product Hunt probably don’t end up getting much value out of it.

And this goes for more than just Product Hunt. I’ve noticed that link sharing websites in general that use an upvote/downvote system (such as Reddit and Hacker News) only really work if you pay for traffic. If you don’t pay to promote your link, the chances of it climbing the rankings and sending any traffic your way are pretty much zero.

So in conclusion, I don’t think Product Hunt really works and you’re better off just paying $100 or whatnot to promote your new app on Reddit or Twitter, etc.

 

Learn to Code in 30 Seconds

Computer programming is the act of writing (typing) instructions for a computer to follow, as opposed to controlling a computer using a mouse or touch screen. The basic commands we give the computer are called functions. More advanced commands/instructions include loops which instruct the computer to repeat things, and if-statements which allow us to instruct the computer on how to handle decision making. Along with loops and functions we can also instruct the computer to store data with variables and arrays.
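Here is a tiny Python sketch that uses every idea from the paragraph above: a function, an if-statement, a loop, a variable, and an array (Python calls its built-in array a list). The names describe, numbers, and results are just ones I made up for the example.

```python
def describe(number):
    """A function: a named command we can reuse."""
    # An if-statement: the computer decides which message to return.
    if number % 2 == 0:
        return f"{number} is even"
    return f"{number} is odd"


numbers = [1, 2, 3]  # a variable storing a list (Python's version of an array)
results = []         # another variable, starting as an empty list

# A loop: repeat the same instructions once for each number.
for n in numbers:
    results.append(describe(n))

print(results)  # ['1 is odd', '2 is even', '3 is odd']
```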

 

You need supervisord

This blog post would have been more appropriately titled, “I need supervisord,” but I decided to go with a more clickbaity title instead. Hope you forgive me. Okay… so what’s all this supervisord business about, and why do you and I need it?

The other day I launched my latest side project, MoneyPhone, a personal finance web app powered by Flask, into production on a live virtual server, and I have been struggling with the fact that every morning when I wake up the app has quit working! The fix is pretty simple: I just restart my Ubuntu virtual server and everything starts working again. However, I don’t want to have to keep doing this every morning, and neither do you! So it looks like you and I need supervisord in our lives.

For the past two days I’ve been searching for answers regarding this bug in my free time, and the name supervisord seems to keep popping up, so I’m going to assume that this is probably what I’m looking for. And this morning I happened to stumble upon a pretty interesting article about how Steve Huffman, co-founder and developer of Reddit, once had the exact same problem after launching Reddit. The author of the aforementioned post writes:

Waiting around to restart your web server is painful – Steve Huffman, one of the founders of Reddit talks about literally sleeping with his laptop next to his bed and constantly waking to restart a process at ungodly hours of the morning… Eventually he discovered supervisord and got some sleep.

UPDATE (5/1/2019): I actually did not end up needing supervisord or any related software to fix my woes. In the end, I just needed to beef up my $5 per month virtual server from DigitalOcean to a $40 per month VPS… whoops! So I suppose a more appropriate title for this blog post would have been: You don’t need supervisord, but you probably need more computing power. Since beefing up the server, the app hasn’t crashed once.

 

Requesting Access to the Plaid API Production Environment

This blog post is brought to you by the developer of BitBudget. BitBudget is an automated budgeting app for Android and iOS which syncs with your bank account and helps you avoid overspending. If you’d like to quit living paycheck-to-paycheck and get a better handle on your finances, download it today! https://bitbudget.io

Just submitted my request to Plaid to push MoneyPhone from the Plaid Development Environment to the Live Production Environment with Unlimited Users and No Rate Limiting. Sort of nervous and hoping everything goes well with my request. Wish me luck!

 

moneyphone.app is live

moneyphone.app is live! (and one month ahead of schedule).

 

Why Your Startup Shouldn’t Pay the State of Texas $750

This isn’t really an original post, but I sort of just wanted to save this information here on the blog because it could mean the difference between success and failure for anyone starting a startup in the State of Texas. If you’ve ever started a startup before, one thing you may find out early on is that the administrative costs of operating a Delaware C Corporation can quickly exceed all of your other operating costs combined! At $2,000 to incorporate through a service like Clerky, plus another $800 per year in miscellaneous state filing fees, this is likely to add up to more than you would spend on your DigitalOcean server bill in a year.

At the moment I’m considering forming another C Corporation for one of my latest software ventures, MoneyPhone, but I’ve sort of been balking at the idea of spending all that money once again. I know that if I don’t end up receiving funding within 12 months or so, I’ll ultimately need to shut the C Corporation down just to keep expenses under control, even if the app is doing okay.

Thankfully I stumbled upon this excellent blog post from a startup lawyer here in Texas who claims you just shouldn’t bother paying the State of Texas their crazy $750 foreign entity registration fee, and that you can save yourself another few hundred bucks a year by skipping the yearly filing fees (until your startup is actually generating revenue). This doesn’t eliminate all of the costs that I mentioned with starting a Delaware C Corporation in the State of Texas, but I do estimate that it could cut the fees in half. Also, by switching from Clerky to a cheaper service like Stripe Atlas, you can likely cut the cost of incorporation from $2,000 to more like $500 or less.

And last, here is a link to the original blog post if anyone is interested: When do I “really” need to qualify my Delaware-formed startup in Texas?

Anyway, I’ll probably make another post here on the blog sometime soon should I end up incorporating my latest venture. One thing I’m a little worried about is that startup incorporation services like Stripe Atlas may force you to register with the state. If that’s the case, I’m not sure exactly what I’m going to do. I know conventional wisdom says you should always incorporate as a Delaware C Corporation, but from my experience it’s often just a money sink. Spend $3,000 to please potential investors, but then never actually raise a dime. Hmm…

UPDATE: According to the “Startup Cheat-Sheet” you can actually find examples/templates of all the needed paperwork online for free and register as a Delaware C Corporation for $139 going the DIY route: Startup Cheat-Sheet: How to Incorporate Your Company