Scraping on a Budget

You need a scraper!

The one thing you are going to need is a scraper to help speed things up. I highly recommend ScrapeBox if you do not already own it. Since my blog is mainly about Internet marketing on a budget, you can also use the free version of GScraper, which is pretty awesome (at least for the price ;)). It will also remove duplicates as you go, which ScrapeBox version 1.x cannot (version 2 can). The nice thing about ScrapeBox is that it has a built-in proxy scraper. Trying to scrape Google without proxies is just going to land you in ban land.

The easy way to get set up for scraping:

1. Open up ScrapeBox and scrape some public proxies. Once you have some Google-passed proxies, take 5 or 10 of them and put them into the keyword section of the harvester. Set the time to the past 24 hours and scrape. This should give you a small list of URLs.

2. Remove duplicate URLs (not domains) and export the list as a plain text file.

3. Open the proxy harvester and click Harvest Proxies. Click Add Source, choose to import a list of URLs, and import the list you just harvested. Hit Start and wait until it finishes. Once it's finished, remove any sources with 0 found results. I usually repeat this step every couple of days when scraping with public proxies.

4. When checking these proxies, set the connections to max and check for anonymous only to start with. Chances are you will have a huge list of proxies to check, and only checking for anonymity will save a great deal of time. Once you have a list of verified anonymous proxies, save them to a file. Keeping a master list of anonymous proxies will pay off in the future: most people churn and burn proxies, and dead ones often come back to life days or weeks down the road.

5. Now check the proxies against Google. I actually retest the failed proxies a second time against Google after the first run. This almost always gives me more Google-passed proxies.

Now you are set up and ready to scrape. Depending on how many proxies you have, you should be able to set the maximum connections for scraping to around 250.
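If you ever want to script the "check once, then retest the failures" trick outside ScrapeBox, the logic is simple. Here is a minimal Python sketch; the `check` function is injected and is an assumption on my part (in practice it would fetch a Google results page through the proxy and confirm it isn't a block page), not anything ScrapeBox exposes:

```python
def two_pass_check(proxies, check):
    """Run check() over proxies, then retest only the failures once.

    Public proxies are flaky rather than dead, so a second pass over
    the failures almost always recovers a few more passed proxies.
    """
    passed = [p for p in proxies if check(p)]
    failed = [p for p in proxies if p not in passed]
    # Second pass: give every failure one more chance.
    passed += [p for p in failed if check(p)]
    return passed
```

The same pattern works for the anonymous check or the Google check; only the injected `check` function changes.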

The hard but free way to scrape:

You are going to need the free version of GScraper. You can download it here: http://blog.gscraper.com/index.php/2013/01/29/gscraper-basicfree-version-updated-now-support-windows-8/

You are also going to need some free proxies.

Gather Proxy is very easy to use, but I think it is limited to 2,000 proxies. ProxyFire is free and works better than any other proxy scraper; however, it is fairly complicated to get set up and running. Unfortunately, setting up ProxyFire would require a huge tutorial of its own, and I do not have the time right now. You can find tutorials for ProxyFire here: http://www.proxyfire.net/forum/showthread.php?t=3374
Once you have your proxies, you're pretty much ready to go. Fire up GScraper and put your proxies in the proxy section. Don't forget to set any options you may need; I always set it to remove duplicate URLs while scraping. This is the one feature I really like (although it also appears in ScrapeBox version 2.x). You will need to figure out how many connections to use based on your proxies, PC, and internet speed.
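The "remove duplicate URLs while scraping" option boils down to keeping a set of everything seen so far and dropping repeats as results stream in, instead of as a cleanup pass at the end. A minimal sketch of the idea (my own illustration, not GScraper's actual code):

```python
def dedupe_stream(urls):
    """Yield each URL the first time it appears, dropping repeats.

    Works on any iterable, so it can sit directly on a stream of
    harvested results without holding duplicate copies in memory.
    """
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```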

The key to scraping is proxies:

I personally use semi-dedicated private proxies from BuyProxies for scraping. I run very few threads with these proxies but get much better results than with public proxies. Since we are focusing on a small budget or no budget, here are some programs you can scrape public proxies with.

http://www.gatherproxy.com/gptool
The free version is limited, but you can get around 2,000 proxies with it.
http://www.proxyfire.net/
This is complicated to use, but it beats every other proxy scraper out there.
http://www.project2025.com/charon.php
This is more of a proxy checker, but it works better than any other.
http://www40.zippyshare.com/v/69499165/file.html
This will let you import a list of URLs and scrape those sites for proxies.
