How to Add your Proxies to CBT Web Scraper and Data Extractor
Why do you need proxies for web scraping?
If you are planning to scrape websites using multiple threads, you will need to use proxies. If you send requests too quickly from a single IP address, the search engines and websites will work out that this is non-human behaviour. At first you could be presented with captchas, and later your IP address could get banned. Proxies connect from different IP addresses, which means that the websites you are scraping cannot tell that the requests are coming from the same computer or IP address.
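The idea of spreading requests across several IP addresses can be sketched in a few lines of Python. This is only an illustration of round-robin rotation, not how CBT Web Scraper works internally; the proxy addresses (taken from a documentation-only IP range), the example.com URLs and the function name `assign_proxies` are all made up for the example.

```python
from itertools import cycle

# Hypothetical proxy addresses from a documentation-only range, not real servers.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]

def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]

# Five placeholder pages spread over three proxies.
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
plan = assign_proxies(urls, PROXIES)
for url, proxy in plan:
    print(f"{url} -> via {proxy}")
```

With three proxies, the fourth request goes out through the first proxy again, so each IP address only sees a third of the traffic.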
What are the different types of Proxies you could use for website scraping and email extraction?
There are many types of proxies on the market, but we would recommend private, shared or backconnect rotating proxies. Private proxies are the best because they are used only by you. However, they are also the most expensive. Next in line are shared proxies, which work in the same way as private proxies but are shared amongst several users.
Lastly, we have backconnect rotating proxies. Such proxies change the IP address with every connection or at set intervals of time, and they usually draw on a fairly large pool of IP addresses. For example, Storm Proxies has a pool of 70,000 proxies which are changed every week. Backconnect proxies are good in the sense that you get a different IP address with every request, and they are a cheaper alternative to shared and private proxies. However, backconnect rotating proxies tend to be used by a lot of people, which means that they are more likely to run into captchas and even bans. Our preference is therefore for private and shared proxies.
The software also has an option to use public proxies for the sake of completeness, but we do not recommend them. Public proxies tend to be very unreliable and are heavily abused by users all over the world.
Can I use a VPN for Web Scraping?
Yes, but we do not recommend it. Technically, you can use a VPN that changes its IP address at timed intervals. However, the issue with VPNs is that you will still need to run the software with a low number of threads. Likewise, VPNs are used by many people, so you are likely to get problematic IP addresses. If you are planning to use a VPN with a timed IP change, do not forget to check the "Use an Integrated Web Browser instead of an HTTP request" option on the main GUI.
How to add your proxies to CBT Web Scraper and Email Extractor
Run the software and go to Settings, then open the Proxy Settings tab. Here, you will be able to add your proxies. You can either upload your proxies from a plain text file (such as one saved from Notepad) or paste them from the clipboard. When adding the proxies, make sure that each one is in one of the following formats: IP:PORT or IP:PORT:USERNAME:PASSWORD. If you are using the first format, which carries no username or password, make sure that your proxies are authenticated in another way. This usually involves authorising, with your proxy provider, the IP address of the machine on which the CBT email grabber is running. You can also set the time interval for rotating the proxies.
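If you want to check a proxy list before uploading it, the two accepted formats are easy to validate yourself. The following is a minimal sketch in Python and is not part of the software; the names `Proxy`, `parse_proxy_line` and `parse_proxy_list` and the sample addresses are hypothetical. It assumes that the password contains no colon, since a colon would be indistinguishable from a field separator.

```python
from typing import NamedTuple, Optional

class Proxy(NamedTuple):
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None

def parse_proxy_line(line: str) -> Proxy:
    """Parse 'IP:PORT' or 'IP:PORT:USERNAME:PASSWORD' into a Proxy."""
    parts = line.strip().split(":")
    if len(parts) not in (2, 4):
        raise ValueError(f"expected 2 or 4 colon-separated fields: {line!r}")
    host, port_text = parts[0], parts[1]
    if not host or not port_text.isdigit() or not 1 <= int(port_text) <= 65535:
        raise ValueError(f"invalid host or port: {line!r}")
    if len(parts) == 4:
        return Proxy(host, int(port_text), parts[2], parts[3])
    return Proxy(host, int(port_text))

def parse_proxy_list(text: str):
    """Return (valid proxies, rejected lines) for a pasted or uploaded list."""
    valid, rejected = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            valid.append(parse_proxy_line(line))
        except ValueError:
            rejected.append(line.strip())
    return valid, rejected

# Made-up sample list: one entry per format plus one malformed line.
sample = "203.0.113.10:8080\n203.0.113.11:3128:alice:s3cret\nnot-a-proxy\n"
valid, rejected = parse_proxy_list(sample)
print(len(valid), "valid,", len(rejected), "rejected")
```

Any line that ends up in `rejected` is worth fixing before you add the list to the software.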
Then, you can test the proxies and remove non-working ones using our in-built proxy testing tool. You will need to wait until the tool finishes running.
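For readers curious about what a proxy test involves, here is a rough sketch using only the Python standard library. It is not the code behind the in-built testing tool. The test URL, the timeout, and the function names `proxy_url`, `is_working` and `keep_working` are assumptions made for illustration, and the sketch assumes HTTP proxies.

```python
import http.client
import urllib.request
from urllib.parse import quote

def proxy_url(entry: str) -> str:
    """Turn 'IP:PORT' or 'IP:PORT:USERNAME:PASSWORD' into an http:// proxy URL."""
    parts = entry.strip().split(":")
    if len(parts) == 4:
        host, port, user, password = parts
        # Percent-encode credentials so characters such as '@' do not break the URL.
        return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    if len(parts) == 2:
        host, port = parts
        return f"http://{host}:{port}"
    raise ValueError(f"unrecognised proxy format: {entry!r}")

def is_working(entry: str, test_url: str = "https://example.com/",
               timeout: float = 10.0) -> bool:
    """Return True if test_url can be fetched through the proxy within timeout."""
    try:
        url = proxy_url(entry)
        handler = urllib.request.ProxyHandler({"http": url, "https": url})
        opener = urllib.request.build_opener(handler)
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, timeouts, HTTP errors and malformed entries all count as failures.
        return False

def keep_working(entries, check=is_working):
    """Filter a proxy list down to the entries that pass the check."""
    return [entry for entry in entries if check(entry)]
```

Calling `keep_working(my_proxies)` on a list of proxy strings would make one request per proxy, which is why a test run over a long list can take a while to finish.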
Then, you will need to enable proxies on the main GUI by checking the "Use Proxies" option.
Tips for Scraping Google Search Engine, Google Maps and UK Yellow Pages (Yell.com)
If you are planning to scrape and extract data from Google Search Engine, Google Maps or UK Yellow Pages (Yell.com), we recommend that you use a decent number of quality shared or private proxies. UK Yellow Pages has very strict security and bans proxies very quickly. For this reason, backconnect proxies are ideal for UK Yellow Pages in particular, because they give you the benefit of thousands of IPs at the lowest price. Google and Google Maps are also very sensitive to web scraping and data extraction, and they tend either to ban proxies or to put them through the Google image captcha checkpoints. When scraping these sites, you can also increase the delay between requests on the main GUI. This is particularly useful if you have fewer proxies.
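To see why a longer delay helps when you have fewer proxies, it can be useful to estimate how often each individual proxy is used. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a description of the software: it assumes that each thread waits the configured delay between its own requests and that requests are spread evenly over the proxies. The function name `per_proxy_interval` is hypothetical.

```python
def per_proxy_interval(delay_seconds: float, proxy_count: int, threads: int = 1) -> float:
    """Approximate seconds between two requests sent through the same proxy.

    Assumes even rotation over all proxies and one request per thread
    every delay_seconds.
    """
    if proxy_count < 1 or threads < 1:
        raise ValueError("proxy_count and threads must be at least 1")
    return delay_seconds * proxy_count / threads

# Same 2-second delay on one thread, different numbers of proxies.
few = per_proxy_interval(2, proxy_count=10)
many = per_proxy_interval(2, proxy_count=100)
# Raising the delay to 20 seconds compensates for the smaller pool.
few_slower = per_proxy_interval(20, proxy_count=10)
print(few, many, few_slower)
```

Under these assumptions, ten proxies with a 20 second delay are each used about as often as one hundred proxies with a 2 second delay.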
Is it possible to Scrape Websites without Proxies?
Yes, you can still scrape some websites and search engines without proxies. However, you will need to increase the delay between requests and run each scraper on a single thread. For example, if you are scraping Bing and Yahoo, you will need to run the Bing scraper and the Yahoo scraper on a single thread each. The exceptions are the Yellow Pages and Yelp business directories, for which you will need proxies. Likewise, Google and Google Maps will also need proxies.
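Single-threaded scraping with a delay between requests amounts to a simple loop that pauses before every request except the first. The following Python sketch shows the pattern in isolation. The name `throttled`, the 5 second delay and the example search terms are assumptions for illustration, and no request is actually sent.

```python
import time

def throttled(items, delay_seconds, sleep=time.sleep):
    """Yield items one at a time, pausing delay_seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before every item except the first
        yield item

# Record the pauses instead of really sleeping, to show what would happen.
pauses = []
keywords = ["plumbers london", "dentists leeds", "florists bath"]
processed = [kw for kw in throttled(keywords, 5, sleep=pauses.append)]
print(processed, pauses)
```

In real use you would leave `sleep` at its default and perform one request inside the loop for each item.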