Scraping with Custom Headers
Custom headers are a great tool for getting access to more detailed data than is available to unverified users.
They can be used for development testing or to get past the protection barriers on different websites. Websites with huge traffic and valuable data, such as prices or live statistics, usually have their own protection systems that help them detect bots and parsers.
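As a starting point, here is a minimal sketch of passing custom headers with Python's requests library; the URL and the header values are placeholders for illustration, not any specific site's requirements:

```python
# Minimal sketch: sending custom headers with the `requests` library.
# The target URL and header values below are illustrative placeholders.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}

response = requests.get("https://example.com/products", headers=headers, timeout=10)
print(response.status_code)
```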
Let's take a look at the most common protections and how to overcome them:
1) IP address
The first thing to look at when you're creating a scraper is your IP address.
To parse a huge number of pages you'll need rotating IP addresses that are not blacklisted in public databases.
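A minimal sketch of such rotation with requests might look like the following; the proxy addresses are placeholders for a real pool from a provider or your own infrastructure, checked against public blacklists:

```python
# Sketch: cycling through a pool of proxies with `requests`.
# The proxy URLs are placeholders, not real endpoints.
import itertools
import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch(url: str) -> requests.Response:
    """Send the request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

resp = fetch("https://example.com/page/1")
print(resp.status_code)
```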
2) User-Agent
You'll also need to specify a User-Agent in each of these requests. But it shouldn't be something randomly generated, since it's easy to check whether a user agent belongs to a real device or is faked. The best solution is to maintain your own database of User-Agents containing records from real mobile, desktop and tablet devices, and to select the User-Agent for every request from the most recent records in that database. This helps you avoid errors like “Your browser is outdated, please download a newer version.”
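One possible implementation, assuming a small SQLite table of user agents collected from real devices (the table and column names here are made up for the sketch):

```python
# Sketch: picking a recent, real-device User-Agent from a local SQLite
# database. The table name and columns (user_agents, ua_string, last_seen)
# are assumptions for illustration.
import random
import sqlite3
import requests

def pick_user_agent(db_path: str = "user_agents.db") -> str:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT ua_string FROM user_agents "
        "ORDER BY last_seen DESC LIMIT 100"
    ).fetchall()
    conn.close()
    # Choose randomly among the 100 most recently seen user agents
    return random.choice(rows)[0]

headers = {"User-Agent": pick_user_agent()}
resp = requests.get("https://example.com/", headers=headers, timeout=10)
```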
3) Google Analytics or cookies from other services
Sometimes bots can be detected by the absence of cookies from the services that are used on the targeted website. To pass this step of verification, it's a good idea to make a list of the services the website uses and include the needed cookies in every request you send to it.
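A sketch of what that can look like with requests; the cookie names follow Google Analytics conventions, but the values here are fabricated placeholders and would normally be captured from or modeled on a real browser session:

```python
# Sketch: attaching cookies that an analytics script would normally set.
# Cookie names follow Google Analytics conventions (_ga, _gid); the values
# are fabricated placeholders.
import requests

cookies = {
    "_ga": "GA1.2.123456789.1700000000",   # placeholder client id
    "_gid": "GA1.2.987654321.1700000000",  # placeholder session id
}

resp = requests.get("https://example.com/catalog", cookies=cookies, timeout=10)
print(resp.status_code)
```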
4) JS support check
Some websites, especially those that rely on heavy JS frameworks and AJAX requests, won't show you any valuable data without JS rendering. There are two ways to handle this situation: either use browser emulation for this type of website, or dive into the requests the front end sends to the back end and figure out which headers and parameters are needed to extract the data. Though the latter might be a complex task, it is usually the more reliable way to get the data, as the front end changes relatively often compared to the server-side API.
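Here is a sketch of that second approach, calling a hypothetical back-end JSON endpoint directly; the endpoint path, headers and query parameters are assumptions and would in practice be discovered by inspecting the XHR/fetch requests in the browser's developer tools:

```python
# Sketch: calling a site's back-end JSON endpoint directly instead of
# rendering its JavaScript front end. The endpoint path, headers and
# parameters below are hypothetical examples.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",   # often expected by AJAX back ends
    "Referer": "https://example.com/catalog",
}
params = {"page": 1, "sort": "price"}       # hypothetical query parameters

resp = requests.get("https://example.com/api/v1/products",
                    headers=headers, params=params, timeout=10)
data = resp.json()
print(len(data.get("items", [])))
```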