
Data Scraping, A Fine Line…

January 3, 2011

If you've ever browsed the programming section of any freelance site, chances are you've seen people or companies looking for someone to grab data from another website.  A quick note here: I'm calling it data scraping, but it goes by a few names (web scraping, screen scraping, etc.).  In the end it's usually the same thing.  Data scraping usually involves writing some code that goes out to a specified website (or websites), pulls out specific data, and writes it somewhere else: usually a database, but possibly a text file, a CSV file, or really anything.  The whole process revolves around being able to find patterns in the code of the website.  If you're bored, go to a website, view the source, and look for a pattern in the code.  Tables, specifically, are a scraper's best friend.  Generally you'd want to grab the info inside the table, which could look like: <td>info you want</td><td>more info you want</td>.  There's your pattern (the <td></td>'s); now you just tell your scraper to grab what you want from between the tags.
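The pattern idea above can be sketched in a few lines of Python. The sample HTML here is made up for illustration, not taken from any real site:

```python
import re

# Invented sample markup, mirroring the <td> example above.
html = "<tr><td>info you want</td><td>more info you want</td></tr>"

# A non-greedy match grabs whatever sits between each pair of <td> tags.
cells = re.findall(r"<td>(.*?)</td>", html)
print(cells)  # → ['info you want', 'more info you want']
```

Real-world HTML is messier than this (attributes, nesting, whitespace), so a proper HTML parser is usually sturdier, but the idea is the same: find the repeating pattern and pull out what sits inside it.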

I first heard about it around 7 years ago at my old job when I was asked to do it.  The very first question I asked was whether it was legal.  I never got a definitive answer, but I did it anyway because I had just started at the company and didn't want to make a big deal about it.  It did seem shady to me, but I eventually found out through my own research that what I was specifically doing was not technically illegal.

The information I was scraping was considered historical fact, and nobody owns facts; they are what they are.  An example of something that may fall into the category of "legally scrapable" would be zip codes.  There are still a few other things you may need to consider, though, such as a website's T.O.S. that forbids scraping (the data itself would be fine, but you could be considered to be trespassing on the website's server).

Websites do have ways to combat this, but as anyone who owns a computer knows, as soon as someone figures out a way to fix a problem, someone else figures out a way around that fix.  For example, a website might block an IP address, temporarily or permanently, when it notices that address reading far too many pages in too short a period of time; basically, it's not acting the way a human would.  To get around that, a person might run their scraper through a proxy server, or multiple proxy servers, to constantly change the IP address that the website sees.  That's a broad example; there are more sophisticated ways to protect against scraping, and more sophisticated ways to get around those protections.
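The "rotate through proxies" idea boils down to handing out a different address for each request. This is only a sketch: the proxy addresses are made up, and a real scraper would plug each one into its HTTP client (for example, urllib's ProxyHandler) before fetching a page.

```python
import itertools

# Hypothetical proxy list; these addresses are invented for the example.
proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

# cycle() loops back to the start once the list is exhausted.
rotation = itertools.cycle(proxies)

def next_proxy():
    """Return a different proxy address for each successive request."""
    return next(rotation)

for _ in range(4):
    print(next_proxy())  # the 4th call wraps back around to the first proxy
```

Of course, as the post says: if you find yourself needing this, the site probably doesn't want you there in the first place.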

Why do people and companies even bother doing it if there's a chance it could be illegal?  It saves money and time.  Instead of having a person or a group of people manually enter all of this into a database (or wherever you want it), why not have a computer do it for you in a fraction of the time, and with guaranteed accuracy as long as the code is good?  In that case, the company doesn't have to pay the salary of a data entry person, or pull an existing employee away from other work to do it.

In the end, if you're considering creating a scraper for a customer or company, I'd recommend doing your research to make sure you're not violating the website's T.O.S. or any laws.  If you're being asked to scrape original content, meaning content that is not fact but someone's opinion or ideas, there's a good chance you're violating something and should probably steer clear of that project.  Also, if you find yourself constantly having to work around a website's security, like using a bunch of different proxies, chances are they don't want you scraping their site, or your scraper looks like a denial-of-service attack.  Just be smart and do your research so you don't get yourself or your employer into any kind of trouble.

If you're new to programming or data scraping, a good site to check out would be:  If you're going to scrape, you will definitely want to familiarize yourself with regular expressions!
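To show why regular expressions are worth learning for this, here's a sketch that uses capture groups to turn table rows into structured records, echoing the zip-code example earlier in the post. The markup is invented for illustration:

```python
import re

# Invented sample table: each row holds a zip code and a city name.
html = (
    "<tr><td>10001</td><td>New York</td></tr>"
    "<tr><td>60601</td><td>Chicago</td></tr>"
)

# Two capture groups per row: findall() returns a list of (zip, city) tuples.
rows = re.findall(r"<tr><td>(\d{5})</td><td>(.*?)</td></tr>", html)
print(rows)  # → [('10001', 'New York'), ('60601', 'Chicago')]
```

Once the data comes out as tuples like this, writing it to a database or CSV file is the easy part.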