Monday, May 04, 2009

Web scraping tutorial

Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites.

I took on a freelance job to extract all the hotel listings for a particular UK city from the Yellow Pages. I wrote a simple PHP script that used curl to fetch the pages, parsed them with regular expressions to pull out the required data, and populated a database. Sorry: I was not aware of the site's policies at the time and probably ignored them.
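A rough sketch of that fetch, regex and database flow, written in Python rather than the original PHP (the script itself is not reproduced here). The HTML structure, the `hotels` table, the `example-scraper/0.1` user agent and all function names are assumptions made up for illustration; a real listing page would need its own pattern.

```python
import re
import sqlite3
import urllib.request

# Hypothetical markup: each listing is assumed to look like
# <div class="listing"><h2>NAME</h2><span class="phone">PHONE</span></div>
LISTING_RE = re.compile(
    r'<div class="listing">\s*<h2>(?P<name>.*?)</h2>\s*'
    r'<span class="phone">(?P<phone>.*?)</span>',
    re.DOTALL,
)


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_listings(html):
    """Return a list of (name, phone) tuples matched by LISTING_RE."""
    return [
        (m.group("name").strip(), m.group("phone").strip())
        for m in LISTING_RE.finditer(html)
    ]


def save_listings(conn, listings):
    """Insert the extracted rows into a (hypothetical) hotels table."""
    conn.execute("CREATE TABLE IF NOT EXISTS hotels (name TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO hotels (name, phone) VALUES (?, ?)", listings
    )
    conn.commit()


# Made-up sample data so the sketch runs without touching any real site.
SAMPLE_HTML = """
<div class="listing"><h2>Grand Hotel</h2><span class="phone">020 7946 0000</span></div>
<div class="listing">
  <h2> Sea View Inn </h2>
  <span class="phone">020 7946 0001</span>
</div>
"""

conn = sqlite3.connect(":memory:")
save_listings(conn, extract_listings(SAMPLE_HTML))
print(conn.execute("SELECT name, phone FROM hotels").fetchall())
```

On a live site, `fetch(url)` would supply the HTML in place of `SAMPLE_HTML`. Regular expressions like this break as soon as the markup changes, which is part of why a parser-based approach tends to be more robust.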

There is now a PHP library called Simplehtmldom that makes it easier to write web scrapers. More information can be found here.
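To give a feel for parser-based scraping as opposed to regular expressions, here is a minimal sketch using Python's standard `html.parser` module. This is not Simplehtmldom's API; it is only a stand-in for the same general idea. The `listing` class name, the `HotelNameParser` class and the `hotel_names` helper are hypothetical.

```python
from html.parser import HTMLParser


class HotelNameParser(HTMLParser):
    """Collect the text of every <h2> found inside a <div class="listing">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._div_depth = 0  # nesting depth inside a listing div (0 = outside)
        self._in_h2 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._div_depth > 0:
                # A nested div inside a listing: track it so its closing
                # tag does not end the listing early.
                self._div_depth += 1
            elif "listing" in (dict(attrs).get("class") or "").split():
                self._div_depth = 1
        elif tag == "h2" and self._div_depth > 0:
            self._in_h2 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == "h2" and self._in_h2:
            self._in_h2 = False
            self.names.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)


def hotel_names(html):
    """Return the hotel names found in the given HTML string."""
    parser = HotelNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up sample markup: attribute order and extra classes do not matter here.
SAMPLE = (
    '<h2>Header</h2>'
    '<div id="a" class="listing featured"><h2>Grand <b>Hotel</b></h2></div>'
    '<div class="other"><h2>Ad</h2></div>'
)
print(hotel_names(SAMPLE))
```

Unlike a regex, the parser does not care about attribute order, extra classes or inline tags inside the heading, so small markup changes are less likely to break the scraper.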

2 comments:

Fuller said...

MetaSeeker is a free Web scraper factory. A new scraper for a target site is created in minutes without coding a single line.

XPath, XSLT and XML are used to express Web data extraction rules and to store extraction results.


It can be downloaded for free from http://www.gooseeker.com/en/node/download/front

Anonymous said...

Interesting points on web scrapers. For simple stuff I use Python to get or simplify data. Data extraction can be a time-consuming process, but for other projects that involve documents, files, or the web I tried "website scraper", which worked great. They build quick custom screen scrapers, web scrapers, and data parsing programs.