
I have this problem: I need to scrape lots of different HTML data sources, and each data source contains a table with many rows, for example country name, phone number, price per minute.

I would like to build a semi-automatic scraper which will try to:

  1. find the right table in the HTML page automatically -- probably by searching the text for some sample data and trying to find the common HTML element that contains both (a rough sketch follows this list)

  2. extract the rows -- by looking at the above two elements and selecting the same pattern

  3. identify which column contains what -- by using some fuzzy algorithm to make a best guess at which column is which

  4. export it to a Python list or similar -- cleaning everything up
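To make step 1 concrete, here is a rough sketch of what I have in mind, assuming BeautifulSoup and two known sample values (the function name and arguments are just illustrative):

    from bs4 import BeautifulSoup

    def find_common_table(html_text, sample_a, sample_b):
        """Return the <table> that contains both sample strings, if any."""
        soup = BeautifulSoup(html_text, "html.parser")
        node_a = soup.find(string=lambda s: s and sample_a in s)
        node_b = soup.find(string=lambda s: s and sample_b in s)
        if node_a is None or node_b is None:
            return None
        # Walk up from the first match; the first ancestor <table> that
        # also contains the second match is the candidate.
        for ancestor in node_a.parents:
            if ancestor.name == "table" and any(d is node_b for d in ancestor.descendants):
                return ancestor
        return None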

Does this look like a good design? What tools would you choose if you were programming it in Python?

– kokoko

2 Answers


Does this look like a good design?

No.

What tools would you choose if you were programming it in Python?

Beautiful Soup

find the right table in the HTML page automatically -- probably by searching the text for some sample data and trying to find the common HTML element that contains both

Bad idea. A better idea is to write a short script that finds all tables and dumps each table along with its XPath. A person looks at the tables and copies the right XPath into a script.
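A minimal sketch of such a profiling script, assuming lxml (the function name is illustrative):

    from lxml import html

    def profile_tables(page_source):
        """Dump every table's XPath and its first row so a person can choose."""
        tree = html.fromstring(page_source)
        for table in tree.xpath("//table"):
            # getpath() yields an absolute XPath such as /html/body/div[2]/table[1]
            path = tree.getroottree().getpath(table)
            first_row = table.xpath("(.//tr)[1]/*")
            print(path, "->", [cell.text_content().strip() for cell in first_row])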

extract the rows -- by looking at the above two elements and selecting the same pattern

Bad idea. A better idea is to write a short script that finds all tables and dumps each table with its headings. A person looks at the table and configures a short block of Python code to map the table columns to data elements in a namedtuple.
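That hand-configured block might look something like this (the field names and column positions are invented for illustration):

    from collections import namedtuple

    Rate = namedtuple("Rate", ["country", "phone_code", "price_per_minute"])

    # Configured by hand for this particular site, after looking at the dump.
    COLUMN_MAP = {"country": 0, "phone_code": 1, "price_per_minute": 3}

    def row_to_rate(cells):
        """Map a list of cell strings to a Rate using the hand-made mapping."""
        return Rate(**{field: cells[i].strip() for field, i in COLUMN_MAP.items()})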

identify which column contains what -- by using some fuzzy algorithm to make a best guess at which column is which

A person can do this trivially.

export it to a Python list or similar -- cleaning everything up

Almost a good idea.

A person picks the right XPath to the table. A person writes a short snippet of code to map column names to a namedtuple. Given these parameters, a Python script can then fetch the table, map the data, and produce useful output.
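Putting those pieces together, a hypothetical end-to-end sketch, assuming lxml and that a person has already supplied the XPath and the column map (both values here are placeholders):

    from collections import namedtuple
    from lxml import html

    Rate = namedtuple("Rate", ["country", "phone_code", "price_per_minute"])

    TABLE_XPATH = "/html/body/div[2]/table[1]"   # picked by a person
    COLUMN_MAP = {"country": 0, "phone_code": 1, "price_per_minute": 3}

    def scrape(page_source):
        """Yield one Rate per data row of the configured table."""
        tree = html.fromstring(page_source)
        table = tree.xpath(TABLE_XPATH)[0]
        for tr in table.xpath(".//tr")[1:]:      # skip the heading row
            cells = [td.text_content().strip() for td in tr.xpath("./td")]
            if len(cells) <= max(COLUMN_MAP.values()):
                continue                         # skip malformed rows
            yield Rate(**{f: cells[i] for f, i in COLUMN_MAP.items()})

Note that nothing in this script tries to guess; the two configured values carry all the per-site knowledge.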

Why include a person?

Because web pages are filled with notoriously bad errors.

After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magical "trying to find" and "selecting the same pattern" isn't a good idea and doesn't work.

It's easier to write a simple script to create a "data profile" of the page.

It's easier to write a simple script that reads a configuration file and does the processing.
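One possible shape for that configuration file and the code that reads it, assuming the standard-library configparser (the section name and keys are invented):

    # A possible INI layout:
    #
    #   [example-rates]
    #   table_xpath = /html/body/div[2]/table[1]
    #   columns = country:0, phone_code:1, price_per_minute:3

    import configparser

    def load_site_config(path, section):
        """Read one site's table XPath and column mapping from an INI file."""
        cp = configparser.ConfigParser()
        cp.read(path)
        site = cp[section]
        pairs = (pair.strip().split(":") for pair in site["columns"].split(","))
        columns = {name: int(index) for name, index in pairs}
        return site["table_xpath"], columns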

– S.Lott
    +1 for "notoriously bad errors." The only thing worse than scraping html is scraping syntactically incorrect and arbitrarily written html. – waffle paradox Jul 28 '11 at 02:03

I cannot see a better solution.

It is convenient to use XPath to find the right table.
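For example, with lxml (an assumption; any XPath-capable library would do, and the table id here is hypothetical):

    from lxml import html

    page = "<html><body><table id='rates'><tr><td>UK</td><td>44</td></tr></table></body></html>"
    tree = html.fromstring(page)
    rows = tree.xpath("//table[@id='rates']//tr")
    print(len(rows))  # 1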

– eugene_che