> Does this look like a good design?
No.
> What tools would you choose to do it in if you program in Python?
Beautiful Soup.
> Find the right table in the HTML page automatically, probably by searching the text for some sample data and trying to find the common HTML element which contains both.
Bad idea. A better idea is to write a short script that finds all the tables and dumps each table along with the XPath to it. A person looks at the dump and copies the right XPath into a script.
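A minimal sketch of such a profiling script, assuming lxml is used for the XPath side (Beautiful Soup itself does not expose XPaths; the function name and preview format are illustrative):

```python
from lxml import html

def dump_tables(page_source: str) -> None:
    """Print the XPath of every table, plus a preview of its first rows."""
    tree = html.fromstring(page_source)
    root = tree.getroottree()
    for table in tree.iter("table"):
        print(root.getpath(table))                 # the XPath a person copies
        for row in list(table.iter("tr"))[:3]:     # first few rows as a preview
            cells = [c.text_content().strip() for c in row.iter("td", "th")]
            print("   ", cells)
```

A person runs this once per page, reads the output, and picks the table that actually holds the data.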
> Extract the rows by looking at the above two elements and selecting the same pattern.
Bad idea. A better idea is to write a short script that finds all the tables and dumps each table with its headings. A person looks at the dump and writes a short block of Python code to map the table columns to fields of a namedtuple.
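The hand-written mapping block might look like this. The `Payment` tuple and the column positions are illustrative, copied by hand from the table dump; they are not part of the original answer:

```python
from collections import namedtuple

Payment = namedtuple("Payment", ["name", "date", "amount"])

COLUMN_MAP = {0: "name", 1: "date", 2: "amount"}   # table column -> field

def make_row(cells):
    """Map a list of cell strings to a Payment using COLUMN_MAP."""
    return Payment(**{field: cells[i].strip() for i, field in COLUMN_MAP.items()})
```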
> Identify which column contains what, by using some fuzzy algorithm to make a best guess.
A person can do this trivially.
> Export it to some Python (or other) list, cleaning everything.
Almost a good idea.
A person picks the right XPath to the table. A person writes a short snippet of code to map the column names to a namedtuple. Given these two parameters, a Python script can fetch the table, map the data, and produce some useful output.
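A sketch of that final script, reusing the `make_row()` mapping from the snippet above; the URL, the XPath, and `urllib` as the fetching mechanism are all illustrative assumptions:

```python
import urllib.request
from lxml import html

URL = "http://example.com/report.html"         # illustrative
TABLE_XPATH = "/html/body/div[2]/table[1]"     # copied by hand from the dump

def main():
    with urllib.request.urlopen(URL) as resp:
        tree = html.fromstring(resp.read())
    table = tree.xpath(TABLE_XPATH)[0]
    for tr in table.xpath(".//tr")[1:]:        # skip the heading row
        cells = [td.text_content() for td in tr.xpath("./td")]
        print(make_row(cells))                 # mapped, cleaned output

if __name__ == "__main__":
    main()
```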
Why include a person?
Because web pages are notoriously filled with errors.
After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magical "trying to find" and "selecting the same pattern" aren't a good idea and don't work.
It's easier to write a simple script to create a "data profile" of the page.
It's easier to write a simple script that reads a configuration file and does the processing.
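One way such a configuration-driven script might look, assuming a JSON file with `table_xpath` and `columns` keys; the format and key names are my assumptions, not from the answer:

```python
import json
from collections import namedtuple
from lxml import html

# Example profile.json, written by a person after reading the dump:
# {"table_xpath": "/html/body/div[2]/table[1]",
#  "columns": ["name", "date", "amount"]}

def process(config_path: str, page_source: str):
    """Extract the table described by a hand-edited configuration file."""
    with open(config_path) as f:
        cfg = json.load(f)
    Row = namedtuple("Row", cfg["columns"])
    table = html.fromstring(page_source).xpath(cfg["table_xpath"])[0]
    for tr in table.xpath(".//tr")[1:]:            # skip the heading row
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        if len(cells) == len(cfg["columns"]):      # skip malformed rows
            yield Row(*cells)
```

Per-page variation then lives in a small data file a person maintains, while the code stays the same.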