10

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, Python, or JavaScript.

This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying and pasting the table into TextMate and then running regular expressions over it.

What do you suggest?

This is the script that I ended up writing, but as I said, I'm looking for a more general solution.

from BeautifulSoup import BeautifulSoup
import urllib2


# Fetch the page; the time zone data lives in the second table
url = 'http://www.datamystic.com/timezone/time_zones.html'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    # Data rows have exactly four cells; ad rows and headers do not
    if len(tds) == 4:
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
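If you want the data as an actual array rather than printed output, the same loop can build a dict instead (just a sketch, reusing the variables above):

# Sketch: collect the pairs into a mapping instead of printing them;
# assumes the same four-cell data rows as the loop above
zones = {}
for row in rows:
    tds = row.findAll('td')
    if len(tds) == 4 and tds[1].string and tds[2].string:
        zones[tds[1].string.strip()] = tds[2].string.strip()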

Comments and suggestions for improvement to my python code welcome, too ;)

Zack Burt

  • BeautifulSoup (or another parser). It will mostly be trivial, except for those irritating adverts in the middle of the table. – Thomas K Feb 04 '11 at 00:23
  • Mandatory link due to "html-parsing" and "regex" tags both being present: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Lasse V. Karlsen Feb 04 '11 at 00:36

5 Answers

6

For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).

A quick example for your specific case:

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]

This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way (and by the way: lxml is fast!).
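For instance, a sketch that drops those rows, assuming (as the question's own script does) that real data rows have exactly four cells:

# Assumption: data rows have exactly four cells; ad rows do not
rows = [cells for cells in data if len(cells) == 4]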

BUT: More specifically for your particular use case: there are better ways to get at time zone database information than scraping that particular web page (aside: note that the page actually states that you are not allowed to copy its contents). There are even existing libraries that already use this information; see for example python-dateutil.
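For example, a minimal sketch using python-dateutil's tz module (the zone name here is just an illustration):

from datetime import datetime
from dateutil import tz

# Look up a zone from a real time zone database instead of scraping
amsterdam = tz.gettz('Europe/Amsterdam')
print datetime.now(amsterdam)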

Steven
4

Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure...

A few other alternatives exist as well; all of these are reasonably tolerant of poorly formed HTML.

ocodo
0

I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile, which is bundled with PHP, and then using XPath to grab the data you need.

This is not the fastest way, but in my opinion the most readable in the end. You could use regexes, which would probably be a little faster, but that would be bad style (hard to debug, hard to read).

EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). In particular, tags with no opening/closing counterpart get heavily in the way.

It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath queries and XSLT scripts, which can help you extract your data with a simple shell script.

yankee
  • The problem with XML parsers is that HTML isn't a subset of XML, and unless the document is well-formed per XML rules (or the XML parser is broken) it will not work correctly. For instance, `<br>` in HTML doesn't even require a closing tag of any sort. Also, something as simple as `&nbsp;` is not valid XML. If the parser (DOMDocument?) is really an HTML parser then it should be called as such and not confused with an XML parser :-) –  Feb 04 '11 at 00:32
  • @pst: True, that's why it has two different methods, "loadFile()" and "loadHTMLFile()". PHP's DOM parser is able to cope with the normal abnormalities. But in this case, as mentioned, it won't do, because the page is not even valid HTML. – yankee Feb 04 '11 at 00:36
  • I've not tried it on this specific page, but BeautifulSoup is specifically written with the aim of handling invalid HTML, simply because you inevitably come across it so often. – Thomas K Feb 04 '11 at 00:57
  • Then "I suggest loading the document with an *HTML* parser..." :) –  Feb 04 '11 at 06:50
0

While we were building SerpAPI, we tested many platforms and parsers.

Here is the benchmark result for Python.

[image: Python parser benchmark results]

For more, here is a full article on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd

jvmvik
-2

The efficiency of a regex is superior to that of a DOM parser.

Look at this comparison:

http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team

You can find many more comparisons by searching the web.
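For illustration, a minimal sketch of the regex approach in Python (the pattern is an assumption, and it inherits all the fragility the other answers warn about):

import re

# Naive cell extraction; breaks on nested tables, odd attributes, comments, etc.
cell_re = re.compile(r'<td[^>]*>(.*?)</td>', re.IGNORECASE | re.DOTALL)
cells = cell_re.findall(html)  # 'html' as fetched in the question's script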

ocodo