0

I am interested in parsing the following table and others like it: http://www.cityofames.org/ftp/routes/Fall/wdreds&w.html

Any suggestions on the best tool for the job? After searching around I can't decide what I should use and would like to get some feedback before committing to something.

I am open to any languages/tools.

tgai
  • What format do you want to parse it into? – Petah Dec 14 '11 at 04:12
  • @Petah: I would want the columns separated into arrays of times, or something along those lines. – tgai Dec 14 '11 at 04:15
  • What kind of arrays: JSON, PHP, etc.? – Petah Dec 14 '11 at 04:17
  • @Petah: Well I was thinking about creating a new file locally possibly in a format such as CSV to be used elsewhere. So something that would facilitate that. Sorry to be so vague. – tgai Dec 14 '11 at 04:28

3 Answers

1

If you are looking for an HTML parser, there are a number of options in Java:

You might also want to go through a very comprehensive discussion on pros and cons of using each of these here.

Umer Hayat
1

With lynx I can do:

$ lynx -dump http://www.cityofames.org/ftp/routes/Fall/wdreds\&w.html
    6:25  6:31  6:36  6:41 -----  6:46  6:50      6:56
    7:02  7:08  7:14  7:20 -----  7:26  7:30      7:36
   ----- ----- ----- -----  7:38  7:43  7:47      7:53 1A
    7:28  7:35  7:42  7:48 -----  7:56  8:00      8:06
   ----- ----- ----- -----  7:58  8:03  8:07      8:13 1A
...

This output is very easy to parse with the scripting language of your choice. html2text may also work (I have never used it).

You could also play around with grep/sed to format it.
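As a rough sketch of the "scripting language of choice" step, here is one way the lynx dump could be turned into CSV in Python. The function names are hypothetical, and the approach assumes every column in a row holds either a time or a dashes placeholder, so that splitting on whitespace keeps the columns aligned; a row with a truly empty cell would need position-based splitting instead.

```python
import csv
import re
import sys

TIME = re.compile(r"^\d{1,2}:\d{2}$")   # a time such as 6:25
BLANK = re.compile(r"^-+$")             # a placeholder such as -----


def parse_dump_line(line):
    """Return the cells of one timetable row, or None for any other line."""
    tokens = line.split()
    if not tokens:
        return None
    # Assumption: a timetable row starts with a time or a dashes placeholder.
    if not (TIME.match(tokens[0]) or BLANK.match(tokens[0])):
        return None
    if not any(TIME.match(t) for t in tokens):
        return None
    # Placeholders become empty cells; trailing notes such as "1A" are kept.
    return ["" if BLANK.match(t) else t for t in tokens]


def dump_to_csv(lines, out):
    """Write every timetable row found in `lines` to `out`; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        row = parse_dump_line(line)
        if row is not None:
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    dump_to_csv(sys.stdin, sys.stdout)
```

Saved under a name of your choosing (say `dump2csv.py`), it could be fed by piping the `lynx -dump` command above into it and redirecting the output to a `.csv` file.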

fifo
1

Real-world HTML is often not well-formed, which makes it hard for a strict parser to handle. First convert the page to a well-formed XML format such as XHTML (well-formed meaning every opening tag has a matching closing tag) using a program like tidy (http://tidy.sourceforge.net/). You can then parse the result with an XML/XHTML parser. Note that the times in this table are marked up with font-style tags, so you will have to pick out those tags and collect their contents into an array of times.

Here is what you can do when parsing:

start TR element
  -- create array
  start b element
    -- add first time
  end b element
  start b element
    -- add second time
  end b element
end TR element
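The event-driven outline above can be sketched with Python's standard `html.parser` module, which is lenient enough that a separate tidy pass may not be needed for simple pages. This is only an illustration: the class and function names are hypothetical, the sample markup is invented, and it assumes, as the outline does, that each time sits inside a `b` element within a `tr`.

```python
from html.parser import HTMLParser


class TimeTableParser(HTMLParser):
    """Collect the text of every <b> element, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of time strings
        self._row = None     # row being built, or None when outside a <tr>
        self._in_b = False   # True while inside a <b> element
        self._text = []      # text fragments of the current <b>

    def _close_row(self):
        # End TR element: store the array if it collected anything.
        if self._row:
            self.rows.append(self._row)
        self._row = None
        self._in_b = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TR> and <tr> look the same here.
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr> on the previous row
            self._row = []      # start TR element: create array
        elif tag == "b" and self._row is not None:
            self._in_b = True
            self._text = []

    def handle_data(self, data):
        if self._in_b:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "b" and self._in_b:
            # End b element: add one time to the current row.
            self._in_b = False
            value = "".join(self._text).strip()
            if value:
                self._row.append(value)
        elif tag == "tr":
            self._close_row()


def parse_rows(html):
    """Return a list of rows, each row a list of the times found in it."""
    parser = TimeTableParser()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a final row that was never closed
    return parser.rows


# Invented sample markup, only to show the shape of the result.
SAMPLE = """
<table>
<TR><td><b>6:25</b></td><td><b>6:31</b></td></TR>
<TR><td><b>7:02</b></td><td><b>7:08</b></td></TR>
</table>
"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

Each inner list corresponds to one `tr`, so the result can be handed straight to `csv.writer` if a CSV file is the goal.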
randominstanceOfLivingThing