How can I scrape/parse this data using regular expressions?

Question

I'm a beginner when it comes to regular expressions, and I'm not sure where to start. I have some HTML scraped from a web page and stored in a variable, and it looks something like this:

<thead><tr>
<th></th>
<th>GENERAL INFORMATION</th>
<th></th>
<th>DETAILED DATA</th>
</tr></thead>
<tbody><tr>
<th>ID</th>
<td>123456789ABCD</td>
<th>Field1</th>
<td>6 = (Some-Specification (3 or more details))</td>

</tr></tbody>
<tbody><tr>
<th>AGL</th>
<td>1 - United States ; TH - Some Data</td>
<th>Field2</th>
<td>7 = (Option/Other Option)</td>
</tr></tbody>
<tbody><tr>
<th>MANUFACTURER</th>
<td>2010 SPECIFICATION  (ADSD: HMKC)</td>
<th>Field3</th>

<td>8 = (My Type)</td>
</tr></tbody>
<tbody><tr>
<th>MODEL</th>
<td>6X4 MY-MODEL/SOME_SPECS LONG SPECIFICATION, BLAH</td>
<th>Field4</th>
<td>9 = (STUFF/OTHER STUFF)</td>
</tr></tbody>
<tbody>

And then there is more of the same... I would like to parse the data from these cells into variables. (e.g. parse "123456789ABCD" into an ID variable) I'm working in ColdFusion and was thinking of using methods like REFindNoCase, REReplaceNoCase, SpanExcluding... Any idea how I can accomplish this? Or if you're not familiar with ColdFusion, even just the regular expressions necessary to parse this data would be very useful.

Don't use a regex for parsing HTML/XML content. Use a DOM/XML parser like Xerces. — bcosca, Nov 30 '10 at 08:09
Please see this rather popular answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Graham Clark, Nov 30 '10 at 08:10

peter.murray.rust · Answer 1 · 2010-11-30T11:17:07.113

7

Don't use Regex for HTML. It will destroy you.

If you are doing a lot of this you should get an HTML tool such as TagSoup which normalizes the HTML. If you are working with web pages from one site, then you can create an XSLT stylesheet (or a DOM tool using XPath) which extracts the cells you want.

An XPath for your cell (I have omitted the HTML namespace) could be

//tbody/tr[1]/td[1]

or you may wish to find rows by ID

//tbody/tr[th='ID']/td

[The HTML looks rather messy - it uses th and td in the same tr which is not idiomatic.]
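
As a rough illustration of the two XPath expressions above, here is a minimal Python sketch (not ColdFusion) using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. It assumes the markup is already well-formed; the wrapping <table> root and the names SAMPLE, first_cells and id_cells are made up for this example. Note that the lookup by label returns every td in the matching row, so the ID value is the first item.

```python
import xml.etree.ElementTree as ET

# Well-formed excerpt of the question's markup, wrapped in a <table> root
# (the wrapper is an assumption so the fragment parses as XML).
SAMPLE = """<table>
<tbody><tr>
<th>ID</th>
<td>123456789ABCD</td>
<th>Field1</th>
<td>6 = (Some-Specification (3 or more details))</td>
</tr></tbody>
<tbody><tr>
<th>AGL</th>
<td>1 - United States ; TH - Some Data</td>
<th>Field2</th>
<td>7 = (Option/Other Option)</td>
</tr></tbody>
</table>"""

root = ET.fromstring(SAMPLE)

# Positional lookup: first <td> of the first <tr> in each <tbody>.
first_cells = [td.text for td in root.findall(".//tbody/tr[1]/td[1]")]

# Lookup by label: every <td> in the row that has a <th> with text 'ID'.
id_cells = [td.text for td in root.findall(".//tbody/tr[th='ID']/td")]
record_id = id_cells[0]  # the cell that follows the ID header

print(first_cells)
print(record_id)
```

Real scraped pages are rarely well-formed, which is why a normalizing step such as TagSoup would normally come first.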

edited Nov 30 '10 at 11:17

answered Nov 30 '10 at 08:07

peter.murray.rust


We're not really doing a lot of this. This is a sort of temporary measure that needs to get in quickly. – froadie Nov 30 '10 at 08:08
@froadie - It will be a world of pain now and a bigger world of pain later to try to misuse regular expressions. – Chris Lutz Nov 30 '10 at 08:10
@froadie: Still, use a DOM library. You'll spend more time crafting regexes than you will using the DOM. – prodigitalson Nov 30 '10 at 08:11
@froadie using `re` for HTML/XML parsing may bring many bugs that are hard to detect, especially when you have lots of data. – khachik Nov 30 '10 at 08:11
Hmmm. I see what you're saying. But keep in mind that firstly, the HTML is a mess and I can't control it or change it, and secondly, this is basically the only block of code that I have to parse (it's part of an entire page but it's not much longer than this). Is there any way you can recommend to accomplish this without downloading any external libraries, and in the simplest/easiest way possible? – froadie Nov 30 '10 at 08:28
@froadie You absolutely MUST use external libraries. Others have done the hard and messy work. It's a minute or two to get TagSoup working and it will run anywhere. John Cowan knows more about this than me and you and Godzilla put together. Use the force. – peter.murray.rust Nov 30 '10 at 08:52
2

You are all right about not using regexes for HTML in general, but in certain cases it's just the quickest solution. What's the problem with this? I have used regexes hundreds of times to parse HTML in special cases and I almost always achieved what I needed! You can easily parse this HTML with regexes, although of course you cannot be sure that it will work for any HTML. – morja Nov 30 '10 at 10:20

score 1 · Answer 2 · answered Nov 30 '10 at 08:20

Use the CF XML parser, XmlParse. It looks like it's based on strict XML, though, so make sure you run the input through something like htmltidy first.
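
To show why the tidy step matters, here is a small Python sketch (standing in for ColdFusion's XmlParse, whose exact behavior is not shown here) using the standard library's strict XML parser. RAW mimics the question's fragment with its trailing unclosed tbody; CLEAN is a hypothetical result of a cleanup pass, and try_parse is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Fragment that ends with an unclosed <tbody>, as in the question.
RAW = ("<table><tbody><tr><th>ID</th><td>123456789ABCD</td></tr></tbody>"
       "<tbody></table>")

# Hypothetical cleaned-up version, standing in for what a tidy step might produce.
CLEAN = "<table><tbody><tr><th>ID</th><td>123456789ABCD</td></tr></tbody></table>"


def try_parse(markup):
    """Return the parsed root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


raw_root = try_parse(RAW)      # strict parser rejects the mismatched tag
clean_root = try_parse(CLEAN)  # well-formed input parses normally

print(raw_root is None)
print(clean_root.find(".//td").text)
```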

prodigitalson


score 1 · Accepted Answer · answered Nov 30 '10 at 10:35

I agree with the main opinion on this platform that parsing HTML with regexes is not the "golden path". But in some cases it is just the easiest way to go and it just does what it needs to do.

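This regex should do what you need:

<th>((?:(?!</th>).)*)</th>\s*<td>((?:(?!</td>).)*)</td>

Use capturing group 1 for the key and group 2 for the value. Each capturing group wraps the whole repeated match; a bare repeated group such as ((?!</th>).)* would keep only the last character it matched.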

I don't know ColdFusion, so I cannot tell you how to apply it.
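
For what it's worth, here is how the idea might look in Python rather than ColdFusion; treat it as a sketch to port, not a tested ColdFusion recipe. It uses a form of the pattern in which each capturing group wraps the whole repeated match, so groups 1 and 2 hold the full cell text. SAMPLE is an excerpt of the question's markup, and the names PAIR and fields are made up for this example.

```python
import re

# Excerpt of the markup from the question, including the header row
# and the stray blank lines between cells.
SAMPLE = """<thead><tr>
<th></th>
<th>GENERAL INFORMATION</th>
<th></th>
<th>DETAILED DATA</th>
</tr></thead>
<tbody><tr>
<th>ID</th>
<td>123456789ABCD</td>
<th>Field1</th>
<td>6 = (Some-Specification (3 or more details))</td>

</tr></tbody>
<tbody><tr>
<th>MANUFACTURER</th>
<td>2010 SPECIFICATION  (ADSD: HMKC)</td>
<th>Field3</th>

<td>8 = (My Type)</td>
</tr></tbody>"""

# A <th> cell immediately followed (ignoring whitespace) by a <td> cell.
# Group 1 is the key, group 2 is the value.
PAIR = re.compile(
    r"<th>((?:(?!</th>).)*)</th>\s*<td>((?:(?!</td>).)*)</td>",
    re.IGNORECASE,
)

# Header cells in <thead> are skipped because no <td> follows them.
fields = {key.strip(): value.strip() for key, value in PAIR.findall(SAMPLE)}

print(fields["ID"])
```

This only holds while each cell stays on one line and the tags carry no attributes; if the page changes shape, the pattern will silently stop matching.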
