Scraping HTML with Attributes

Question

<tr valign="middle" align="center"> 
<td><b>someNumbers</b></td>
<td width="22" height="22" background="..." class="SomeIntrestingClass">xxxxx</td>
<td width="22" height="22" background="..." class="SomeIntrestingClass">xgdsx</td> 
<td width="22" height="22" background="..." class="SomeIntrestingClass">xyzzx</td>
<td width="22">&nbsp;</td></tr>

I'm making an application that needs data from a website. I need to extract the value in 'someNumbers' and the values in the td cells, e.g. 'xyzzx'.
The problem I am having is that 'someNumbers' doesn't have a class, so I tried to use
doc.getElementsByAttributeValue(key, value)
but the same attribute values appear in other parts of the document. How can I extract these values using JSoup, or are there any other bright ideas? Thanks for any advice.

I can just select the td tag, but that returns about 1k results, of which I only use about 30%, and 'someNumbers' will be very hard to distinguish among them. But I'll try that. – wtsang02, Dec 22 '12 at 18:18

wtsang02 · Accepted Answer · 2012-12-22T19:26:55.827

0

Use Document.select(...). This method accepts CSS selectors such as td.class or tr td #id, so you can pick out elements the same way you would in a stylesheet. The Jsoup documentation has an article on the selector syntax.
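As a rough sketch of how this could look for the row in the question: the example below selects every row that contains a td with the class SomeIntrestingClass, reads the bold label from that row, and then collects the classed cells. The names RowScraper and extractRows are made up for illustration, and the input is assumed to be wrapped in a table element, since an HTML parser normally drops a bare tr that sits outside a table.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RowScraper {

    // Hypothetical helper: returns one list per matching row,
    // with the bold label first and the classed cell values after it.
    public static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();

        // Only rows that contain at least one td with the interesting class.
        for (Element row : doc.select("tr:has(td.SomeIntrestingClass)")) {
            List<String> values = new ArrayList<>();

            // The label has no class of its own, so reach it through its row.
            Element label = row.select("td > b").first();
            values.add(label != null ? label.text() : "");

            for (Element cell : row.select("td.SomeIntrestingClass")) {
                values.add(cell.text());
            }
            rows.add(values);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input, modelled on the row in the question.
        String html = "<table><tr valign=\"middle\" align=\"center\">"
                + "<td><b>someNumbers</b></td>"
                + "<td width=\"22\" class=\"SomeIntrestingClass\">xxxxx</td>"
                + "<td width=\"22\" class=\"SomeIntrestingClass\">xyzzx</td>"
                + "<td width=\"22\">&nbsp;</td></tr></table>";
        System.out.println(extractRows(html));
    }
}
```

Scoping the second and third select calls to the row keeps the label and its values together, which avoids having to tell 'someNumbers' apart from the other td elements in the document.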

edited Dec 22 '12 at 19:26

answered Dec 22 '12 at 19:00

wtsang02

score -1 · Answer 2 · answered Dec 22 '12 at 18:33

Use <td[^<]+?>*</[^<]+?> as the regular expression and store all the matches in an array.

Then strip the tags from each match by removing <td[^<]+?> and then </[^<]+?>.
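A minimal sketch of these steps with java.util.regex, using the three patterns exactly as written above. The names RegexCellExtractor and extractCells are hypothetical. With this pattern a cell that contains a nested tag, such as <td><b>someNumbers</b></td>, is not matched, because [^<]+? cannot cross the inner <b>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCellExtractor {

    // Pattern taken verbatim from the answer text.
    private static final Pattern CELL = Pattern.compile("<td[^<]+?>*</[^<]+?>");

    // Hypothetical helper: collects the text of every td matched by CELL.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        while (matcher.find()) {
            String cell = matcher.group()
                    .replaceFirst("<td[^<]+?>", "")   // drop the opening tag
                    .replaceFirst("</[^<]+?>", "");   // drop the closing tag
            cells.add(cell);
        }
        return cells;
    }

    public static void main(String[] args) {
        // Assumed sample input, modelled on the row in the question.
        String html = "<tr><td><b>someNumbers</b></td>\n"
                + "<td width=\"22\" class=\"SomeIntrestingClass\">xxxxx</td>\n"
                + "<td width=\"22\">&nbsp;</td></tr>";
        System.out.println(extractCells(html));
    }
}
```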

answered Dec 22 '12 at 18:33

Mike Demen

Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – wtsang02 Dec 22 '12 at 18:35
