
I'm a beginner with regular expressions, so I'm having trouble with this.

Given the string below, how can I write a regular expression that just matches "69144"? Some surrounding text would also be fine, so long as I can narrow this down.

Citations</a></td><td class="cit-borderleft cit-data">69144</td><td class="cit-borderleft
cit data">22047</td></tr><tr class="cit-borderbottom"><td class="cit-caption"><a href="#"
class="cit-dark-link" onclick="return citToggleIndexDef('h_index_definition')" title='
h-index is the largest number h such that h publications have at least h citations. 
The second column has the &quot;recent&quot; version of this metric which is the largest 
number h such that h publications have at least h new citations in the last 5 years.
 '>h-index</a></td><td class="cit-borderleft cit-data">88</td>

I apologize for the string being extremely hard to read.

  • Use an HTML parser such as JSoup – Reimeus Jul 28 '13 at 23:28
  • You may be able to get away with just matching ``. Have you tried anything at all yet? – paddy Jul 28 '13 at 23:29
  • Related: http://stackoverflow.com/q/1732348/1065197 – Luiggi Mendoza Jul 28 '13 at 23:33
  • So, I came up with: ([0-9]+) and I'm trying to extract what you suggested, @paddy, along with the numbers that precede it. However, I tried the expression on regexpal.com against view-source:http://scholar.google.ca/citations?user=JicYPdAAAAAJ&hl=en&oi=ao and it's not really working properly. Is there something wrong with my expression? – user2608895 Jul 28 '13 at 23:49
  • Take a look at [JTidy](http://jtidy.sourceforge.net/). It can help you extract information from HTML. – araknoid Jul 29 '13 at 15:10
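
For reference, here is a rough sketch of the kind of regex the question asks about, using java.util.regex. It assumes the number always sits alone inside a cell whose class attribute is exactly "cit-borderleft cit-data", as in the snippet above; the class name CitationRegex and the helper firstCitationCount are made up for illustration. A bare ([0-9]+) matches every run of digits on the page, which is probably why it did not appear to work, so the sketch anchors the digits to the surrounding tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CitationRegex {
    // Assumption: the target cell looks exactly like <td class="cit-borderleft cit-data">DIGITS</td>.
    private static final Pattern CIT_DATA =
            Pattern.compile("<td class=\"cit-borderleft cit-data\">(\\d+)</td>");

    /** Returns the digits of the first matching cell, or null if no cell matches. */
    static String firstCitationCount(String html) {
        Matcher m = CIT_DATA.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "Citations</a></td><td class=\"cit-borderleft cit-data\">69144</td>"
                + "<td class=\"cit-borderleft cit-data\">22047</td>";
        System.out.println(firstCitationCount(html));
    }
}
```

This breaks as soon as the markup changes (extra attributes, different quoting, whitespace inside the cell), which is why the comments above point towards an HTML parser.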

2 Answers


Assuming you're trying to extract the number located in the first td cell, searching for the tag start and end and using substring to extract the contents is a much easier approach than a regular expression.

// text contains the HTML from your question

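// start of the first <td ...> opening tag
int tdIndex = text.indexOf("<td");
// the '>' that closes that opening tag
int endTdIndex = text.indexOf(">", tdIndex + 1);
// start of the matching </td> closing tag
int endTdTagIndex = text.indexOf("</td>", endTdIndex + 1);

// everything between the opening tag's '>' and the closing </td>
String numString = text.substring(endTdIndex + 1, endTdTagIndex);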

// numString now contains 69144

If you need the contents of a td cell from deeper into the HTML, then you can search for later td tags by using the following in a loop:

tdIndex = text.indexOf("<td",tdIndex+1);

You'll have to know which td tag you're after (e.g., "the third td") and know that there will always be the same number of td tags ahead of it, but given those two assumptions this code will work for you with minimal modification.
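
As a rough sketch of that loop, assuming you know the 1-based position of the cell you want: the class name TdExtractor and the method nthTdContents are hypothetical names, and the method returns null when the input has fewer cells than expected instead of throwing.

```java
class TdExtractor {
    /**
     * Returns the text between the n-th (1-based) <td ...> opening tag and the
     * next </td>, or null if the input contains fewer than n such cells.
     */
    static String nthTdContents(String text, int n) {
        int tdIndex = -1;
        // Advance to the n-th occurrence of "<td".
        for (int i = 0; i < n; i++) {
            tdIndex = text.indexOf("<td", tdIndex + 1);
            if (tdIndex < 0) {
                return null;
            }
        }
        // The '>' that closes the opening tag.
        int endTdIndex = text.indexOf(">", tdIndex + 1);
        if (endTdIndex < 0) {
            return null;
        }
        // The start of the closing </td> tag.
        int endTdTagIndex = text.indexOf("</td>", endTdIndex + 1);
        if (endTdTagIndex < 0) {
            return null;
        }
        return text.substring(endTdIndex + 1, endTdTagIndex);
    }
}
```

With the HTML from the question, nthTdContents(text, 1) would pick out the first cell; nested tables or a '>' inside an attribute value would confuse this approach.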

If you can't make assumptions about the format of the code, then I second Reimeus' answer that an HTML parser can prove quite useful.

Jeff

One way to parse the HTML is with XPath, which ships with the Java standard library (the javax.xml.xpath package). XPath traverses the "tree" of an XML/HTML document and returns the values of nodes (the content within the tags). The API is easy to learn and requires no additional downloads. More on this topic can be found in the New Think Tank XPath Tutorial.
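
A minimal sketch of what that could look like, with the caveat that the built-in XPath API needs well-formed XML. Real-world HTML often is not, so you may have to tidy the page first (for example with JTidy, mentioned in the comments). The class CitationXPath, the method firstCitData, and the XPath expression are assumptions based on the markup in the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class CitationXPath {
    /**
     * Returns the text of the first td whose class attribute is exactly
     * "cit-borderleft cit-data". The input must be well-formed XML.
     */
    static String firstCitData(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                    "(//td[@class='cit-borderleft cit-data'])[1]",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression or input that cannot be parsed as XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><tr><td class=\"cit-caption\">Citations</td>"
                + "<td class=\"cit-borderleft cit-data\">69144</td>"
                + "<td class=\"cit-borderleft cit-data\">22047</td></tr></table>";
        System.out.println(firstCitData(xml));
    }
}
```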

S0urce C0ded