How can I combine 3 regex patterns into 1 expression - if it is possible?
I want to get the value of the first th tag, the value of the first td tag, and the id from the a tag with a single regex. I've been struggling for an hour to get them all into one expression. What would be the solution?

 regex for th tag:  
 th[^>]+l">([^<]+)</th  
 regex for td tag:  
 td>([^<]+)</td  
 regex for a tag:
 <a((?!</a).)id="([^"]+)" 

I have a list of items like this snippet.

    ...
    <th scope="col">1X2</th>
    <th scope="col" class="goR">Odds</th>
    </tr></thead>
    <tbody>
    <tr class="row1">
    <td>Fortuna Köln</td>
    <td class="prc "><label><a id="MarketGroupListComponent25-selection-38225206.1" />
    ...
SzabK
  • What about a proper HTML parsing library, which gives you much simpler extraction options, like https://jsoup.org/cookbook/extracting-data/selector-syntax? Your regex breaks if anyone adds a comment to the HTML table, for example. – zapl Dec 03 '16 at 13:04
  • thanks for the suggestion, I will definitely check that out! – SzabK Dec 03 '16 at 13:25
  • @zapl your parser is screwed if the html isn't properly written, for example with a non-closed p tag. A parser might be the best solution here, but isn't always the best solution. – Eric Duminil Dec 03 '16 at 13:56
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Zlatin Zlatev Nov 13 '17 at 13:45

1 Answer

Here's a possible solution:

(?s)th[^>]+l">(.*?)<\/th>.*?<td>(.*?)<\/td>.*?<a id="(.*?)"

You need the (?s) modifier to make . match a newline. The 3 desired strings are in groups 1, 2 and 3.

You don't need any lookahead in this case.
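As a minimal sketch of how the combined expression might be used from Java (the class name CombinedRegexDemo and the helper extract are made up for illustration, and the input is the snippet from the question with the elided lines left out), something like this should work. Pattern.DOTALL is used in place of the inline (?s) flag; the two are equivalent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedRegexDemo {
    // Pattern.DOTALL has the same effect as the inline (?s) flag: '.' also matches newlines.
    // The \/ escapes from the answer are not required in Java, so plain / is used here.
    static final Pattern COMBINED = Pattern.compile(
            "th[^>]+l\">(.*?)</th>.*?<td>(.*?)</td>.*?<a id=\"(.*?)\"",
            Pattern.DOTALL);

    // Hypothetical helper: returns {th text, td text, a id} for the first match,
    // or null when the pattern does not match at all.
    static String[] extract(String html) {
        Matcher m = COMBINED.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // The snippet from the question; \u00f6 is the "ö" in "Köln".
        String html = String.join("\n",
                "<th scope=\"col\">1X2</th>",
                "<th scope=\"col\" class=\"goR\">Odds</th>",
                "</tr></thead>",
                "<tbody>",
                "<tr class=\"row1\">",
                "<td>Fortuna K\u00f6ln</td>",
                "<td class=\"prc \"><label><a id=\"MarketGroupListComponent25-selection-38225206.1\" />");

        String[] result = extract(html);
        if (result == null) {
            System.out.println("no match");
        } else {
            System.out.println("th: " + result[0]);
            System.out.println("td: " + result[1]);
            System.out.println("id: " + result[2]);
        }
    }
}
```

Because every group is lazy and find() stops at the first match, this picks up the first th, the first plain <td>, and the first <a id="..."> that follows them.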


Note:

This regex will fail in many unusual cases, e.g. an escaped \" in the id, or values that themselves contain th or td. If you know that the HTML is valid, you could use a Java HTML parser for a more complex query. A parser could also fail if the HTML isn't valid or if the HTML structure has changed.
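To make the fragility concrete, here is a small sketch with two made-up inputs (both are assumptions for illustration, not taken from the real page): one where an HTML comment contains a td, as raised in the comments on the question, and one where the a tag has another attribute before id. The class name RegexFragilityDemo and the helper firstTd are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragilityDemo {
    // Same combined pattern as in the answer, with DOTALL standing in for (?s).
    static final Pattern COMBINED = Pattern.compile(
            "th[^>]+l\">(.*?)</th>.*?<td>(.*?)</td>.*?<a id=\"(.*?)\"",
            Pattern.DOTALL);

    // Hypothetical helper: returns the captured td text (group 2), or null if there is no match.
    static String firstTd(String html) {
        Matcher m = COMBINED.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Case 1: a commented-out cell comes before the real one.
        String withComment = "<th scope=\"col\">1X2</th>"
                + "<!-- <td>old value</td> -->"
                + "<td>Fortuna</td><a id=\"x1\" />";
        // The regex cannot tell comments from markup, so it captures the commented-out text.
        System.out.println(firstTd(withComment));

        // Case 2: the id is no longer the first attribute of the a tag.
        String reordered = "<th scope=\"col\">1X2</th>"
                + "<td>Fortuna</td><a class=\"c\" id=\"x1\" />";
        // The literal <a id=" never occurs, so the whole expression fails to match.
        System.out.println(firstTd(reordered));
    }
}
```

In the first case the regex silently returns the wrong value; in the second it returns nothing at all. Whether that risk is acceptable depends on how stable the page markup is.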

Eric Duminil
  • Any comment for the downvote? The question asked for a regex, and I gave a regex. Depending on how variable the file format is, a regex could be a good idea, and a parser could be a better one. – Eric Duminil Dec 03 '16 at 13:44