2

I have a problem with my regular expression. I need to match blocks of HTML.

Example block:

<tr class="tr-list " data-id="XX">
    <td class="ip-img"><div class="gun-icon"></div><img src="https://example.com/images/stories/HCP/HCP_5.jpg"/></td>
    <td class="ip-name ip-sort">Hotel Complex Project</td>
    <td class="ip-price ip-sort">297.00</td>
    <td class="ip-earnings ip-sort">43</td>
    <td class="ip-shares ip-sort">86</td>
    <td class="ip-status {'sorter':'currency'}"><img
            src="/img/assets/arrow1.png" title="0.989990234375"/></td>
    <td class="ip-blank-right"></td>
</tr>

Each of these HTML blocks should be a separate match, from which I then want to extract the other data (e.g. ip-name, ip-price, ip-earnings, ...).

But my current regex matches everything up to the point where the lookahead ("(?=)") part is no longer true: http://regexhero.net/tester/?id=2b491d15-ee83-4dc7-8fe9-62e624945dcf

What do I need to change so that every block is a separate match?

Greetings! :)

PS.: Hope it is understandable what I mean...

reijin
  • 3
    [Parsing HTML with regex?](http://stackoverflow.com/a/1732454/1493698) – Antony Mar 24 '13 at 19:11
  • 2
    ach, come on... really? This app will only read some content off a website - nothing more, nothing less. – reijin Mar 24 '13 at 19:21
  • 1
    @reijin Why not - it's still just as easy and less painful when your regex breaks to use an HTML parser to start with... – Jon Clements Mar 24 '13 at 19:27
  • @JonClements I understand the problem, it's just that I'm less familiar with HTML parsers... But I will definitely check that! – reijin Mar 24 '13 at 19:33

3 Answers

5

This should get all the tr rows:

<tr class="tr-list[\s\S]+?</tr>
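The `+?` is what keeps each row separate: a lazy quantifier stops at the first `</tr>` it can reach, while a greedy `+` runs on to the last one. A minimal Python sketch of the difference follows; the two-row sample markup and the "Second Project" name are made up for illustration.

```python
import re

# Hypothetical sample input: two rows in the same shape as the question's block.
html = '''<tr class="tr-list " data-id="1">
    <td class="ip-name ip-sort">Hotel Complex Project</td>
</tr>
<tr class="tr-list " data-id="2">
    <td class="ip-name ip-sort">Second Project</td>
</tr>'''

# Lazy quantifier: stop at the first </tr> after each opening tag.
lazy = re.compile(r'<tr class="tr-list[\s\S]+?</tr>')
# Greedy quantifier: run on to the last </tr> in the input.
greedy = re.compile(r'<tr class="tr-list[\s\S]+</tr>')

lazy_rows = lazy.findall(html)
greedy_rows = greedy.findall(html)

print(len(lazy_rows))    # one match per row
print(len(greedy_rows))  # a single match spanning both rows
```

`[\s\S]` is used instead of `.` so the pattern crosses line breaks without needing the `re.DOTALL` flag.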

This should get all the tr rows with matching groups for the columns:

<tr class="tr-list[^<]*?<td class="ip-img">(.*?)</td>\s*<td class="ip-name.*?">(.*?)</td>\s*<td class="ip-price.*?">(.*?)</td>\s*<td class="ip-earnings.*?">(.*?)</td>\s*<td class="ip-shares.*?">(.*?)</td>\s*<td class="ip-status.*?">([\s\S]*?)</td>[\s\S]+?</tr>
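For use from Python, the same pattern can be passed to the `re` module unchanged. The sketch below splits it across adjacent raw strings for readability only and runs it on the block from the question; the `ROW` and `rows` names and the dictionary layout are illustrative choices, not part of the original pattern.

```python
import re

# The example block from the question.
html = '''<tr class="tr-list " data-id="XX">
    <td class="ip-img"><div class="gun-icon"></div><img src="https://example.com/images/stories/HCP/HCP_5.jpg"/></td>
    <td class="ip-name ip-sort">Hotel Complex Project</td>
    <td class="ip-price ip-sort">297.00</td>
    <td class="ip-earnings ip-sort">43</td>
    <td class="ip-shares ip-sort">86</td>
    <td class="ip-status {'sorter':'currency'}"><img
            src="/img/assets/arrow1.png" title="0.989990234375"/></td>
    <td class="ip-blank-right"></td>
</tr>'''

# Same pattern as above; adjacent raw strings are concatenated by Python.
ROW = re.compile(
    r'<tr class="tr-list[^<]*?'
    r'<td class="ip-img">(.*?)</td>\s*'
    r'<td class="ip-name.*?">(.*?)</td>\s*'
    r'<td class="ip-price.*?">(.*?)</td>\s*'
    r'<td class="ip-earnings.*?">(.*?)</td>\s*'
    r'<td class="ip-shares.*?">(.*?)</td>\s*'
    r'<td class="ip-status.*?">([\s\S]*?)</td>'
    r'[\s\S]+?</tr>'
)

rows = []
for match in ROW.finditer(html):
    # Groups arrive in column order: img, name, price, earnings, shares, status.
    img, name, price, earnings, shares, status = match.groups()
    rows.append({
        "name": name,
        "price": float(price),
        "earnings": int(earnings),
        "shares": int(shares),
    })

print(rows)
```

`finditer` yields one match object per row, so the loop scales to a page with many rows. Note that the pattern depends on the exact column order and attribute layout, so it will stop matching if the markup changes.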
Kevin Collins
  • thanks, works perfectly! At least in the online tester... Guess I'll need to convert it to work with python. – reijin Mar 24 '13 at 19:30
0

Nested HTML requires nested arrays of matches from the regular expression. This can be done with jQuery, or by building a tree manually with regular expressions.
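As a rough illustration of tracking nesting without hand-built match arrays, here is a sketch that uses Python's standard-library `html.parser` in place of jQuery. The `RowParser` class, its field naming (first CSS class of each cell), and the shortened sample markup are all assumptions made for this example.

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collects the text of each <td> inside <tr class="tr-list"> rows."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, one dict per <tr>
        self._row = None     # row currently being built, if any
        self._field = None   # key of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "tr-list" in classes:
            self._row = {}
        elif tag == "td" and self._row is not None and classes:
            # Use the first class name (e.g. "ip-name") as the field key.
            self._field = classes[0]
            self._row[self._field] = ""

    def handle_data(self, data):
        if self._row is not None and self._field is not None:
            self._row[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Shortened, hypothetical version of the question's block.
html = '''<tr class="tr-list " data-id="XX">
    <td class="ip-name ip-sort">Hotel Complex Project</td>
    <td class="ip-price ip-sort">297.00</td>
    <td class="ip-blank-right"></td>
</tr>'''

parser = RowParser()
parser.feed(html)
parser.close()
rows = parser.rows
print(rows)
```

The parser reports start tags, text, and end tags as events, so elements nested inside a cell (such as the `<div>` and `<img>` in the original block) are simply ignored here rather than breaking the match.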

Saad Ahmed
0

This regular expression captures a whole HTML block that is not self-closing:

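// Sample input; note that the closing tag does not match the opening <div>.
var htmlText = "<div bar='baz'>foo</foo>";
var pattern = /<([\w]+)( (( +)?[\w]+=['"](\w+)?['"])?)+( )?(\/)?>((([\t\n\r\s]+)?)+(((.)+)?)+((\10)?)+)+?<\/(\1)>/igm;
// The trailing \1 requires the closing tag name to equal the opening tag name.
console.log((pattern.test(htmlText) ? 'valid' : 'invalid') + ' html block');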
Alan R. Soares