I have gone through the post on why not to use regular expressions for HTML. As part of the task given to me, I have no choice but to use regular expressions for HTML.

I have some HTML and first tried the pieces separately, like this:

 <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

I have been able to get the 13 using the following regular expression:

<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>
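For reference, this is roughly how that check looks in Python. It is only a minimal sketch; the names `CELL`, `COUNT` and `extract_count` are made up for illustration.

```python
import re

# The count cell from above, whitespace included.
CELL = '''<td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>'''

# Same pattern as above; \s* absorbs the newlines and indentation.
COUNT = re.compile(
    r'<td class="a-nowrap">\s*<span class="a-letter-space"></span>'
    r'<span>(\d*)</span>\s*</td>'
)

def extract_count(html):
    """Return the captured digits as a string, or None if nothing matches."""
    m = COUNT.search(html)
    return m.group(1) if m else None

print(extract_count(CELL))  # 13
```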

and similarly from

<td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

got 5 star using the regular expression

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(.*)</a>\s*</td>

But when both pieces of HTML are combined, like this:

<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">

  <tr class="a-histogram-row">



        <td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

        <td class="a-span10">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>

        </td>

        <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

  </tr>
  <td class="a-nowrap">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>          

    </td>

    <td class="a-span10">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>

    </td>

    <td class="a-nowrap">

      <span class="a-letter-space"></span><span>2</span>

    </td>


</table>

How can I extract 5 star and 13 using a regular expression?

navyad
  • updated my answer with new shorter regex, which works for the modified input you have provided. – Tafari Nov 11 '13 at 14:58

1 Answer

If you don't want to use an HTML parser, use one regex after another, or add `.*` between the two patterns. I have modified your star regex a bit, as it didn't work properly.

First enable the dotall flag (s) and then use this:

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star).*<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

Output:

Group 1: 5 star

Group 2: 13
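In Python this could look roughly like the sketch below, assuming the `re` module and its `re.DOTALL` flag. The `ROW` sample is a single row condensed from the question's HTML, and the names `COMBINED` and `extract_pair` are only illustrative.

```python
import re

# One histogram row, condensed from the HTML in the question.
ROW = '''
<tr class="a-histogram-row">
  <td class="a-nowrap">
    <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
  </td>
  <td class="a-span10">
    <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
  </td>
  <td class="a-nowrap">
    <span class="a-letter-space"></span><span>13</span>
  </td>
</tr>
'''

# Star pattern, then ".*", then the count pattern.
# re.DOTALL lets "." match newlines, so ".*" can bridge the cells in between.
COMBINED = re.compile(
    r'<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star)'
    r'.*'
    r'<td class="a-nowrap">\s*<span class="a-letter-space"></span>'
    r'<span>(\d*)</span>\s*</td>',
    re.DOTALL,
)

def extract_pair(html):
    """Return (label, count) from the first match, or None."""
    m = COMBINED.search(html)
    return m.groups() if m else None

print(extract_pair(ROW))  # ('5 star', '13')
```

Keep in mind that `.*` is greedy: on input with several rows it would stretch to the last count cell and pair the first label with the last count.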

EDIT:

I have made a shorter regex:

REGEX:

>(\d star)<.+?>(\d+?)<

Used on pythonregex.com with the edited input you have provided, it gives:

OUTPUT:

>>> regex.findall(string)
[(u'5 star', u'13'), (u'1 star', u'2')]
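The same thing can be tried locally. Below is a minimal sketch for Python 3 (so the results print without the `u` prefix), assuming `re.DOTALL` is enabled as described above. `TABLE` is condensed from the edited input, and the names `SHORT` and `pairs` are just for illustration.

```python
import re

# Two rating rows, condensed from the edited HTML in the question.
TABLE = '''
<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
  <tr class="a-histogram-row">
    <td class="a-nowrap">
      <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>13</span>
    </td>
  </tr>
    <td class="a-nowrap">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>2</span>
    </td>
</table>
'''

# The lazy ".+?" stops at the first ">digits<" after each label,
# so every label is paired with the count from its own row.
SHORT = re.compile(r'>(\d star)<.+?>(\d+?)<', re.DOTALL)

pairs = SHORT.findall(TABLE)
print(pairs)  # [('5 star', '13'), ('1 star', '2')]
```

Because `.+?` is lazy, each match ends at the nearest count, which is why this version copes with more than one row.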
Tafari
  • Using the above expression, the result is like [('5 star', ''), ('', '13')], but I want something like [('5 star', '13')]. The '|' (or) in the expression is causing this trouble. Any idea on that? – navyad Nov 09 '13 at 11:34
  • @naveenyadav that's strange as I use the patterns you have provided, just added **OR** between them, so the pattern will catch either ** 5 stars** and/or *13*. Do these patterns work for you when you use them separately? – Tafari Nov 09 '13 at 11:38
  • @naveenyadav well so you almost get what you want : ) ok so let me think a bit. – Tafari Nov 09 '13 at 11:41
  • @naveenyadav well you get that output as it matches both cases, but you have both results you wanted, so you could use them as you wished right? Unfortunately I'm not able to check how does this regex work properly as I have never used regex for HTML : ( – Tafari Nov 09 '13 at 11:52
  • @naveenyadav I would help further but I am unable to check the results. I would suggest you to maybe add `.*` instead of `|` it might help. – Tafari Nov 09 '13 at 12:19
  • @naveenyadav I have modified my answer, in the site (pythonregex.com) you have provided it gives following output: [(u'5 star', u'13')]. (remember to enable dot all flag) – Tafari Nov 10 '13 at 11:08
  • Yes, I checked and it was working, thanks. But when I added one more entry to the HTML snippet above, it fetches the values from only one. I have changed the HTML snippet. – navyad Nov 11 '13 at 11:52
  • The code is working fine. I appreciate your effort to help me out. Thanks. – navyad Nov 11 '13 at 15:18