1

I have the following table cell:

<td class="text-right"
                onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
                onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




                2.004





            </td>

The cell also contains spaces and line breaks. The class="text-right" isn't unique on the page, but this cell is the first one that has it, in case that helps to locate it.

I want to match only the number (here 2.004, but it could be any other; there is always exactly one number per cell), with or without a point and/or comma in it.

PS: Yes, I fully agree that parsing HTML with a regex is not the best idea, but any other method would add so much overhead that it would not be worth doing :(

PPS: Guys and girls, please write your recommendations as answers, not as comments, so I can accept and reward them.

Solution: (?:<td\b.*?text-right\b.*?\D*?;">)([\s\S\d]*?)(?=\D*?<\/)

Edit: full length HTML:

<div class="box    " >

        <div class="box-head    " >
            <div class="box-icon">
            <span class="icon ">&#xf0ae;</span>        </div>
        <span class="divider"></span>

                    <div class="box-title box-title-space-1">
            <span>Keyword-Profile</span></div>

    <div class="box-options dropdown  box-options-no-divider">


            <div class="divider "></div>
        <div class="box-icon "><a
                    class="button">
                <span class="icon ">&#xf013;</span>            </a></div>

        <ul class="dropdown-menu">


                                <li
                                    >                        <a   onclick="" class="modal"><div><div class="icon"><div>&#xf055;</div></div><div class="text"> Add to Dashboard</div></div></a>
                                    </li>

                                <li
                                    ><span class="box-menu-seperator"></span>                        <a   onclick="
                                                                                " href="" class="modal"><div><div class="icon"><div>&#xf055;</div></div><div class="text"> Add to Report</div></div></a>
                                    </li>

        </ul>

</div>

</div>
<div class="module-loading-blocker">
    <div class="module-loading-blocker-icon">
        <div style="width: 40px; height: 40px; display: inline-block;">
    <svg width="100%" height="100%" class="loading-circular" viewBox="0 0 50 50">
        <circle class="loading-path" cx="25" cy="25" r="20" fill="none" stroke-width="5" stroke-miterlimit="10"/>
    </svg>
</div>    </div>
</div>
    <div class="box-content box-body box-table" >    <table class="table table-spaced">
            <tr>
            <td>




                            Top-10





            </td>

            <td class="text-right"
                onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
                onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




                2.004





            </td>
        </tr>
            <tr>
            <td>




                            Top-100





            </td>

            <td class="text-right"
                onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
                onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




                237.557





            </td>
        </tr>
            <tr>
            <td>




                            &empty; Position





            </td>

            <td class="text-right"
                onmouseenter="$(this).find('.overlay-viewable-box:first').show();"
                onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




                60





            </td>
        </tr>
        </table>
</div></div><div class="module" style="display: none;">x</div>
Evgeniy
  • `(?<=>)[0-9]+(\.[0-9]+)?(?=<)` – iBug Jan 30 '18 at 11:47
  • What overhead? How many values are to be extracted, and from how many lines of HTML? Extracting values from markup with regex alone is the most error-prone solution. If you're using JS, at least select the text nodes and then match them with `\d+(\.\d+)?` – lher Jan 30 '18 at 11:58

3 Answers

1

Update (JavaScript RegExp)


To get the number within <td>

Ignoring the fact that the code will not function, here is a regex that gets the number in the first td.text-right only:

/(?:<td\b.*?text-right\b.*?\D*?)([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/)/

|1|]=-------------------------------------=[|2|]=-----------------------=[|3|]=------------=|]

  1. Non-capturing group (?: ... ): the literal <td followed by a word boundary \b, then zero or more of any character, as few as possible (.*?), then the literal text-right followed by a word boundary \b, then .*? again, and finally zero or more non-digit characters, as few as possible (\D*?).

  2. Capturing group ( ... ): one or more digits [0-9]+?, then zero or more literal . or , characters [.,]*?, then zero or more digits [0-9]*?, each matched lazily.

  3. Positive lookahead (?= ... ): zero or more non-digit characters \D*?, followed by the literal </ (written <\/ with an escaped forward slash).
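The three parts above can be tried out on the single cell from the question. This is only a minimal sketch: the names cellHtml, firstCellNumber, firstHit and firstNumber are made up for illustration, and the cell markup is shortened by a few blank lines.

```javascript
// Sample markup copied from the question (one table cell spread over several lines).
const cellHtml = [
  '<td class="text-right"',
  '    onmouseenter="$(this).find(\'.overlay-viewable-box:first\').show();"',
  '    onmouseleave="$(this).find(\'.overlay-viewable-box:first\').hide();">',
  '',
  '    2.004',
  '',
  '</td>'
].join('\n');

// The pattern from this answer, unchanged.
const firstCellNumber = /(?:<td\b.*?text-right\b.*?\D*?)([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/)/;

// exec() returns null when nothing matches; group 1 holds the number.
const firstHit = firstCellNumber.exec(cellHtml);
const firstNumber = firstHit ? firstHit[1] : null;
console.log(firstNumber); // 2.004
```

Note that the full match (index 0) also contains everything from <td up to the number, because a non-capturing group still consumes text; only capture group 1 is the bare number.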


Better Regex

This one relies on the fact that each target is in the last column of its row, by adding <\/td>\s*?<\/tr> to a positive lookahead.

/\b([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/td>\s*?<\/tr>)/g;

It gives a cleaner result: the full match and the capture group are the same, and there is no non-capturing group with side effects.

Demo

var rgx = /\b([0-9]+?[.,]*?[0-9]*?)(?=\D*?<\/td>\s*?<\/tr>)/g;

var str = document.documentElement.innerHTML;

let hits;

while ((hits = rgx.exec(str)) !== null) {

    if (hits.index === rgx.lastIndex) {
        rgx.lastIndex++;
    }
    
    hits.forEach(function(hit, idx) {
        console.log(`Found match, group ${idx}: ${hit}`);
    });
}
<div class="box    ">

  <div class="box-head    ">
    <div class="box-icon">
      <span class="icon ">&#xf0ae;</span> </div>
    <span class="divider"></span>

    <div class="box-title box-title-space-1">
      <span>Keyword-Profile</span></div>

    <div class="box-options dropdown  box-options-no-divider">


      <div class="divider "></div>
      <div class="box-icon ">
        <a class="button">
          <span class="icon ">&#xf013;</span> </a>
      </div>

      <ul class="dropdown-menu">


        <li>
          <a onclick="" class="modal">
            <div>
              <div class="icon">
                <div>&#xf055;</div>
              </div>
              <div class="text"> Add to Dashboard</div>
            </div>
          </a>
        </li>

        <li><span class="box-menu-seperator"></span>
          <a onclick="
                                                                                " href="" class="modal">
            <div>
              <div class="icon">
                <div>&#xf055;</div>
              </div>
              <div class="text"> Add to Report</div>
            </div>
          </a>
        </li>

      </ul>

    </div>

  </div>
  <div class="module-loading-blocker">
    <div class="module-loading-blocker-icon">
      <div style="width: 40px; height: 40px; display: inline-block;">
        <svg width="100%" height="100%" class="loading-circular" viewBox="0 0 50 50">
        <circle class="loading-path" cx="25" cy="25" r="20" fill="none" stroke-width="5" stroke-miterlimit="10"/>
    </svg>
      </div>
    </div>
  </div>
  <div class="box-content box-body box-table">
    <table class="table table-spaced">
      <tr>
        <td>




          Top-10





        </td>

        <td class="text-right" onmouseenter="$(this).find('.overlay-viewable-box:first').show();" onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




          2.004





        </td>
      </tr>
      <tr>
        <td>




          Top-100





        </td>

        <td class="text-right" onmouseenter="$(this).find('.overlay-viewable-box:first').show();" onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




          237.557





        </td>
      </tr>
      <tr>
        <td>




          &empty; Position





        </td>

        <td class="text-right" onmouseenter="$(this).find('.overlay-viewable-box:first').show();" onmouseleave="$(this).find('.overlay-viewable-box:first').hide();">




          60





        </td>
      </tr>
    </table>
  </div>
</div>
<div class="module" style="display: none;">x</div>
zer00ne
  • Added full source code. It is maybe the worst of bad ideas, but my matching needs are so primitive that until now, with the community's help, I have always managed to match every value I needed :) Don't be so strict: I know there may be many more elegant ways to achieve the goal, but this way is the most efficient one. – Evgeniy Jan 30 '18 at 14:03
  • I'm not the author of this code; I just want to get the value from it. But I'll gladly notify the author of the code about your bold thoughts, even if they are sadly of no value in solving the described issue. – Evgeniy Jan 30 '18 at 14:52
  • See bottom of this answer – zer00ne Jan 30 '18 at 15:50
  • Your regex is _nearly_ working: it filters out commas and points, so from 2,3 it gets 23, and from 2.3 it gets 23 too. This regex is _nearly_ working too, `(?<=>\s*)([0-9]+)\,+([0-9]+)(?=\s*<)`, but it replaces the comma with a pipe: from 2,3 it gets 2|3. – Evgeniy Jan 30 '18 at 15:57
  • Your regex works as expected only when the digit before the comma is `0`, i.e. for the values `0,1-0,9`. If the digit before the comma isn't `0`, the comma is filtered out, as I said. – Evgeniy Jan 30 '18 at 16:08
  • Got it finally, based on your regex: `(?:)([\s\S\d]*?)(?=\D*?<\/)` – Evgeniy Jan 30 '18 at 17:36
  • @ChillyBang I updated with a cleaner regex based on column position instead of the `class="text-right"`. Added a working **Demo** as well. – zer00ne Jan 30 '18 at 20:27
  • Are you sure it tests out OK? In `([\s\S\d]*?)`, anything within brackets `[...]` is a character class, which stands for a single character at that position. The shorthand escapes keep their meaning inside a class, so `[\s\S\d]` matches any whitespace, any non-whitespace, or any digit, which together is any character at all, line breaks included (the equivalent of `.` in DOTALL mode). So `[\s\S\d]*?` means: **zero or more of any char until...**, which makes the `\d` a moot point, and the capture group picks up the leading whitespace along with the number. Of course it could work differently for you if you are using PCRE, since my knowledge is limited to JavaScript. – zer00ne Jan 30 '18 at 20:43
0

A simple solution, provided that your regex engine can search across lines and supports lookarounds, including a variable-length lookbehind (the \s* sits inside the lookbehind):

(?<=>\s*)([0-9]+(?:\.[0-9]+)?)(?=\s*<)

Explained:

The first part is (?<=>\s*). (?<=regex) is called a positive lookbehind, which tells the engine to check that a pattern matching regex exists immediately before the actual match. In this case it looks for a > followed by any amount of whitespace.

The core part, [0-9]+(?:\.[0-9]+)?, matches one or more digits, optionally followed by a dot and another run of one or more digits. The trailing ? makes the decimal part optional.

The last part is (?=\s*<). (?=regex) is called a positive lookahead, which tells the engine to check that a pattern matching regex exists immediately after the actual match. In this case it looks for any amount of whitespace followed by a <.
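To see the lookbehind and lookahead working together, here is a small sketch in JavaScript, which accepts a variable-length lookbehind in ES2018-capable engines. The names tableHtml, numberBetweenTags and lookaroundMatches are hypothetical, and the markup is a shortened version of the table from the question.

```javascript
// Two rows reduced from the question's table; whitespace and line breaks kept.
const tableHtml =
  '<tr><td>\n  Top-10\n</td><td class="text-right">\n\n  2.004\n\n</td></tr>' +
  '<tr><td>\n  Top-100\n</td><td class="text-right">\n  237.557\n</td></tr>';

// Same pattern as above, with the g flag so that every cell is visited.
const numberBetweenTags = /(?<=>\s*)([0-9]+(?:\.[0-9]+)?)(?=\s*<)/g;

// With the g flag, match() returns the full matches; lookarounds consume
// nothing, so each full match is just the number itself.
const lookaroundMatches = tableHtml.match(numberBetweenTags) || [];
console.log(lookaroundMatches); // [ '2.004', '237.557' ]
```

The labels Top-10 and Top-100 are skipped because the text between the preceding > and their digits is not pure whitespace.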

iBug
0

Assuming your regex engine understands PCRE, try

/>[\s]*([[:digit:]]+(\.[[:digit:]]+)?)[\s]*<\//g

to match a number, optionally surrounded by whitespace (including newline/linefeed characters), that is the sole textual content of an HTML element. Capture group 1 holds the number.

You may need to adjust the pattern inside the capture group to cater for the notations you'd consider a 'number'.

Drop the start and the end of the expression (i.e. > and <\/) if the assumed structural HTML context is too restrictive for your purposes. As your question shows, you are aware that doing so increases the risk of false positives.
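As an illustration of adjusting the capture group, the sketch below spells the pattern for JavaScript, which has no POSIX classes such as [[:digit:]], and accepts either a point or a comma as the separator. The sample markup and the names mixedHtml, soleNumber and soleNumbers are assumptions made for this example only.

```javascript
// Hypothetical fragment: one cell uses a point, one a comma, one neither,
// and the last cell holds a label that must not be matched.
const mixedHtml =
  '<td>\n  2.004\n</td><td>\n  0,75\n</td><td>\n  60\n</td><td>Top-10</td>';

// JavaScript spelling of the pattern above: \d instead of [[:digit:]],
// and [.,] so that either separator is accepted.
const soleNumber = />\s*(\d+(?:[.,]\d+)?)\s*<\//g;

// matchAll() yields one match object per hit; index 1 is capture group 1.
const soleNumbers = Array.from(mixedHtml.matchAll(soleNumber), m => m[1]);
console.log(soleNumbers); // [ '2.004', '0,75', '60' ]
```

The captured values are still strings; whether 2.004 means two thousand and four or a decimal fraction depends on the page's locale, so any conversion to a number is left to the caller.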

See it live at Regex101

By the way, there are HTML parser libraries for most programming languages that tolerate syntax errors and offer simple interfaces for iterating over all textual content. Just for the sake of argument, if jQuery or some similar functionality is available, you may proceed along the lines of this SO answer: just replace the inner return expression with a regex test, like this (untested code):

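// JavaScript has no POSIX classes such as [[:digit:]], so \d is used instead.
// No 'g' flag: test() on a global regex keeps state between calls.
var re = /\d+(\.\d+)?/;
$.fn.findByREText = function (re) {
    // Keep only the text nodes below the selection whose trimmed text matches re
    return this.find('*').addBack().contents().filter(function () {
        return this.nodeType === 3 && re.test($(this).text().trim());
    });
};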
collapsar