
Disclaimer: I know that parsing HTML with regex is not the correct approach. I am actually just trying to parse text inside the HTML.

I am parsing several pages, and I am looking for prices. Here is what I have so far:

var all = document.body.querySelectorAll(":not(script)");
var regex = /\$[0-9,]+(\.[0-9]{2})?/g;

for (var i = 0; i < all.length; i++) {
    // Element nodes have a null nodeValue, so inspect the direct child nodes
    // instead; this effectively tests the text (and comment) nodes.
    for (var j = 0; j < all[i].childNodes.length; j++) {
        var node_value = all[i].childNodes[j].nodeValue;
        if (node_value !== null) {
            var matches = node_value.match(regex);
            if (matches !== null && matches.length > 0) {
                alert("that's a match");
            }
        }
    }
}

This particular code can get me prices like this:

<div>This is the current price: <span class="current">$60.00</span></div>

However, there are some prices that have the following structure:

<div>This is the current price: <sup>$</sup><span>80.00</span></div>

How could I improve the algorithm to find those prices as well? Should I look, in the first for loop, for <sup>symbol</sup><span>price</span> with a regex?

Important: Once there is a match, I need to find out which DOM element is holding that price, that is, the innermost element that holds the price. So for example:

<div><span>$80.00</span></div>

I would need to report the span as the element that is holding the price, not the div.

Hommer Smith
  • how about going with just the decimal separator and the two digits that follow? – Wim Ombelets Oct 30 '13 at 22:27
  • @Wim Ombelets the problem with doing so is that I can get false positives. Notice that some prices don't have any digits after the ".", so they might be just $80. If I use a regex that only looks for two digits, I will get lots of false positives. With the $ I ensure that it's a currency amount. – Hommer Smith Oct 30 '13 at 22:32
  • I created a fiddle for your approach here: http://jsfiddle.net/powtac/fWexh/ – powtac Oct 30 '13 at 22:39

2 Answers


Try this:

var text = document.body.textContent || document.body.innerText,
    regex = /\$\s*[0-9,]+(?:\s*\.\s*\d{2})?/g,
    match = text.match(regex);
if( match) {
    match = match[0].replace(/\s/g,"");
    alert("Match found: "+match);
}
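
The snippet above reports only the first match. As a minimal sketch, assuming every price in the text is wanted, the same pattern can be applied globally. `extractPrices` is a hypothetical helper name, not something from the question, and it takes a plain string so it can be tried outside a browser:

```javascript
// Hypothetical helper: returns every price-like substring in `text`,
// with the whitespace tolerated by the pattern stripped out.
function extractPrices(text) {
    var regex = /\$\s*[0-9,]+(?:\s*\.\s*\d{2})?/g;
    var found = (text || "").match(regex) || [];
    return found.map(function (m) {
        return m.replace(/\s/g, "");
    });
}

// Sample input with stray whitespace, a thousands separator and no decimals.
var sample = extractPrices("Now $ 80 . 00, was $1,299.99 or $60");
```

In a page this would be called as `extractPrices(document.body.textContent)`. Because the character class allows commas, a price followed directly by a comma (for example "$60, now") is captured with that trailing comma, so the result may need further trimming.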

Using a recursive search:

function findPrice(node) {
    node = node || document.body;
    var text = node.textContent || node.innerText || "",
        regex = /\$\s*[0-9,]+(?:\s*\.\s*\d{2})?/,
        match = text.match(regex);
    if (!match) return false;
    var children = node.children, l = children.length, i, found;
    for (i = 0; i < l; i++) {
        // return the narrowest container found inside this child,
        // not the child itself
        found = findPrice(children[i]);
        if (found) return found;
    }
    // if no children matched, then this is the narrowest container
    return node;
}
var result = findPrice();
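
The recursive search stops at the first price it finds. A rough sketch of a variant that collects every narrowest container follows. It is a counting heuristic, not part of the original answer: an element is kept when its text contains more prices than its children account for individually. `countPrices` and `findPriceNodes` are hypothetical names, and the nodes only need `textContent` and `children`, which DOM elements provide.

```javascript
// Same pattern as the recursive search, but global so matches can be counted.
var PRICE_RE = /\$\s*[0-9,]+(?:\s*\.\s*\d{2})?/g;

// Number of price-like substrings in a piece of text.
function countPrices(text) {
    var found = (text || "").match(PRICE_RE);
    return found ? found.length : 0;
}

// Collects every element that is the narrowest container of at least one price.
// Children are visited first, so inner elements appear before their parents.
function findPriceNodes(node, out) {
    out = out || [];
    var total = countPrices(node.textContent);
    if (total === 0) return out; // no price anywhere below this node

    var inChildren = 0;
    for (var i = 0; i < node.children.length; i++) {
        inChildren += countPrices(node.children[i].textContent);
        findPriceNodes(node.children[i], out);
    }
    // Prices that no single child contains belong to this node itself:
    // either they sit in its own text or they are split across children.
    if (total > inChildren) out.push(node);
    return out;
}
```

In a page, `findPriceNodes(document.body)` would return the list of elements. For a div holding "$130.90" as direct text plus a span holding "$30.0", both the span and the div are reported; for a `<sup>$</sup><span>80.00</span>` pair, only the shared parent is reported.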
Niet the Dark Absol
  • Niet, one thing here... I need to, once I find a match, save the element that contained that "price". So, know in which span/div/or any container it was found. By doing this strategy I would assume everything is text, right? – Hommer Smith Oct 30 '13 at 22:35
  • Very interesting: `document.body.textContent || document.body.innerText` – powtac Oct 30 '13 at 22:41
  • @Niet the Dark Absol: So, the problem with this approach is that if I get the text for all the elements (my outer FOR), I might get a match, but that price might be inside another HTML element, and I won't be able to know which element that is... – Hommer Smith Oct 30 '13 at 22:59
  • Hmm... Perhaps you could search for the price, then refine your search: go through the `children[]` array of the element. For each one, check the `textContent||innerText` to see if it matches. If it does, continue searching deeper. If it doesn't, then the current element is the closest container to the price. – Niet the Dark Absol Oct 30 '13 at 23:57
  • Niet the Dark Absol. The problem if I do this is...Imagine
    $89.99
    -- Once I am in the div, I would say that it matches, but then I would go deeper because the inner of the matches too. But I would actually want the match from before. You know?
    – Hommer Smith Oct 31 '13 at 00:05
  • Not if you match the full regex each time. The div would match, because it contains the full price string. However, the span does not match (no dollar sign) so the recursion would stop at the div. – Niet the Dark Absol Oct 31 '13 at 00:14
  • Niet the Dark Absol I have a question here... If I do a for, for all the elements (like I do in my initial question), and I have this
    $30
    . I would have
    and as elements to be looped. I would get the $30 for both elements. How could I avoid that? Could you edit your answer to help me with that?
    – Hommer Smith Oct 31 '13 at 00:27
  • In my recursive solution, you would match the `div`, then match the `span` and that would be your answer. – Niet the Dark Absol Oct 31 '13 at 10:02
  • Hi. I just noticed that there is a problem with this approach. If there is a container which looks like this:
    $130.90 $30.0
    - When parsing the div, it will find $130.90, but since there are children with prices too, we wouldn't get it. How could I work around this situation?
    – Hommer Smith Jan 30 '14 at 20:32

If you can choose your browser, you could use XPath to pre-select your candidates. The following code finds candidate nodes; I tried it in Firefox 25. For cross-browser approaches, you might also want to look at "What browsers support Xpath 2.0?" and http://www.yaldex.com/ajax-tutorial-4/BBL0029.html.

<html><head><script type="text/javascript">
function func() {
  // span containing digits, preceded by a superscript dollar sign
  var xpathExpr1 = "//span[translate(text(),'0123456789.,','')!=text()][preceding-sibling::sup[text()='$']]";
  // span containing digits and a dollar sign
  var xpathExpr2 = "//span[translate(text(),'0123456789.,','')!=text() and contains(text(),'$')]";
  var xpathExpr3 = xpathExpr1 + "|" + xpathExpr2; // union of both cases
  var contextNode = document.body;
  var resultType = XPathResult.UNORDERED_NODE_ITERATOR_TYPE;
  // no namespace prefixes are used, so the resolver can be null
  var xpathResult = document.evaluate(xpathExpr3, contextNode, null, resultType, null);
  var node;
  while ((node = xpathResult.iterateNext()) !== null) {
      alert(node.textContent);
  }
}
</script></head>
<body onload="func()"> aaa
<sup>$</sup><span>80.00</span> bbb
<span>$129</span> ccc
<sup>$</sup><span>ABC</span> ddd
</body></html>
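
Where XPath is not available, a similar candidate test can be sketched in plain JavaScript. This is not a translation of the expressions above but a stricter check, under two assumptions: the whole text of the element must be the amount, and the `<sup>$</sup>` must be the immediately preceding element sibling. `isPriceCandidate` is a hypothetical name, and the uppercase `tagName` comparison assumes an HTML document.

```javascript
// Returns true when `el` looks like one of the two price layouts in the test page.
function isPriceCandidate(el) {
    var text = (el.textContent || "").trim();
    // Layout 1: the element text itself is "$" followed by an amount.
    if (/^\$\s*[0-9][0-9,]*(\.[0-9]{2})?$/.test(text)) return true;
    // Layout 2: a bare amount whose previous element sibling is <sup>$</sup>.
    var prev = el.previousElementSibling;
    return /^[0-9][0-9,]*(\.[0-9]{2})?$/.test(text) &&
        !!prev &&
        prev.tagName === "SUP" &&
        (prev.textContent || "").trim() === "$";
}
```

It could be applied to each element returned by `document.querySelectorAll("span")`, keeping those for which it returns true.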
halfbit