Is there an easier way to scrape data based upon text

Question

I am trying to scrape data from the cells of a plain table; not all cells are required. The information is contained in the cells, as in the sample below:

<TD class=padded vAlign=top width="10%">
   <SPAN class=bold>Record No:</SPAN>
   <BR>40597
</TD>

In this example I am trying to extract the value for the field, which is 40597.

I have been able to use jQuery so far to find each td element like so:

function getHtmlDoc(data){
  var el = document.createElement('html');
  el.innerHTML = data;
  $.each($('.padded',el),function(index,item){
        if($(this).text().indexOf("Record No:")>=0){
          console.log(index + " " + $(this).text());
        }
  });
}

This returns

Record No:
              40597

I just want the last part.

I could add steps to remove the text Record No: and then trim the whitespace to obtain the value.

Is there a better solution? I have to do this for several items, and there are numerous entries on each page using markup similar to that shown above.

Possible duplicate of [Using .text() to retrieve only text not nested in child tags](https://stackoverflow.com/questions/3442394/using-text-to-retrieve-only-text-not-nested-in-child-tags) — Alexandre Elshobokshy, Jan 17 '19 at 15:20
I read that example, and I wondered if it would be applicable... and I wondered if it was really more efficient. — mcv, Jan 17 '19 at 15:45

Mosè Raguzzini · Accepted Answer · 2019-01-17T15:27:31.817

Although this is not perfect, when you are looking for plain text in the DOM, I prefer to work with nodes.

This is a vanilla JavaScript example:

var oDiv = document.getElementsByClassName("padded")[0];
var lastText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
    var curNode = oDiv.childNodes[i];
    if (curNode.nodeName === "#text") {
        lastText = curNode.nodeValue;
    }
}
console.log(lastText);

<TABLE>
  <TD class='padded' vAlign='top' width="10%">
     <SPAN class='bold'>Record No:</SPAN>
     <BR />40597
  </TD>
</TABLE>
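
The loop above keeps only the last text node it meets. As a minimal sketch of the same idea in reusable form, the helper below collects every text node that is a direct child of the cell. `ownText` is a hypothetical name, not a DOM or jQuery API; it relies only on the standard `childNodes`, `nodeType` and `nodeValue` properties and assumes the value sits directly inside the cell rather than in a nested element.

```javascript
// Hypothetical helper: returns the text that belongs directly to `element`,
// skipping text inside child elements such as the <SPAN> label.
function ownText(element) {
  var TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers
  var parts = [];
  for (var i = 0; i < element.childNodes.length; i++) {
    var node = element.childNodes[i];
    if (node.nodeType === TEXT_NODE) {
      parts.push(node.nodeValue);
    }
  }
  // Leading and trailing whitespace (the line breaks and indentation
  // around the tags) is trimmed away.
  return parts.join("").trim();
}
```

In a browser it could be called as `ownText(document.getElementsByClassName("padded")[0])`, which for the sample cell yields `40597`.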

A jQuery flavour that avoids walking the nodes by cloning the cell and stripping its children:

const node = $(".padded")
        .clone()    //clone the element
        .children() //select all the children
        .remove()   //remove all the children
        .end()  //again go back to selected element
        .text()
        .trim();
  
console.log(node);

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<TABLE>
  <TD class='padded' vAlign='top' width="10%">
     <SPAN class='bold'>Record No:</SPAN>
     <BR />40597
  </TD>
</TABLE>

Ref: Using .text() to retrieve only text not nested in child tags
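
When several fields have to be pulled out of many cells, the node approach can be keyed on the label text. The following is only a sketch under stated assumptions: `extractFields` is a hypothetical name, each cell is assumed to hold its label inside a child element (the SPAN) and its value as direct text, and labels are compared including the trailing colon.

```javascript
// Hypothetical helper: maps each wanted label to the cell's own text.
// `cells` is any array-like of elements, for example the result of
// document.getElementsByClassName("padded").
function extractFields(cells, wantedLabels) {
  var ELEMENT_NODE = 1, TEXT_NODE = 3; // standard nodeType values
  var result = {};
  for (var i = 0; i < cells.length; i++) {
    var label = "", value = "";
    var nodes = cells[i].childNodes;
    for (var j = 0; j < nodes.length; j++) {
      if (nodes[j].nodeType === ELEMENT_NODE) {
        label += nodes[j].textContent; // text of the <SPAN>; <BR> adds nothing
      } else if (nodes[j].nodeType === TEXT_NODE) {
        value += nodes[j].nodeValue;   // text directly inside the cell
      }
    }
    label = label.trim();
    if (wantedLabels.indexOf(label) !== -1) {
      result[label] = value.trim();
    }
  }
  return result;
}
```

For the sample markup, `extractFields(document.getElementsByClassName("padded"), ["Record No:"])` would return an object whose `"Record No:"` property is `"40597"`. Further labels can be added to the array; cells whose label is not listed are skipped.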

Would that be curNode.nodeName === "Record No:"? I have to apply this process to find more than just this entry. There are 6 items I need to extract from over 50. — mcv, Jan 17 '19 at 15:22
nope, curNode.nodeName only contains values described in https://developer.mozilla.org/it/docs/Web/API/Element/nodeName — Mosè Raguzzini, Jan 17 '19 at 15:26
Interesting. I will attempt it once I have the rest of the searches I need laid out. — mcv, Jan 17 '19 at 15:46

score -1 · Answer 2 · answered Jan 17 '19 at 14:55

Try a Regular Expression to parse the number out directly:

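function getHtmlDoc(data){
  var el = document.createElement('html');
  el.innerHTML = data;
  // .html() returns re-serialized markup, so attribute quotes and "/" in <BR> are optional here
  var pattern = /<SPAN class="?bold"?>Record No:<\/SPAN>[\s\S]*?<BR\s*\/?>\s*([0-9]+)/i;
  $.each($('.padded',el),function(index,item){
        // use the match result instead of the deprecated RegExp.$1
        var match = $(this).html().match(pattern);
        if(match){
          console.log(index + " " + match[1]);
        }
  });
}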
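
If the regular-expression route is taken, the label can be made a parameter so that one pattern serves several fields. This is a sketch, not a general HTML parser: `extractByLabel` is a hypothetical name, it works on the raw HTML string, and it assumes the markup always has the shape label SPAN, then BR, then a purely numeric value.

```javascript
// Hypothetical helper: pulls the numeric value that follows a given label.
function extractByLabel(html, label) {
  // Escape regex metacharacters so the label is matched literally.
  var escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var pattern = new RegExp(
    "<span[^>]*>\\s*" + escaped + "\\s*</span>\\s*<br\\s*/?>\\s*([0-9]+)",
    "i" // case-insensitive, so <SPAN> and <span> both match
  );
  var match = pattern.exec(html);
  return match ? match[1] : null; // captured digits, or null when absent
}
```

Called as `extractByLabel(data, "Record No:")` on the sample markup, it returns the string `"40597"`; a label that does not occur gives `null`.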

IceMetalPunk

Please [do not use regular expressions on html](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – Alexandre Elshobokshy Jan 17 '19 at 15:02
The advice against using regex on HTML comes down to "it can't parse every form of HTML". But in this case, the format is well-known in advance and presumably isn't going to change, so it doesn't *need* to parse every form of HTML; is there a reason why it shouldn't be done in this specific instance? – IceMetalPunk Jan 17 '19 at 16:43
it's a very bad practice, if you can do it without regex, do it. That's all there is to it :) – Alexandre Elshobokshy Jan 17 '19 at 17:24
I'm always wary of people who say something is "bad" without a reason. That applies more generally in life, too, but also includes programming. If there's an efficiency concern, or a security risk, or something like that, then sure, I get it. But if it's "bad just because it's bad", I don't believe that's a valid reason to avoid code that works perfectly well. – IceMetalPunk Jan 17 '19 at 17:42
