0

I am trying to scrape data from cells in a plain table; not all cells are required. The information is contained in the cells, as in the sample below:

<TD class=padded vAlign=top width="10%">
   <SPAN class=bold>Record No:</SPAN>
   <BR>40597
</TD>

In this example I am trying to extract the value for the field, which is 40597.

I have been able to use jQuery so far to find each td element like so:

function getHtmlDoc(data){
  var el = document.createElement('html');
  el.innerHTML = data;
  $.each($('.padded',el),function(index,item){
        if($(this).text().indexOf("Record No:")>=0){
          console.log(index + " " + $(this).text());
        }
  });
}

This returns

Record No:
              40597

I just want the last part.

I could add steps to remove the text Record No: and then trim the whitespace to obtain the value.

Is there a better solution? I have to do this for a few items, and there are numerous entries on each page using a layout similar to the one displayed above.

mcv
  • Possible duplicate of [Using .text() to retrieve only text not nested in child tags](https://stackoverflow.com/questions/3442394/using-text-to-retrieve-only-text-not-nested-in-child-tags) – Alexandre Elshobokshy Jan 17 '19 at 15:20
  • I read that example, and I wondered if it would be applicable... and I wondered if it was really more efficient. – mcv Jan 17 '19 at 15:45

2 Answers

2

Although this is not perfect, when you are looking for plain text in the DOM, I prefer to work with nodes.

This is a vanilla JavaScript example:

var oDiv = document.getElementsByClassName("padded")[0];
var lastText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
    var curNode = oDiv.childNodes[i];
    // Keep the last text node that is not just whitespace, without its padding
    if (curNode.nodeType === Node.TEXT_NODE && curNode.nodeValue.trim() !== "") {
        lastText = curNode.nodeValue.trim();
    }
}
console.log(lastText);
<TABLE>
  <TD class='padded' vAlign='top' width="10%">
     <SPAN class='bold'>Record No:</SPAN>
     <BR />40597
  </TD>
</TABLE>
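
The loop above can be wrapped in a small reusable function. The following is only a sketch: `ownText` and `sampleCell` are hypothetical names, and the plain-object cell is a stand-in that mimics the `childNodes` of the sample `<TD>` so the snippet also runs outside a browser. In a page you would pass a real element instead.

```javascript
// Node.TEXT_NODE is 3 in the DOM; hard-coded so the sketch also runs outside a browser.
const TEXT_NODE = 3;

// Concatenate the cell's direct text nodes (ignoring text nested inside child
// elements such as the <SPAN> label) and trim the surrounding whitespace.
function ownText(cell) {
  let text = "";
  for (const child of cell.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      text += child.nodeValue;
    }
  }
  return text.trim();
}

// Stand-in for the sample cell: whitespace, <SPAN>, whitespace, <BR>, value.
const sampleCell = {
  childNodes: [
    { nodeType: 3, nodeValue: "\n   " },
    { nodeType: 1, nodeName: "SPAN", textContent: "Record No:" },
    { nodeType: 3, nodeValue: "\n   " },
    { nodeType: 1, nodeName: "BR", textContent: "" },
    { nodeType: 3, nodeValue: "40597\n" },
  ],
};

console.log(ownText(sampleCell)); // 40597
```

In the browser, something like `ownText(document.getElementsByClassName("padded")[0])` should behave the same way, since a real `NodeList` can be iterated with `for...of`.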

A jQuery flavour without walking nodes, but with some tricks:

const node = $(".padded")
        .clone()    //clone the element
        .children() //select all the children
        .remove()   //remove all the children
        .end()  //again go back to selected element
        .text()
        .trim();
  
console.log(node);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<TABLE>
  <TD class='padded' vAlign='top' width="10%">
     <SPAN class='bold'>Record No:</SPAN>
     <BR />40597
  </TD>
</TABLE>

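Ref: [Using .text() to retrieve only text not nested in child tags](https://stackoverflow.com/questions/3442394/using-text-to-retrieve-only-text-not-nested-in-child-tags)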

Mosè Raguzzini
  • Would curNode.nodeName === "Record No:" work? I have to apply this process to find more than just this entry. There are 6 items I need to extract from over 50. – mcv Jan 17 '19 at 15:22
  • nope, curNode.nodeName only contains values described in https://developer.mozilla.org/it/docs/Web/API/Element/nodeName – Mosè Raguzzini Jan 17 '19 at 15:26
  • Interesting. I will attempt it when I get the remainder of the searches I require all placed out. – mcv Jan 17 '19 at 15:46
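
For the case raised in the comments (a handful of labelled items out of many cells), one possible approach is to read the label from the cell's first non-empty child element and the value from its direct text nodes. This is a rough sketch, not the answerer's code: `extractFields`, `makeCell`, and the `Status` and `Owner` labels are made up for illustration, and the plain objects only imitate DOM nodes so the snippet can run without a browser.

```javascript
const TEXT_NODE = 3;    // value of Node.TEXT_NODE
const ELEMENT_NODE = 1; // value of Node.ELEMENT_NODE

// Hypothetical helper: build a { label: value } map for the wanted labels only.
function extractFields(cells, wantedLabels) {
  const wanted = new Set(wantedLabels);
  const result = {};
  for (const cell of cells) {
    let label = null;
    let value = "";
    for (const child of cell.childNodes) {
      if (child.nodeType === ELEMENT_NODE) {
        // The first child element with text is treated as the label (the <SPAN>);
        // <BR> has no text and is skipped.
        const content = child.textContent.trim();
        if (label === null && content !== "") {
          label = content.replace(/:$/, "");
        }
      } else if (child.nodeType === TEXT_NODE) {
        value += child.nodeValue;
      }
    }
    if (label !== null && wanted.has(label)) {
      result[label] = value.trim(); // a repeated label overwrites the earlier value
    }
  }
  return result;
}

// Plain-object stand-ins shaped like the sample <TD> (assumed layout).
const text = (s) => ({ nodeType: TEXT_NODE, nodeValue: s });
const elem = (name, content) => ({ nodeType: ELEMENT_NODE, nodeName: name, textContent: content });
const makeCell = (label, value) => ({
  childNodes: [text("\n   "), elem("SPAN", label), text("\n   "), elem("BR", ""), text(value + "\n")],
});

const cells = [makeCell("Record No:", "40597"), makeCell("Status:", "Open"), makeCell("Owner:", "n/a")];
console.log(extractFields(cells, ["Record No", "Owner"]));
```

With a parsed document, the first argument could be `el.querySelectorAll(".padded")`. If the same label occurs in several entries on a page, collecting the values into arrays instead of overwriting would be the natural change.
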
-1

Try a Regular Expression to parse the number out directly:

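function getHtmlDoc(data){
  var el = document.createElement('html');
  el.innerHTML = data;
  $.each($('.padded',el),function(index,item){
        // .html() re-serializes the markup (lowercase tags, quoted attributes),
        // so the pattern allows optional quotes around the class value
        var m = $(this).html().match(/<span class="?bold"?>Record No:<\/span>[\s\S]*?<br>\s*([0-9]+)/i);
        if(m){
          console.log(index + " " + m[1]);
        }
  });
}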
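
If the regex route is taken for several labels, the pattern can be built from the label text. The sketch below is an illustration under assumptions: `numberAfterLabel` and `escapeRegExp` are hypothetical helpers, and the two sample strings represent the markup as written in the question and as a browser would typically serialize it (lowercase tags, quoted attributes).

```javascript
// Escape regex metacharacters so the label is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the digits that follow
// <span class=bold>LABEL</span> ... <br>, or null when there is no match.
function numberAfterLabel(html, label) {
  const pattern = new RegExp(
    "<span\\s+class=[\"']?bold[\"']?\\s*>\\s*" + escapeRegExp(label) +
      "\\s*</span>[\\s\\S]*?<br\\s*/?>\\s*([0-9]+)",
    "i"
  );
  const m = html.match(pattern);
  return m ? m[1] : null;
}

const asWritten = "<SPAN class=bold>Record No:</SPAN>\n   <BR>40597";
const asSerialized = '<span class="bold">Record No:</span>\n   <br>40597';

console.log(numberAfterLabel(asWritten, "Record No:"));    // 40597
console.log(numberAfterLabel(asSerialized, "Record No:")); // 40597
console.log(numberAfterLabel(asSerialized, "Owner:"));     // null
```

The lazy `[\s\S]*?` can still run past the end of a cell when a label has no number after it, which is the kind of fragility the comments below argue about.
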
IceMetalPunk
  • Please [do not use regular expressions on html](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – Alexandre Elshobokshy Jan 17 '19 at 15:02
  • The advice against using regex on HTML comes down to "it can't parse every form of HTML". But in this case, the format is well-known in advance and presumably isn't going to change, so it doesn't *need* to parse every form of HTML; is there a reason why it shouldn't be done in this specific instance? – IceMetalPunk Jan 17 '19 at 16:43
  • it's a very bad practice, if you can do it without regex, do it. That's all there is to it :) – Alexandre Elshobokshy Jan 17 '19 at 17:24
  • I'm always wary of people who say something is "bad" without a reason. That applies more generally in life, too, but also includes programming. If there's an efficiency concern, or a security risk, or something like that, then sure, I get it. But if it's "bad just because it's bad", I don't believe that's a valid reason to avoid code that works perfectly well. – IceMetalPunk Jan 17 '19 at 17:42