
I am scraping a webpage for information. Everything I need on the page is in separate divs that share a specific class.

For example:

<div class="temp">text </div>

The issue is that the number of these divs varies each day: some days there are 5, other days 10 or 12. After the divs I need come more divs with the same class, but they contain info I do not need. In the HTML, a comment line separates the two groups, like so:

<div class="temp">text </div>
<div class="temp">moretext </div>
<!-- beginning of historical data -->
<div class="temp">text </div>

I'm currently getting the divs with

var temps = window._document.getElementsByClassName('temp');
for (var i = 0; i < temps.length; i++) {
    var a = temps[i].getElementsByTagName('a');
    var text = temps[i].textContent;
    // do something with vars
}

That works, but since I don't know how many divs come before the comment, I can't limit the for loop to just what I need. Without a limit I pull everything, including what I don't need; with a fixed limit I pull either too much or too little.

Is there a way to pull just the divs before the comment?

jcalton88

1 Answer


This does what you described for the example HTML you gave, but it assumes that both the interesting div elements and the comment are children of the body element, and that there is only one comment in the document.

The general concept is to find the index of the comment node and only process divs that have a lower index.

(The first version also assumes that your browser supports ECMAScript 6; the second version does not.)

function doSomethingWithTemps() {
    var commentIndex = $('*').contents().filter( (i,v) => v.nodeType == 8).index();
    $('.temp').filter( (i,v) => $(v).index() < commentIndex ).each( (i,v) => console.log(v.textContent) );
}

function nonEcma6() {
    var commentIndex = $('*').contents().filter( function(i,v) { return v.nodeType == 8 } ).index();
    console.log("Index: "+commentIndex);
    $('.temp').filter( function(i,v) { return $(v).index() < commentIndex } ).each( function(i,v) { console.log(v.textContent) } );
}

$(nonEcma6);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<body>
<div class="temp">text </div>
<div class="temp">moretext </div>
<!-- beginning of historical data -->
<div class="temp">text </div>
</body>

The code for finding comment nodes is from "Selecting HTML Comments with jQuery".
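
The same idea can also be sketched without jQuery, which may be easier to debug when the page is loaded in jsdom rather than a browser. This is only a minimal sketch under the same assumption as above, namely that the divs and the comment are siblings under one parent. The function name elementsBeforeComment is made up for illustration, and the class check uses a plain className string match.

```javascript
// Collect the child elements of `parent` that carry `className` and appear
// before the first comment node. Assumes the divs and the comment are
// direct children of the same parent, as in the example HTML.
function elementsBeforeComment(parent, className) {
    var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
    var COMMENT_NODE = 8; // same value as Node.COMMENT_NODE
    var result = [];
    var children = parent.childNodes;

    for (var i = 0; i < children.length; i++) {
        var node = children[i];

        if (node.nodeType === COMMENT_NODE) {
            break; // stop at the first comment; later divs are skipped
        }

        // Pad with spaces so 'temp' does not match a class like 'temperature'.
        if (node.nodeType === ELEMENT_NODE &&
            (' ' + node.className + ' ').indexOf(' ' + className + ' ') !== -1) {
            result.push(node);
        }
    }
    return result;
}
```

With the question's setup it could be called as elementsBeforeComment(window._document.body, 'temp'), and the textContent of each returned element pushed to an array. Looping over childNodes instead of comparing indexes means text nodes and other siblings cannot throw the count off.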

Tibrogargan
  • I will give this a go first thing tomorrow. If this works well I will be done and able to move onto a new project. Your assumptions were correct as well. – jcalton88 Nov 11 '16 at 07:38
  • This doesn't appear to be working. It is running in a jsdom window in nodejs, would that make a difference? It is not returning or printing anything when I run it. – jcalton88 Nov 16 '16 at 05:23
  • Modified the answer to include a non ecma 6 version. See how that goes. Would help if you would describe what's happening instead of just "doesn't appear to be working" – Tibrogargan Nov 16 '16 at 05:28
  • Nothing happened as far as I could tell. I ran the code and nothing was logged to the console and stepping through I couldn't see where anything was being returned. I'll input this and do more testing. I have been temporarily pulled off this to work on another project so I'm trying to get this done in spare time. Thanks for the update. – jcalton88 Nov 16 '16 at 05:41
  • All it's doing is logging to the console, nothing is being returned. Your question never really specifies what's supposed to happen to this data. – Tibrogargan Nov 16 '16 at 05:45
  • Currently I am building an array and iterating through it getting the textContent from each item. When I ran your suggestion I wasn't getting anything printed to the console. It was poor choice of words on my part to say nothing was returned. Once I can get it to successfully print to the console I can just modify that code to push to my array. – jcalton88 Nov 16 '16 at 05:53
  • I will test further and update, I am not very good with jquery so it was probably something I did. Thank you! – jcalton88 Nov 16 '16 at 05:54
  • Sounds like the function isn't being called at all. It should at least either log the index or die and produce a message. – Tibrogargan Nov 16 '16 at 05:54