
I'm scraping the source code of a website.

My first print prints out the complete source code.

Then the second print prints an actual DOM to the console, but for some reason the contents of the document change just slightly.

A thing that bugs me is that the <body> tag goes missing and I have no idea why.

I just realized the <head> tag goes missing as well. So there might be a good reason for it.

TO CLARIFY: The content of both the <head> and <body> tags remains together in the container. Only the tags themselves disappear, not their content.

I want the whole source code to be parsed into an accessible DOM.

This is the code:

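$.ajax({
    url: url,
    dataType: "text",
    success: function (data) {
        // Raw response: the complete source code as a string.
        console.log("data:", data);

        // Parsed into a detached <html> element; this is where the <head> and <body> tags go missing.
        var htmlDocument = $("<html>").html(data)[0];
        console.log("htmlDocument:", htmlDocument);
    }
});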

I am new to JavaScript, so thank you for any help. I am eager to understand the issue, but for now I really just want it to work.

Alohci
felixmp
  • What exactly do you want to do with it? Please elaborate on the use case a bit more. Note that jQuery html() removes `<head>` and `<body>` – charlietfl Aug 05 '18 at 16:54
  • i want to access the body tag and search its content. i just realized the `<head>` tag is missing as well. why is that? – felixmp Aug 05 '18 at 16:55
  • If all you want is to search through it, do `var $content = $('<div>').html(data);` then you can use `find()` on `$content` ... `console.log($content.find('div').length)` for example will give the count of divs from the other page – charlietfl Aug 05 '18 at 16:57
  • what's the difference between `$('<div>').html(data);` and `$('<html>').html(data);`? i would prefer to have an exact copy of the source code accessible as a DOM. – felixmp Aug 05 '18 at 17:01
  • No real need to use `<html>` since all you said you want to do is look in the content – charlietfl Aug 05 '18 at 17:02
  • well, why not `$().html(data);` then? i haven't completely understood the .html() function yet. thank you for your help – felixmp Aug 05 '18 at 17:04
  • Because you want an outer container to use `find()` from – charlietfl Aug 05 '18 at 17:05
  • i understand! Do you have any idea though why the `<head>` and `<body>` tags go missing? their contents just get added together. i now have the container (be it `<div>` or `<html>`) containing the `<head>`'s and `<body>`'s content – felixmp Aug 05 '18 at 17:08
  • Because they get stripped out by jQuery, because a page can only have one head and one body ... any more is invalid. html() is predominantly used right in the DOM itself, although it also works outside the DOM – charlietfl Aug 05 '18 at 17:09
  • oh okay, so because the variable i assign the `$('<html>').html(data);` to is part of my whole document (is it?), my head and body tags get stripped? can't i have the scraped DOM in a variable and just access it like a different page? – felixmp Aug 05 '18 at 17:12
  • No...that is completely outside the current document. It is only an object in memory until you insert it into the document – charlietfl Aug 05 '18 at 17:14
  • the answer was clearly in the jQuery docs, as @codemirror wrote in the answer. ref: http://api.jquery.com/html/ – Itamar Aug 05 '18 at 17:43
  • @charlietfl okay, that makes sense. since i don't want to add it to my document but keep it in memory, why are body and head stripped already? why are they not stripped only when i add it to the document? thank you – felixmp Aug 06 '18 at 02:48

1 Answer


As Charlietfl said

Note that jQuery .html() removes body and head

Try

 $('html')[0].outerHTML

or

document.documentElement.outerHTML

See more here: How do I get the entire page's HTML with jQuery?

codemirror
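
A possible alternative, not taken from the answer or comments above: the question asks for the fetched source to become its own DOM with `<head>` and `<body>` intact. As far as I understand it, the tags vanish because the string is parsed as a fragment inside a container element rather than as a whole document. The browser's built-in `DOMParser` parses a string as a complete, separate document instead. The sketch below assumes a browser environment; the function name `parseHtmlDocument` and its optional `ParserClass` parameter are hypothetical and only exist to make the helper usable where `DOMParser` is not a global.

```javascript
// Sketch: turn a full HTML source string into a standalone Document.
// Assumption: runs in a browser, where DOMParser is a global.
// `ParserClass` is a hypothetical injection point for other environments.
function parseHtmlDocument(htmlText, ParserClass) {
  var Parser = ParserClass || globalThis.DOMParser;
  if (typeof Parser !== "function") {
    throw new Error("DOMParser is not available in this environment");
  }
  // "text/html" asks for HTML document parsing, so the result
  // has its own documentElement, head and body.
  return new Parser().parseFromString(htmlText, "text/html");
}

// Usage inside the success callback from the question:
//   var doc = parseHtmlDocument(data);
//   console.log(doc.head, doc.body);
//   console.log(doc.querySelectorAll("div").length);
```

The returned document lives only in memory and is not attached to the current page, so it can be searched with `querySelector`/`querySelectorAll`, or wrapped with `$(doc)` to use `find()`.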