
I just noticed that the browser ignores HTML source formatting (such as the alignment of the two attributes in this snippet):

 <div id="container">
      <div id="contained"
           other-prop="some value">
      </div>
 </div>

If you run

var container = document.getElementById('container');
console.log(container.innerHTML);

you get the output:

<div id="contained" other-prop="some value">
</div>

It doesn't matter how it's written in the source or even if you set .innerHTML in JavaScript directly.

Is it possible, from JavaScript, to query the page source corresponding to an element exactly as the user wrote it, with whitespace and everything? I can see there being a problem when the user modifies the element using DOM operations, in which case I'll still be happy if

  1. Original user formatting is kept for everything untouched by modifications, or
  2. It's possible to get the original source as the user wrote it on page load, without the DOM modifications

A snippet so you can see it in action:

var container = document.getElementById('container');
console.log(container.innerHTML);

container.innerHTML = `  
  <div id="contained"
       other-prop="some value">
  </div>
`;

console.log(container.innerHTML);

container.children[0].setAttribute('modification', '');

console.log(container.innerHTML);

<div id="container">
  <div id="contained"
       other-prop="some value">
  </div>
</div>
Peeyush Kushwaha
  • Not possible to read the stream that the browser uses to compose the page. – Travis J Aug 23 '19 at 20:59
  • @TravisJ any references for that? (mentioned somewhere in the documentation / another SO thread...?) – Peeyush Kushwaha Aug 24 '19 at 05:18
  • The stream is read at the application level. There is no "documentation" for this, because it is essentially common knowledge, just as there would be no documentation stating that the registry cannot be accessed by JavaScript from a webpage. Accessing the stream would be exiting the sandbox, and would essentially mean you had gained access to the operating system level of instruction execution since that is where the application executes. – Travis J Aug 24 '19 at 20:20
  • @TravisJ got you. Is it also the case that the browser does not expose the contents of the stream through some API? – Peeyush Kushwaha Aug 25 '19 at 17:11
  • @PeeyushKushwaha The browser does not expose the raw data of the page in any API. The only way to accomplish this that I know of would be to query the text content of the page using a `fetch` request and then find the element in that text using RegEx or `indexOf()`, as I summarized in my answer. – IronFlare Aug 25 '19 at 17:27

2 Answers


It is not possible at all. Think of code as a message between computers that expresses a visual representation, where whitespace is not important.

hmassad

Using conventional methods, no, this is not possible. When the browser parses HTML into the DOM, it keeps attribute names and values but discards the whitespace between them inside a tag, and unfortunately, there's no way to disable this behavior.

In short, when you write HTML code, you give the browser instructions for what to render, but not how to render it. When you load a page, the browser interprets those instructions and outputs a rendering of what it thinks you wanted.

When you read innerHTML, you're asking the browser to serialize its parsed representation of the page back into HTML. It does this almost perfectly, but it can't and won't put back the whitespace it removed; since that information doesn't affect the look of the page, the browser never stored it.

If you're comfortable with throwing all best practice out the window, you could theoretically use a Fetch request to query the server for the HTML content of the page you're on, then read the response as plain text rather than parsing it.
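A minimal sketch of that idea follows. The helper name `fetchPageSource` and its injectable `fetchImpl` parameter are my own assumptions for illustration, not an established API. The `cache: 'force-cache'` option is also just a guess at what is useful here, on the theory that a cached copy is more likely to match what the browser originally loaded. Keep in mind that a re-requested page is not guaranteed to be byte-for-byte identical to the one the browser parsed.

```javascript
// Hypothetical helper: re-request a page and return its markup as a plain
// string, so the author's original whitespace is still present.
// `fetchImpl` is injectable only so the function can be exercised outside
// a browser; it defaults to the global fetch.
async function fetchPageSource(url, fetchImpl = fetch) {
  // Assumption: prefer a cached response so the text is more likely to
  // match what the browser originally loaded.
  const response = await fetchImpl(url, { cache: 'force-cache' });
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  // text() resolves to the body as a string; nothing is parsed into a DOM.
  return response.text();
}
```

In a browser, this could be called as `fetchPageSource(window.location.href).then(console.log)`.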

This is problematic for your specific use case, however, since if you want to retrieve a specific element from this text, you don't have any DOM methods or utilities at your disposal. If you try to parse the plaintext using DOMParser or something similar, the text will start acting like HTML again and discard the excess whitespace.

Your best bet, if you still really want to do this, would be to use a RegEx or .indexOf() to find the element you're looking for in the middle of the plaintext response. I really do want to emphasize, though, that this is extremely bad practice and shouldn't be used for anything outside of research.
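To illustrate the string-search approach, here is a rough sketch that assumes you already have the page source as a string. The function name `findElementSource` is hypothetical, and the matching is deliberately naive: it assumes the element has a quoted `id` attribute, that no attribute value contains `>`, and that the element has an explicit closing tag (so void elements such as `<img>` are not handled). It locates the opening tag by `id`, then counts nested tags of the same name to find the matching close.

```javascript
// Hypothetical helper: return the raw source text of the element with the
// given id, exactly as written in `source`, or null if it cannot be found.
function findElementSource(source, id) {
  // Escape regex metacharacters so the id is matched literally.
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Opening tag carrying id="..." or id='...'; [^>]* also spans newlines.
  const openTag = new RegExp(
    '<([a-zA-Z][\\w-]*)\\b[^>]*\\sid\\s*=\\s*(["\'])' + escapedId + '\\2[^>]*>'
  );
  const match = openTag.exec(source);
  if (match === null) return null;

  const tagName = match[1];
  // Scan later opening/closing tags of the same name, tracking nesting depth.
  const sameTag = new RegExp('<(/?)' + tagName + '\\b[^>]*>', 'gi');
  sameTag.lastIndex = match.index + match[0].length;
  let depth = 1;
  let tag;
  while ((tag = sameTag.exec(source)) !== null) {
    depth += tag[1] === '/' ? -1 : 1;
    if (depth === 0) {
      // Slice the original string, so all whitespace is preserved.
      return source.slice(match.index, sameTag.lastIndex);
    }
  }
  return null; // No matching closing tag was found.
}
```

Because the result is sliced straight out of the original string, line breaks and attribute alignment survive untouched. Anything beyond simple, well-formed markup will likely need a more careful scanner.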

It's also important to note that this solution will not work if the page you're attempting to process is a client-rendered SPA (single-page app) built with something like React, Angular, or Vue, because the markup is generated by scripts rather than delivered in the HTML response. In that case, it's possible you could reverse-engineer the rendering scripts to find the definition for the element that contains the whitespace. Beyond that, however, you're likely out of luck.

IronFlare
  • I agree that whitespace information is unnecessary to the parser. The question remains how to get that information in spite of that. I [disagree that the best way to parse it myself would be using RegEx](https://stackoverflow.com/a/1732454/1412255) – Peeyush Kushwaha Aug 24 '19 at 18:04
  • @PeeyushKushwaha I love that post, and I agree on principle that if you're trying to *parse* an entire page, there are much better ways to do it. *However*, for your use case, which requires you to find an arbitrary element with variable amounts of whitespace, it's either that, `indexOf()`, or nothing, essentially. There simply is no other way than to retrieve the full text content of the HTML file and *somehow* find the element you're looking for in the text. Converting it into a DOM representation using a DOMParser will remove whitespace, so your only option is to process the string content. – IronFlare Aug 25 '19 at 17:31