I have a Chrome extension browser action that extracts metadata from the current tab. It works as expected, except on certain webpages (e.g., YouTube) where some elements (e.g., the description) do not come from the current page, but from the homepage (or from wherever the last manual page refresh occurred within the same domain). The title and URL are always retrieved correctly, but those use different code (chrome.tabs.query).

An example of the script code for meta description (using chrome.tabs.executeScript):

var descriptionElement = document.querySelector("meta[name=\'description\']");
if (descriptionElement != null) 
  descriptionElement = descriptionElement.getAttribute("content");

This code will return, "Share your videos with friends, family, and the world" (which is the homepage meta description) on any YouTube.com page, unless I reload the page or arrive there from a direct link.

I've noticed that the original HTML ("View page source") is always correct, but Chrome Developer Tools Elements is displaying the incorrect head data.

How can I extract the current tab's metadata without requiring a full page reload? Is there a way other than querySelector that can bypass the browser's rendering and access the HTML directly?

  • Check out this question on how to push new HTML data from the server to the browser without refreshing the page: http://stackoverflow.com/questions/16028921/how-to-push-new-html-data-from-the-server-to-the-browser-without-refreshing-the. It should hopefully solve your problem. – Sean Carroll Nov 26 '14 at 21:11
  • @Sean Carroll Unless I'm mistaken, my issues aren't a result of websockets. It's not as though the webpages are sending new data. For example: Simply go to YouTube.com. Click on any video. Check the data in the Chrome Developer Tools. It's wrong. Refresh the page. Now it's right. – AvidSnacker Nov 26 '14 at 21:23
  • "View page source" re-retrieves the page, so it's the same as if you refreshed it. Seems like YouTube navigation (that does not actually navigate but manipulates history state) does not refresh that part of the DOM, and therefore it's not the extension's fault. Developer Tools represent the DOM as it actually is. – Xan Nov 26 '14 at 21:36
  • @Xan I think this is definitely the issue here. Any ideas how I might circumvent this? Is there a way to access the offending DOM parts (e.g., `<meta>`) without a page refresh? Also, why would a website not want some of the metadata to be relevant in the DOM? – AvidSnacker Nov 26 '14 at 21:53
  • Why? I assume it's because `<meta>` fields are more for robots and such, which will not use JS-based navigation. You could either scrape the page with an XHR (which will return the same as page source does) or construct another query that looks for the description on the page. – Xan Nov 26 '14 at 21:56
  • @Xan Thanks! Queries on the page will obviously be site-specific (e.g., searching by id), so I'm curious how to scrape with an XHR. I read this [link](http://stackoverflow.com/questions/13765031/scrape-eavesdrop-ajax-data-using-javascript) but it isn't exactly clear to me. – AvidSnacker Nov 26 '14 at 22:21

2 Answers

1

YouTube supports Microdata, and stores the latest description of the page in a <meta itemprop="description" content="..."> tag within the body. To make use of this feature, replace your meta[name="description"] selector with meta[itemprop="description"].
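
As a minimal sketch of that selector swap, the helper below tries the microdata tag first and falls back to the classic head tag. The name extractDescription and the fallback order are assumptions made for illustration, not part of the original code; it only relies on querySelector and getAttribute.

```javascript
// Hypothetical helper: returns the page description, preferring the
// microdata tag and falling back to the classic <meta name="description">.
function extractDescription(doc) {
  var selectors = [
    'meta[itemprop="description"]',
    'meta[name="description"]'
  ];
  for (var i = 0; i < selectors.length; i++) {
    var el = doc.querySelector(selectors[i]);
    var content = el && el.getAttribute('content');
    if (content) {
      return content; // first non-empty match wins
    }
  }
  return null; // neither tag present, or both empty
}

// In the injected script (where `document` exists):
//   var description = extractDescription(document);
```

Passing the document in as an argument keeps the helper usable both in a script injected with chrome.tabs.executeScript and on a document obtained some other way.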

YouTube uses pushState-based navigation to update the document's view without really "reloading" the page. That's why there is a discrepancy between the live DOM and the view-source DOM. If you want to detect when such a navigation occurs, read this answer to "Chrome extension is not loading on browser navigation at YouTube".
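
One possible way to notice such navigations from a background page is chrome.webNavigation.onHistoryStateUpdated; this is a sketch under assumptions and not necessarily the approach taken in the linked answer. It assumes the "webNavigation" permission is declared in the manifest, and registerNavigationWatcher and onNavigate are hypothetical names.

```javascript
// Hypothetical helper: calls onNavigate(tabId, url) whenever a top-level
// frame on youtube.com changes its URL through the History API.
function registerNavigationWatcher(webNavigation, onNavigate) {
  var filter = { url: [{ hostSuffix: 'youtube.com' }] };
  webNavigation.onHistoryStateUpdated.addListener(function (details) {
    // frameId 0 is the top-level frame; ignore embedded frames.
    if (details.frameId === 0) {
      onNavigate(details.tabId, details.url);
    }
  }, filter);
}

// Only wire it up when running inside an extension.
if (typeof chrome !== 'undefined' && chrome.webNavigation) {
  registerNavigationWatcher(chrome.webNavigation, function (tabId, url) {
    console.log('In-page navigation in tab ' + tabId + ': ' + url);
  });
}
```

The event only signals that the URL changed; the page may still be updating its DOM at that moment, so the metadata may need to be read a little later.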


On pages that do not support microdata, but behave as described in your question, you could fetch the page and query the DOM using the following snippet (assuming that you have host permissions for the page):

var url = 'https://www.youtube.com/'; // e.g. from tab.url
var x = new XMLHttpRequest();
x.open('GET', url);
// Ask the browser to parse the response into a Document.
x.responseType = 'document';
x.onload = function() {
    var doc = x.response; // null if the response could not be parsed
    var desc = doc && doc.querySelector('meta[name="description"]');
    if (desc) {
        desc = desc.getAttribute('content');
        // Do something, e.g. show in dialog:
        alert(desc);
    }
};
x.send();
Rob W
-1

OK, I get what you mean. I had a similar issue a while ago when I was trying to automatically go to a dynamically created #anchor on another page. To get it to work, I used:

<body onload="goToAnchor();">
<script type="text/javascript">
   function goToAnchor() 
   {
     location.href = "#openModal";
   }
</script>

Maybe you could use onload="yourFunction();" to achieve what you require?

Hope this helps.

Sean Carroll
  • This makes zero sense as an answer to this question. – Xan Nov 26 '14 at 21:45
  • First off, you can't put inline script (onload="") into a Chrome extension. You _can_ put it into the JS file itself, however. In any case, I'm already using '$(document).ready'. It doesn't seem to be an issue with the timing of the load. – AvidSnacker Nov 26 '14 at 21:50