
I have been trying this simple task for hours. No available libraries seem to help and no questions here seem to tackle this scenario.

It's fairly simple:

  • I have an entire page's markup as a string.
  • I need to use CSS selectors to point to the elements I need to scrape the data from.
  • I DO NOT want to create actual HTML DOM elements, only scrape the data from them. The page might contain image, audio, video and other elements that I don't want to be created.
  • It needs to be able to deal with markup errors and HTML5-style tagging. Currently, trying to parse it as XML throws an "Invalid XML" exception.
  • It needs to happen in the browser. So, no NodeJS modules.

In Java I've been able to do exactly this using jsoup, but there doesn't seem to be an equivalent library for JS running in a browser.

Thanks for your time.

  • Possible duplicate of [Parse a HTML String with JS](http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js) – JRodDynamite Jul 29 '16 at 08:35
  • @JRodDynamite Not really. That post doesn't say anything about avoiding creating the HTML Elements or using CSS to target the elements containing the data. – cesarbrie Jul 29 '16 at 08:40
  • The second answer in the duplicate link is the better option, by the way - the first answer would cause any images etc to be downloaded, whereas with DOMParser this does not seem to be the case – Jaromanda X Jul 29 '16 at 08:41
  • @cesarbrie - using DOMParser, you can then use `.querySelector` and `.querySelectorAll` methods just like any page ... oh, and HTML Elements ARE created, but as I said above, no external resources are actually loaded (images, javascript, video etc) – Jaromanda X Jul 29 '16 at 08:42
  • @JaromandaX That sounds interesting. Do you happen to know whether inline script elements will be executed? – cesarbrie Jul 29 '16 at 08:45
  • no they are not in my testing – Jaromanda X Jul 29 '16 at 08:47
  • Sounds wonderful. I'm going to try it now. Thanks! – cesarbrie Jul 29 '16 at 08:48
  • @JaromandaX You were right. It does work. I'm not a frequent poster on StackExchange. Should I answer my own question? Do you want to answer it for the reputation? Should I do something else? – cesarbrie Jul 29 '16 at 08:59
  • Is it important? The answer was given in another question linked to by someone else, so I wouldn't want rep for it. All I did was point out that, in my opinion, the second answer, not the accepted one, was the correct way for you to do it so that external resources (images etc.) are NOT loaded – Jaromanda X Jul 29 '16 at 09:06

2 Answers


@JaromandaX's suggestion was correct. The way to do this is with a DOMParser object: it parses the markup into a document whose elements you can query with .querySelector or .querySelectorAll, without loading any external resources or running any scripts.

This is what worked for me:

// Parse the markup string into a detached HTML document
var parser = new DOMParser();
var doc = parser.parseFromString(markup, "text/html");
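
For example, to read text and attribute values out of the parsed document (the ".result a" selector below is only a placeholder for whatever elements you are targeting):

var links = doc.querySelectorAll(".result a"); // placeholder selector
for (var i = 0; i < links.length; i++) {
    // Reads data from the detached document; nothing is added to the live page
    console.log(links[i].textContent, links[i].getAttribute("href"));
}

Inline scripts in the parsed markup are not executed and images are not requested, which matches what was described in the comments above.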

You can use PHP's Goutte or Python's BeautifulSoup 4 library, where you can use CSS selectors or XPath as well, whichever you are comfortable with.

Here are some simple examples to get started.

PHP Goutte:

require_once 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', $url); // returns a crawler for the fetched page
foreach ($crawler->filter('your css selector here') as $node) {
    // your logic here
}

Python BeautifulSoup example:

import random
import time

import requests
from bs4 import BeautifulSoup

timeout_time = 30

# A small pool of User-Agent headers to rotate between requests
header = [{"User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1"},
          {"User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"},
          {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"},
          {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"}]

def tryAgain(passed_url):
    # Fetch the page, retrying every 20 seconds until it succeeds
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while True:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue

main_url = " your URL here "

main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

for a in main_page_soup.select(' css selector here '):
    print(a.select(' your css selector here ')[0].text)