How to scrape badly formatted HTML with x-ray

Question

I am using the x-ray module for the first time.

It works fine in general, but I run into a problem when I try to scrape data from badly formatted HTML.

For example, if I try to scrape this HTML code from a website:

<div class="item">
<dl class="list">
    <dd id="1"> Data1
    <dd id="2"> Data2
    <dd id="3"> Data3
</dl>

using this code:

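var Xray = require('x-ray');
var x = Xray();

// html holds the markup shown above; res is assumed to be an Express-style response object.
x(html, '.item', [{
    tags: x('.item', 'dd:nth-child(1)')
}])
(function(err, obj) {
    // Trim whitespace next to the quotes and drop escaped newlines in the serialized result.
    var jsonCleaned = JSON.parse(JSON.stringify(obj).replace(/"\s+|\s+"/g,'"').replace(/\\n/g, ''));
    res.json(jsonCleaned);
})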

I get the following result:

[
    {
        "tags": "Data1 Data2 Data3"
    }
]

If the <dd> tags are closed, the same code works and returns what I expect:

[
    {
        "tags": "Data1"
    }
]

Is there a way to solve this problem?

I think that if the x-ray library fails to process this HTML, the only options are either to try another library or to tidy up the HTML with some other tool ("tidy" etc.) before giving it to x-ray. - Andrew Dunai, Apr 19 '16 at 09:11
BTW, do you use the PhantomJS transport for it? PhantomJS should be able to process such HTML without problems. - Andrew Dunai, Apr 19 '16 at 09:14
Hi @AndrewDunai, thanks for your help. For now, I am not using PhantomJS, but I will try this module: https://github.com/lapwinglabs/x-ray-phantom. Perhaps it will be helpful :) Thanks again - Cyril Vandenberghe, Apr 19 '16 at 09:47

score 0 · Answer 1 · answered Apr 19 '16 at 16:43

Here is my own solution, in case someone encounters the same problem in the future.

I run the HTML through the htmltidy module before passing it to x-ray:

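var tidy = require('htmltidy').tidy;

// Repair the markup first, then scrape the tidied result instead of the raw HTML.
tidy(html, function (err, tidiedHtml) {
    if (err) {
        return res.json({ error: String(err) });
    }
    x(tidiedHtml, '.item', [{
        tags: x('.item', 'dd:nth-child(1)')
    }])
    (function(err, obj) {
        var jsonCleaned = JSON.parse(JSON.stringify(obj).replace(/"\s+|\s+"/g,'"').replace(/\\n/g, ''));
        res.json(jsonCleaned);
    })
});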

After that, badly formatted HTML is no longer a problem.
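
For markup as simple as the example in the question, it may also be possible to close the open <dd> tags by hand before giving the HTML to x-ray, without pulling in htmltidy. The sketch below is only an illustration under stated assumptions: closeOpenDdTags is a hypothetical helper (it is not part of x-ray or htmltidy), and it assumes flat, non-nested <dl> lists with no <dt> entries. For anything more complex, a real HTML repair tool is the safer choice.

```javascript
// Hypothetical helper: inserts the missing </dd> end tags in simple, flat <dl> lists.
function closeOpenDdTags(html) {
    // Split so that each <dd ...>, </dd> and </dl> tag becomes its own token;
    // the capturing group keeps the tags themselves in the resulting array.
    var tokens = html.split(/(<dd\b[^>]*>|<\/dd\s*>|<\/dl\s*>)/i);
    var ddIsOpen = false;
    var out = '';

    tokens.forEach(function (token) {
        if (/^<dd\b/i.test(token)) {
            // A new <dd> starts: close the previous one if it is still open.
            if (ddIsOpen) out += '</dd>';
            ddIsOpen = true;
        } else if (/^<\/dd/i.test(token)) {
            // The source closed this <dd> itself.
            ddIsOpen = false;
        } else if (/^<\/dl/i.test(token)) {
            // The list ends: close the last <dd> if it is still open.
            if (ddIsOpen) out += '</dd>';
            ddIsOpen = false;
        }
        out += token;
    });

    return out;
}

// Usage with markup shaped like the example in the question.
var sample = '<dl class="list">\n' +
    '    <dd id="1"> Data1\n' +
    '    <dd id="2"> Data2\n' +
    '    <dd id="3"> Data3\n' +
    '</dl>';
console.log(closeOpenDdTags(sample));
```

The returned string can then be passed to x(...) in place of the original html. Markup whose <dd> tags are already closed is returned unchanged.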
