
I need to extract information from HTML files. For most of them, I just need to match a particular DOM element's content or attribute, so I use XPath expressions like //a[@class="targeturl"]/@href with the command-line tool xidel.

In a different batch of files, the information I want sits inside a script and is not as readily available:

<html>
<head><!-- ... --></head>
<body>
    ...
    <script>
        ...
        var o = {
            "numeric": 1234,
            "target": "TARGET",
            "urls": "http://example.com",
            // Commented pair "strings": "...",
            "arrays": [
               {
                  "more": true
               }
               ,
               { 
                  "itgoeson": true
               }
            ]
        };
    </script>
    ...
</body>
</html>

Note that the object containing the value I want to get is not valid JSON. However, it seems to respect one key-value pair per line.

What can I pass to xidel --xpath "???" to get this TARGET?

I've tried different things with XPath functions, but I can't get to a solution without piping to other commands (match only tells me yes/no, replace works line by line, etc.).

dmcontador

2 Answers


Try the XPath below:

substring-before(substring-after(//script, '"target": '), ",")
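
To check what the expression above returns without running xidel, here is a minimal Python sketch of the same substring-after / substring-before logic. The helper functions and the shortened sample script are assumptions made for illustration; they only mimic the two XPath functions for this case and are not part of xidel.

```python
def substring_after(text: str, sep: str) -> str:
    """Mimic XPath substring-after: text following the first sep, or '' if absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def substring_before(text: str, sep: str) -> str:
    """Mimic XPath substring-before: text preceding the first sep, or '' if absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


# Shortened stand-in for the text content of the <script> element.
script = '''
var o = {
    "numeric": 1234,
    "target": "TARGET",
    "urls": "http://example.com",
};
'''

raw = substring_before(substring_after(script, '"target": '), ",")
print(raw)             # the value still carries its double quotes
print(raw.strip('"'))  # bare value
```

Note that the result keeps the surrounding double quotes, so they still have to be stripped if only the bare value is wanted. The approach also relies on the exact spacing after the colon and on the value containing no comma.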
Andersson

What can I pass to xidel --xpath "???" to get this TARGET?

Since var o is actually JSON, I suggest you treat it as such:

-e "json(
      //script/extract(
        .,
        'var o = (.+);',
        1,'s'
      )[.]
    )/target"
  • Extract {"field1": 1234, "target": "TARGET", "morefields": "..."} from the <script> element node (the JSON spans several lines, so don't forget the 's' regex flag).
  • Interpret the output as JSON by wrapping json( ) around it (or //script/...[.] ! json(.)) and select the target attribute.

[edit]
To remove the comments (beginning with //):

-e "json(
      //script/replace(
        extract(
          .,
          'var o = (.+);',
          1,'s'
        )[.],
        '\s+//.+',
        ''
      )
    )/target"

Not the prettiest query, but it works.
[/edit]
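
For anyone who wants to test the three steps (extract, strip comments, parse) outside xidel, here is a rough Python equivalent using the same two regular expressions. The function name extract_object and the inlined sample are assumptions for illustration only, and Python's re and json modules stand in for xidel's extract(), replace() and json().

```python
import json
import re

# Stand-in for the text content of the <script> element from the question.
script_text = '''
    var o = {
        "numeric": 1234,
        "target": "TARGET",
        "urls": "http://example.com",
        // Commented pair "strings": "...",
        "arrays": [
           {
              "more": true
           }
           ,
           {
              "itgoeson": true
           }
        ]
    };
'''


def extract_object(text: str) -> dict:
    """Pull the object literal out of 'var o = ...;' and parse it as JSON."""
    # re.DOTALL plays the role of the 's' flag: '.' also matches newlines.
    match = re.search(r'var o = (.+);', text, re.DOTALL)
    if match is None:
        raise ValueError("no 'var o = ...;' assignment found")
    # Drop '//' comments that are preceded by whitespace. The '//' in
    # 'http://' is preceded by ':', so the URL is left alone.
    body = re.sub(r'\s+//.+', '', match.group(1))
    return json.loads(body)


obj = extract_object(script_text)
print(obj["target"])
```

The same caveats apply as to the query: a string value that contains whitespace followed by // would be mangled by the comment pattern, and anything else that is not valid JSON (unquoted keys, for instance) still makes the parse step fail.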

Reino
  • "Since var o is actually JSON, I suggest you treat it as such." Sadly, the real file doesn't really have valid JSON and xidel fails to parse it. Nice approach and a complete, detailed answer, though. – dmcontador May 15 '18 at 05:48
  • Then what does this real file of yours actually look like? I've dealt with invalid JSON before. – Reino May 15 '18 at 20:46
  • This one has commented lines. It is JavaScript after all, and I can't be sure other batches won't have unquoted keys or other offending syntax. I'm updating the question with a more representative sample. – dmcontador May 17 '18 at 09:15