
I'm trying to parse a simple HTML page with pup, a command-line HTML parser that accepts standard CSS selectors.

I want to select:

'div.aclass text{}' #(would be SampleA)

and I also want to select:

'div.bclass text{}' #(would be SampleB)

and I want to concatenate them and insert some custom text to get:

SampleA;MYEXTRASTRING;SampleB

I want to avoid calling pup more than once as it is slow.

I can select multiple tags:

'div.aclass text{}, div.bclass text{}'

but this results in two separate lines:

SampleA
SampleB

Is there any better choice than pup for this purpose?

(Note: Python is NOT an option as it's very slow for my needs.)

Daniel

1 Answer


Multiple selectors do not seem to work with pup; there is an open issue about it here: https://github.com/ericchiang/pup/issues/59

To achieve what you want, I would suggest using the hxselect command, which is part of HTML-XML-utils: https://www.w3.org/Tools/HTML-XML-utils/README

Example:

curl -s http://example.com/ | hxselect -c -s ';MYEXTRASTRING;' 'body > div:nth-child(1) > h1:nth-child(1), body > div:nth-child(1) > p:nth-child(3) > a:nth-child(1)' | sed 's/\(.*\);MYEXTRASTRING;/\1/'

curl part:

curl downloads the HTML content of http://example.com

hxselect part:

hxselect supports multiple CSS selectors. Separate the selectors with a comma (,).

-c: print only the content of each matched element, without the enclosing HTML tags

-s: separator text printed after each match. In your case, it's ;MYEXTRASTRING;

sed part:

Because the -s separator text is printed after every match, it appears twice here: once between the two results and once after the last one. sed removes that trailing occurrence.
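To see what the sed step does on its own, here is a small sketch that skips curl and hxselect entirely. The `raw` string is a made-up stand-in for what hxselect would print with the separator after each match (it is not captured from a real run), and `raw` and `result` are just illustrative variable names.

```shell
# Hypothetical stand-in for hxselect output: the separator follows every
# match, including the last one.
raw='SampleA;MYEXTRASTRING;SampleB;MYEXTRASTRING;'

# \(.*\) is greedy, so it captures everything up to the final separator;
# only that last occurrence is dropped from the line.
result=$(printf '%s' "$raw" | sed 's/\(.*\);MYEXTRASTRING;/\1/')

printf '%s\n' "$result"
```

Note that this relies on the whole output being on a single line, since sed applies the substitution line by line.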

Kevin Cui
  • Thank you, this would work, however, it keeps telling me 'Input is not well-formed. (Maybe try normalize?)'. Do you know any tips to normalize the html before feeding it to hxselect? – Daniel Jan 04 '19 at 22:34
  • Try `... | hxnormalize -x | hxselect ...`? hxnormalize is another command in HTML-XML-utils, which is used to normalize HTML. – Kevin Cui Jan 04 '19 at 22:55
  • Awesome! Thank you! Works as I wanted originally. – Daniel Jan 04 '19 at 23:00
  • Sorry, I was faster and did not notice the error msg. hxselect prints: Syntax error at ",". I tried with hxselect -c 'selector1', 'selector2'; hxselect -c 'selector1, selector2', etc. – Daniel Jan 04 '19 at 23:13
  • https://bugs.launchpad.net/ubuntu/+source/html-xml-utils/+bug/1620555 comma seems not working recently – Daniel Jan 04 '19 at 23:19
  • @Daniel Using "," as separator works well for me. Could you try `curl -s http://example.com/ | hxselect -c 'title,body'`? If not, which hxselect version you have? hxselect -v. – Kevin Cui Jan 04 '19 at 23:39
  • Running it: `curl -s http://example.com/ | hxselect -c 'title,body' Syntax error at "," (23) Failed writing body` Version: `hxselect -v Version: html-xml-utils 7.1`. Which version do you have? I'll try compiling it manually. I just installed this via apt. – Daniel Jan 05 '19 at 09:12
  • I compiled hxselect 7.1. It does indeed show the syntax error. Try the latest version, 7.7; it works without error. – Kevin Cui Jan 05 '19 at 22:07
  • It seems (as of version 7.7) that `hxselect` does not really support "multiple" comma-separated selectors, but only two. When given more than two selectors, it only considers the first and last ones provided. Can anyone confirm? – Skippy le Grand Gourou Dec 02 '19 at 12:01
  • @SkippyleGrandGourou seems so, you're right. It happens to me as well, the middle selectors are ignored... – Kevin Cui Dec 02 '19 at 14:28
  • @KevinCui [Also confirmed here](https://stackoverflow.com/questions/48493639/middle-selectors-ignored-in-hxselect). I couldn't find a W3C repository to report the bug (apparently [they closed their public Bugzilla last April](https://www.w3.org/2019/01/bugzilla-shutdown.html#migration)). – Skippy le Grand Gourou Dec 03 '19 at 20:27
  • The current package (v. 8.0) from https://www.w3.org/Tools/HTML-XML-utils no longer has this 2-selector limitation. – DanB Oct 24 '21 at 15:49