Is it possible to define an HTML selector that concatenates multiple selectors and separates them by a semicolon?

Question

I'm trying to parse a simple HTML page with pup, a command-line HTML parser that accepts general CSS selectors.

I want to select:

'div.aclass text{}' #(would be SampleA)

and I also want to select:

'div.bclass text{}' #(would be SampleB)

and I want to concatenate them and insert some custom text to get:

SampleA;MYEXTRASTRING;SampleB

I want to avoid calling pup more than once as it is slow.

I can select multiple tags:

'div.aclass text{}, div.bclass text{}'

but this results in:

SampleA
SampleB

Is there any better choice than pup for this purpose?

(Note: Python is NOT an option as it's very slow for my needs.)

score 4 · Answer 1 · answered Jan 04 '19 at 21:35

Multiple selectors do not seem to work with pup; there is an issue about it here: https://github.com/ericchiang/pup/issues/59

To achieve what you want, I suggest using the hxselect command, which is part of HTML-XML-utils: https://www.w3.org/Tools/HTML-XML-utils/README

Example:

curl -s http://example.com/ | hxselect -c 'body > div:nth-child(1) > h1:nth-child(1)', 'body > div:nth-child(1) > p:nth-child(3) > a:nth-child(1)' -s ';MYEXTRASTRING;' | sed 's/\(.*\);MYEXTRASTRING;/\1/'

curl part:

curl downloads the HTML content of http://example.com/

hxselect part:

hxselect supports multiple CSS selectors. Use , to separate these selectors.

-c: print the content only, without the enclosing HTML tags

-s: separator text after each match. In your case, it's ;MYEXTRASTRING;

sed part:

Because the -s separator text is appended after every match, it also appears after the last one, so it is printed twice here. sed removes that trailing separator.
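To see what the sed step does in isolation, here is a minimal sketch that does not need hxselect installed. The string in `raw` is a hand-written stand-in for what hxselect would print (an assumption, not captured output), and `strip_trailing_sep` is just a helper name chosen for this example.

```shell
#!/bin/sh
# Stand-in for the output of hxselect -c -s ';MYEXTRASTRING;' with two matches:
# every match, including the last one, is followed by the separator.
raw='SampleA;MYEXTRASTRING;SampleB;MYEXTRASTRING;'

# The greedy \(.*\) captures everything up to the LAST separator on the line,
# so only that final occurrence is dropped.
strip_trailing_sep() {
    sed 's/\(.*\);MYEXTRASTRING;/\1/'
}

result=$(printf '%s\n' "$raw" | strip_trailing_sep)
printf '%s\n' "$result"
```

Note that the expression is not anchored with $, so it removes the last separator on the line even if other text follows it.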

answered Jan 04 '19 at 21:35

Kevin Cui

Thank you, this would work, however, it keeps telling me 'Input is not well-formed. (Maybe try normalize?)'. Do you know any tips to normalize the html before feeding it to hxselect? – Daniel Jan 04 '19 at 22:34
Try `... | hxnormalize -x | hxselect ...`? hxnormalize is another command in HTML-XML-utils, which is used to normalize html – Kevin Cui Jan 04 '19 at 22:55
Awesome! Thank you! Works as I wanted originally. – Daniel Jan 04 '19 at 23:00
Sorry, I replied too quickly and did not notice the error message. hxselect prints: Syntax error at ",". I tried hxselect -c 'selector1', 'selector2'; hxselect -c 'selector1, selector2', etc. – Daniel Jan 04 '19 at 23:13
https://bugs.launchpad.net/ubuntu/+source/html-xml-utils/+bug/1620555 The comma does not seem to be working lately. – Daniel Jan 04 '19 at 23:19
@Daniel Using "," as a separator works well for me. Could you try `curl -s http://example.com/ | hxselect -c 'title,body'`? If that fails, which hxselect version do you have? Check with hxselect -v. – Kevin Cui Jan 04 '19 at 23:39
Running it: `curl -s http://example.com/ | hxselect -c 'title,body' Syntax error at "," (23) Failed writing body` Version: `hxselect -v Version: html-xml-utils 7.1`. Which version do you have? I'll try compiling it manually. I just installed this via apt. – Daniel Jan 05 '19 at 09:12
I compiled hxselect 7.1, and it does indeed show the syntax error. Try the latest version, 7.7; it works without error. – Kevin Cui Jan 05 '19 at 22:07
It seems (as of version 7.7) that `hxselect` does not really support "multiple" comma-separated selectors, but only two. When given more than two selectors, it only considers the first and last ones. Can anyone confirm? – Skippy le Grand Gourou Dec 02 '19 at 12:01
@SkippyleGrandGourou seems so, you're right. It happens to me as well, the middle selectors are ignored... – Kevin Cui Dec 02 '19 at 14:28
@KevinCui [Also confirmed here](https://stackoverflow.com/questions/48493639/middle-selectors-ignored-in-hxselect). I couldn't find a W3C repository to report the bug (apparently [they closed their public Bugzilla last April](https://www.w3.org/2019/01/bugzilla-shutdown.html#migration)). – Skippy le Grand Gourou Dec 03 '19 at 20:27
The current package (v. 8.0) from https://www.w3.org/Tools/HTML-XML-utils no longer has this 2-selector limitation. – DanB Oct 24 '21 at 15:49
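If the installed hxselect version has the comma problems described in the comments above, one workaround is to let the selector tool print one match per line (as in the question's pup output) and join the lines in a single awk pass. This is only a sketch under that assumption: the input is simulated with printf, `join_lines` is a helper name made up for this example, and `page.html` in the comment is a hypothetical file.

```shell
#!/bin/sh
# join_lines SEP: read lines on stdin and print them on one line joined by SEP.
join_lines() {
    awk -v sep="$1" 'NR > 1 { printf "%s", sep } { printf "%s", $0 } END { print "" }'
}

# Stand-in for something like:
#   pup 'div.aclass text{}, div.bclass text{}' < page.html
joined=$(printf 'SampleA\nSampleB\n' | join_lines ';MYEXTRASTRING;')
printf '%s\n' "$joined"
```

Because the separator is only printed before the second and later lines, no trailing separator has to be stripped afterwards. Blank lines in the real tool output would need to be filtered out first.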
