Parsing HTML on the command line; How to capture text in <strong></strong>?

Question

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" to get the lines with the targets; that works fine.

Then I pipe to awk '{print $1}'. That works in case #1, where the target has no spaces, but fails in case #2, where the target has spaces: awk splits fields on whitespace, so only the first word is preserved, as shown below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either in my awk or in a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

What you're trying to do is "screen scraping". I assume that sooner or later you'll need something more general than "the text between 'strong'". Since you put "Perl" in your tags, I would encourage you to look at [WWW::Mechanize](http://search.cpan.org/~petdance/WWW-Mechanize-1.54/lib/WWW/Mechanize.pm). Otherwise, look here: [Linux - grep regex to pull out a string between two known strings](http://stackoverflow.com/questions/11850031/grep-regex-to-pull-out-a-string-between-two-known-strings). — paulsm4, Sep 11 '13 at 16:54

kenorb · Answer 1 · 2018-04-12T14:56:08.640

14

Try pup, a command line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target1NoSpaces
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.

edited Apr 12 '18 at 14:56

answered Apr 10 '18 at 23:10

kenorb

Thanks for this tip! I still follow this thread to stay up to date on simple CLI parsing. Thanks to your answer, TIL about `pup`! :D – Michael J Apr 22 '18 at 20:40

score 7 · Answer 2 · answered Sep 11 '13 at 17:02

One way, using Mojolicious and its DOM parser:

perl -Mojo -E '
    g("http://your.web")
    ->dom
    ->find("strong")
    ->each( sub { if ( $t = shift->text ) { say $t } } )'

answered Sep 11 '13 at 17:02

Birei

konsolebox · Accepted Answer · 2013-09-11T17:39:28.403

6

Use Perl-compatible regex look-behind and look-ahead in grep (the -P option of GNU grep, with -o to print only the matched part). It should be simpler than using awk.

grep -oP "(?<=<strong>).*?(?=</strong>)" file

Output:

Target1NoSpaces
Target2 With Spaces

Addendum:

This Ruby one-liner uses the same look-around pattern with the /m flag, which lets . match newlines, so it can also capture values that span multiple lines:

ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file

Input:

<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>

Output:

----------
Target
A
B
C
----------
Target D
----------
Target E
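For comparison, here is a minimal Python sketch of the same look-behind/look-ahead idea. It is only an illustration, not part of the original answer: the name `strong_texts` and the sample string are made up, and like the grep and Ruby versions it assumes a bare `<strong>` tag with no attributes.

```python
import re

# Same look-around pattern as the grep and Ruby examples above.
# re.DOTALL lets "." match newlines, much like Ruby's /m flag.
STRONG_RE = re.compile(r"(?<=<strong>).*?(?=</strong>)", re.DOTALL)

def strong_texts(html):
    """Return the text found between each <strong>...</strong> pair."""
    return STRONG_RE.findall(html)

# Hypothetical input mirroring the multi-line sample above.
sample = (
    "<strong>Target\nA\nB\nC\n</strong>"
    "<strong>Target D</strong><strong>Target E</strong>"
)

for text in strong_texts(sample):
    print("----------")
    print(text)
```

The non-greedy `.*?` is what keeps each match from running on to a later `</strong>`.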

edited Sep 11 '13 at 17:39

answered Sep 11 '13 at 16:54

konsolebox

+1 good approach if the tag doesn't split over multiple lines. – Chris Seymour Sep 11 '13 at 17:04
This works perfectly. The tags are on a single line and are words or a collection of two/three words. The output is regular and will always be the same so an approach that does anything more is overkill. I like the awk and perl implementations and will tuck those away for future use. Thanks, all for elevating my knowledge! – Michael J Sep 11 '13 at 17:31
@sudo_O Perl regex's multi-line feature could actually be used for that. I added a concept implementation of it in Ruby. Michael J: Welcome :) – konsolebox Sep 11 '13 at 17:40

score 4 · Answer 4 · answered Sep 11 '13 at 19:50

Here's a solution using xmlstarlet (on some systems the binary is installed as xmlstarlet rather than xml):

xml sel -t -v //strong input.html
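A rough Python counterpart, assuming the input really is well-formed XML (the question's sample lines alone are not, since the span is never closed). The function name `strong_values` and the wrapped sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

def strong_values(xml_text):
    """Return the string value of every <strong> element in a well-formed document."""
    root = ET.fromstring(xml_text)  # raises ParseError if the input is not well-formed
    # root.iter("strong") walks the whole tree, comparable to the XPath //strong.
    return ["".join(el.itertext()) for el in root.iter("strong")]

# Hypothetical well-formed wrapper around the question's two targets.
doc = """<div>
<strong>Target1NoSpaces</strong><span class="creator">x</span>
<strong>Target2 With Spaces</strong><span class="creator">y</span>
</div>"""

for value in strong_values(doc):
    print(value)
```

Real-world HTML is often not well-formed, in which case this raises an error instead of guessing.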

answered Sep 11 '13 at 19:50

Slaven Rezic

score 3 · Answer 5 · edited May 23 '17 at 12:08

Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.

awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
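For contrast, a small sketch of what "a real HTML parser" could look like using only Python's standard library. The class and function names (`StrongText`, `extract_strong`) and the sample input are made up for illustration. It is meant to cope with the cases listed above: several tags on one line, tags with attributes, and content that spans lines.

```python
from html.parser import HTMLParser

class StrongText(HTMLParser):
    """Collect the text inside each outermost <strong> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <strong> elements are currently open
        self.parts = []     # text chunks of the element being read
        self.results = []   # finished strings, one per <strong> element

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def extract_strong(html):
    parser = StrongText()
    parser.feed(html)
    parser.close()
    return parser.results

# Hypothetical input: an attribute, a line break inside a tag, two tags on one line.
sample = (
    '<strong>Target1NoSpaces</strong><span class="creator"> ....\n'
    '<strong class="x">Target2 With\nSpaces</strong><strong>Third</strong>'
)

for text in extract_strong(sample):
    print(repr(text))
```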

edited May 23 '17 at 12:08

Community

answered Sep 11 '13 at 16:57

ThisSuitIsBlackNot

score 3 · Answer 6 · answered Sep 11 '13 at 16:58

You never need grep with awk and the field separator doesn't have to be whitespace:

$ awk -F'<|>'  '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces

You should really use a proper parser for this, however.

answered Sep 11 '13 at 16:58

Chris Seymour

That could match any text with `strong` in it. Or just a newline if no field was found. – konsolebox Sep 11 '13 at 17:00
@konsolebox if we get into the pitfalls of parsing HTML with `awk` we could be here a while. I simply wanted to demonstrate my point; the OP can filter or make it more robust as appropriate, or take my suggestion and use a proper HTML parser. – Chris Seymour Sep 11 '13 at 17:02

score 1 · Answer 7 · answered Sep 11 '13 at 17:04

Since you tagged Perl:

perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
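One caveat worth illustrating: the `.*` in that pattern is greedy. That is fine for the asker's input, where each line holds a single tag pair, but with two pairs on one line it captures everything from the first opening tag to the last closing tag. A short Python sketch of the difference (the sample line is made up, and Python's `re` is used here only because greedy and non-greedy quantifiers behave the same way as in Perl):

```python
import re

# Hypothetical line with two <strong> pairs.
line = "<strong>Target D</strong><strong>Target E</strong>"

# Greedy: ".*" runs to the LAST </strong> on the line.
greedy = re.findall(r"<strong>(.*)</strong>", line)

# Non-greedy: ".*?" stops at the FIRST </strong> after each opening tag.
lazy = re.findall(r"<strong>(.*?)</strong>", line)

print(greedy)  # ['Target D</strong><strong>Target E']
print(lazy)    # ['Target D', 'Target E']
```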

answered Sep 11 '13 at 17:04

Jean

score 0 · Answer 8 · answered Apr 20 '22 at 12:36

I am surprised no one has mentioned the W3C HTML-XML-utils:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' strong

output:

<strong class="fc-black-750 mb6">Stack Overflow
                    for Teams</strong>
<strong>Teams</strong>

To capture only content:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' -c strong

Stack Overflow
                    for Teams
Teams
