I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" to get the lines with the targets; that works fine.

Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: only the first word is preserved, as shown below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either in my awk or in a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

Michael J
  • 2
    What you're trying to do is "screen scraping". I assume that sooner or later you'll need something more general than "the text between 'strong'". Since you put "Perl" in your tags, I would encourage you to look at [WWW::Mechanize](http://search.cpan.org/~petdance/WWW-Mechanize-1.54/lib/WWW/Mechanize.pm). Otherwise, look here: [Linux - grep regex to pull out a string between two known strings](http://stackoverflow.com/questions/11850031/grep-regex-to-pull-out-a-string-between-two-known-strings). – paulsm4 Sep 11 '13 at 16:54

8 Answers

14

Try pup, a command line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target1NoSpaces
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.

kenorb
  • Thanks for this tip! I still follow this thread to stay up to date on simple CLI parsing. Thanks to your answer, TIL about `pup`! :D – Michael J Apr 22 '18 at 20:40
7

One way, using Mojolicious and its DOM parser:

perl -Mojo -E '
    g("http://your.web")
    ->dom
    ->find("strong")
    ->each( sub { if ( $t = shift->text ) { say $t } } )'
Birei
6

Use Perl-compatible regex look-behind and look-ahead in grep (the -P option). It should be simpler than using awk.

grep -oP "(?<=<strong>).*?(?=</strong>)" file

Output:

Target1NoSpaces
Target2 With Spaces
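
The -P option is a GNU grep feature and may not be available with other grep implementations. As a fallback, here is a minimal sketch of the same extraction using plain sed. It assumes at most one <strong>...</strong> pair per line, with both tags on the same line; extract_strong is just a hypothetical helper name for illustration.

```shell
# Hypothetical helper: print the text between <strong> and </strong>.
# Assumption: at most one <strong>...</strong> pair per line, on a single line.
# Lines without a match print nothing (-n plus the p flag).
extract_strong() {
  sed -n 's/.*<strong>\(.*\)<\/strong>.*/\1/p' "$@"
}

# Example run on the two sample lines from the question.
printf '%s\n' \
  '<strong>Target1NoSpaces</strong><span class="creator"> ....' \
  '<strong>Target2 With Spaces</strong><span class="creator"> ....' |
  extract_strong
```

With several pairs on one line, the greedy .* would keep only the last one, so this is only suitable for regular input like the sample.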

Addendum:

The same look-around pattern in Ruby, with the /m flag so that . also matches newlines, can capture values that span multiple lines:

ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file

Input:

<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>

Output:

----------
Target
A
B
C
----------
Target D
----------
Target E
konsolebox
  • 1
    +1 good approach if the tag doesn't split over multiple lines. – Chris Seymour Sep 11 '13 at 17:04
  • 1
    This works perfectly. The tags are on a single line and are words or a collection of two/three words. The output is regular and will always be the same so an approach that does anything more is overkill. I like the awk and perl implementations and will tuck those away for future use. Thanks, all for elevating my knowledge! – Michael J Sep 11 '13 at 17:31
  • @sudo_O Perl regex's multi-line feature could actually be used for that. I added a concept implementation of it in Ruby. Michael J: Welcome :) – konsolebox Sep 11 '13 at 17:40
4

Here's a solution using xmlstarlet:

xml sel -t -v //strong input.html
Slaven Rezic
3

Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.

awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
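
If more than one <strong> tag can appear on a line, a loop over match() is one way to work around that limitation while staying in awk. This is only a sketch: it assumes the tag content contains no nested tags (no "<" inside) and still does not handle tags split across lines. extract_all_strong is a hypothetical helper name.

```shell
# Hypothetical helper: print every <strong>...</strong> payload on each line.
# Assumptions: no "<" inside the tag content, and each pair sits on one line.
extract_all_strong() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH for the leftmost match in line.
    while (match(line, /<strong>[^<]*<\/strong>/)) {
      # Skip the 8-character "<strong>" and drop the 9-character "</strong>".
      print substr(line, RSTART + 8, RLENGTH - 17)
      # Continue scanning after the match just consumed.
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$@"
}

# Example: two tags on the same line yield two output lines.
printf '%s\n' '<strong>Target D</strong><strong>Target E</strong>' |
  extract_all_strong
```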
ThisSuitIsBlackNot
3

You never need grep with awk and the field separator doesn't have to be whitespace:

$ awk -F'<|>'  '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
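
To see why the target ends up in $3, it can help to print the first few fields that awk produces when both < and > act as separators. A small sketch (show_fields is a hypothetical helper name, and the limit of four fields is arbitrary):

```shell
# Hypothetical helper: show how awk splits a line when "<" or ">" is the separator.
# Only the first four fields are printed, which is enough for the sample lines.
show_fields() {
  awk -F'<|>' '{ for (i = 1; i <= 4; i++) printf "$%d=[%s]\n", i, $i }' "$@"
}

# Example run on the second sample line from the question.
printf '%s\n' '<strong>Target2 With Spaces</strong><span class="creator"> ....' |
  show_fields
```

For that line, $1 is empty because the line starts with a separator, $2 is strong, $3 is the target text including its spaces, and $4 is /strong.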

You should really use a proper parser for this however.

Chris Seymour
  • 1
    That could match any text with `strong` in it. Or just a newline if no field was found. – konsolebox Sep 11 '13 at 17:00
  • 2
    @konsolebox If we get into the pitfalls of parsing HTML with `awk`, we could be here a while. I simply wanted to demonstrate my point; the OP can filter it or make it more robust as appropriate, or take my suggestion and use a proper HTML parser. – Chris Seymour Sep 11 '13 at 17:02
1

Since you tagged perl:

perl -ne 'print "$1\n" if /<strong>(.*?)<\/strong>/' input.html
Jean
0

I am surprised no one mentions the W3C HTML-XML-utils:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' strong

Output:

<strong class="fc-black-750 mb6">Stack Overflow
                    for Teams</strong>
<strong>Teams</strong>

To capture only content:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' -c strong

Output:

Stack Overflow
                    for Teams
Teams

Weihang Jian