21

I have an HTML file with lots of data; the part I am interested in is:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I am trying to use awk, which currently is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want is to have:

54
1
0
0

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

kenorb
  • 155,785
  • 88
  • 678
  • 743
Lenny
  • 887
  • 1
  • 11
  • 32
  • Is the 2nd-last zero output because there's no `<b>` tag at all or because there's a `<td>` value of `0 (0/0)`? – Ed Morton Aug 18 '14 at 15:06

11 Answers

29

awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool which is able to execute XPath queries, and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.

Also, you can use a programming language which is able to parse HTML.
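
For illustration, here is a minimal xsltproc sketch (the stylesheet name cells.xsl is arbitrary, and it assumes the rows sit inside a complete HTML document) that prints the text of every right-aligned cell:

$ cat > cells.xsl <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- print the normalized text of every right-aligned cell, one per line -->
  <xsl:template match="/">
    <xsl:for-each select="//td[@align='right']">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
$ xsltproc --html cells.xsl index.html
54
1
0 (0/0)
0

The 0 (0/0) cell would still need a cleanup step (for example, cutting at the first space), but the selection itself no longer depends on how the tags happen to be written.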

hek2mgl
  • 152,036
  • 28
  • 249
  • 266
  • 1
    No-one said it was. Definitely can (easily) parse single pieces of data from it with awk though. –  Aug 18 '14 at 13:26
  • @Jidder No, you can't. Regex is a regular expression language, but XML is context-free. It is therefore theoretically impossible to use regex for parsing XML. – dirkk Aug 18 '14 at 15:23
  • 3
    @dirkk it's really not, it might be incredibly difficult (not impossible) to parse entire segments effectively, but for retrieving small pieces of data as the question asks it's actually extremely easy with regex. Everyone just jumps on the don't-parse-XML/XHTML/HTML bandwagon without even understanding the argument in the first place, as you can see by all the upvotes on this "answer". Look at the accepted answer, which clearly parses the data in the question. – Aug 19 '14 at 07:42
  • @Jidder I didn't "jump on a bandwagon". I just described my personal experience. Of course you are free to see an XML/HTML document just as a piece of text, and use regexes to extract information. Btw, this is what a DOM parser is actually doing. However, most real-world use cases are more complex, fragile against modifications or simple variations of the input format, and therefore hard to maintain - in a real-world project. I'm sure the up-voters have had similar experiences. – hek2mgl Aug 19 '14 at 07:59
  • 1
    @hek2mgl You made a valid point and there is no denying that it is better to parse with an XML/HTML parser in real-world projects. It's just I think this isn't really an answer to the question that OP asked and would be better suited as a comment. This particular question could be solved using regex and there is no need for OP to download an HTML parser. As for the bandwagon comment, I was referring to the upvotes rather than your actual answer. – Aug 19 '14 at 08:26
  • 10
    @Jidder It is **impossible** to correctly parse XML using regex, not just difficult. The comment section is too short for a proof, but the Chomsky hierarchy is a good keyword for further research. This is scientifically proven. Just because it works in this case does not mean it is correct. The problem is that it _looks_ correct, and this is why so many people try to use regex for XML parsing - and because that is incorrect and opens you up to a world of pain, so many people advise against it. Rightfully so. – dirkk Aug 19 '14 at 08:43
  • @dirkk Provide me a link that proves it is impossible to parse XML using awk? Maybe it is impossible using stripped-down basic regex, but it is definitely not impossible using awk, where you can use counters and functions to effectively keep track of how many braces you are inside. – Aug 19 '14 at 10:09
  • 1
    @Jidder https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – dirkk Aug 19 '14 at 10:14
  • 3
    @dirkk There will be no proof of that. Of course you can write an HTML parser in `awk` since it is Turing-complete. Also you need to understand that extracting information from a text file, and fully understanding and representing a *document*, are two different things. But hey, I would still use a ready-to-use parser instead of writing a custom one again and again with `awk`.. – hek2mgl Aug 19 '14 at 10:15
  • @hek2mgl me too. As I said before, it is definitely the better option; just trying to make people aware that it is possible and not just to blindly follow something that they have seen – Aug 19 '14 at 10:25
  • 2
    @hek2mgl ah, I see. If awk is Turing-complete you are in fact correct (I don't know much about awk; I thought it was restricted to regular languages). So, to sum up: Don't use regular expressions to parse XML. You can use awk to parse XML, but you shouldn't (for the reasons mentioned in the answer and here in the comments). – dirkk Aug 19 '14 at 10:28
  • 1
    Answer is correct, you should be using the right tools for the job. Otherwise you're less an artisan and more an amateur. – paxdiablo Apr 27 '15 at 15:47
  • Answer is unhelpful. Most XML tools are too strict and won't parse messy arbitrary HTML without a lot of help. – greyfade Nov 01 '19 at 19:21
  • @greyfade The HTML in the question is valid. If the question were "how to extract information from a specific, hopelessly invalid HTML document", then awk would probably be a good choice. For writing a generic quirks-mode HTML parser, probably not. PS: IIRC `xmllint --html` accepts invalid HTML up to some degree. – hek2mgl Nov 04 '19 at 18:06
  • 1
    @hek2mgl I've recently been informed of one that's much more forgiving. I posted an answer below. – greyfade Nov 06 '19 at 22:45
14
awk  -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file

Output:

54
1
0
0

Another:

awk  -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file
konsolebox
  • 72,135
  • 12
  • 99
  • 105
  • @Lenny make sure you read and fully understand all the caveats discussed at http://awk.info/?tip/getline before using `getline`. In this case there's just no need for the `getline` loop at all, a simple flag would do `f{ subs(..); print; if (!/...` – Ed Morton Aug 19 '14 at 12:51
  • @EdMorton You have to move the `if (!/...` test as well. `getline > 0` is completely safe, and safe enough if you read the manual properly. It's pretty clear how different syntaxes differ in function. The only thing to really take note of is *The getline command returns 1 on success, 0 on end of file, and -1 on an error.* – konsolebox Aug 19 '14 at 13:06
  • Yes, the test on `!/...` would have to move too. With getline you need to put the `print > "foo"` in 2 places, whereas with the normal approach of just letting the awk loop do what it does you only need to put the `print > "foo"` in one place. Avoiding getline when it's not needed isn't only about writing safe code, it's also about writing code that can be maintained and extended easily. – Ed Morton Aug 19 '14 at 13:22
  • @EdMorton I don't agree about it being extended easily. See this code I had written a long time ago where flags (over getline) can barely apply: http://sourceforge.net/p/playshell/code/ci/master/tree/loader/compiler.gawk. The last update I made was just to make sure `getline` returns 1 and not just nonzero. – konsolebox Aug 19 '14 at 13:36
  • @konsolebox I just gave a simple, common example of non-getline code being easier to extend. In any case, my comment was directed to the OP, now he is aware of the pros/cons and different opinions on the appropriate getline usage. I looked at your compiler code and it could have been written more robustly and concisely without getline. Just kidding - of course I'm not going to read hundreds of lines of awk code and try to figure out what it does and what it'd look like without getline or do any other kind of analysis on it. – Ed Morton Aug 19 '14 at 13:41
  • To be clear - I'm not saying never use getline. I use it myself when I want to, say, read in a mapping file in the BEGIN section before parsing multiple files in a different format, or when doing recursive-descent inlining of files that use `include` directives. I'm just saying make sure you're aware of all the caveats when making a decision to use it or not, and don't use it when it's just as easy or easier to use awk's normal implicit file-reading loop, because in addition to all of its caveats it usually makes the next thing you want to do more difficult. – Ed Morton Aug 19 '14 at 13:49
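
For reference, a flag-based version of the second script, along the lines discussed in the comments above, might look like this (an untested sketch that reuses the answer's field-splitting trick; the flag f marks that the Total row has been seen):

awk -F '[<>]' '
f {
    if (!/<td /) exit    # stop at the first non-cell line after the Total row
    gsub(/<b>/, "")      # rewriting $0 makes awk re-split the fields
    sub(/ .*/, "", $3)   # trim "0 (0/0)" down to "0"
    print $3
}
/<td><b>Total<\/b><\/td>/ { f = 1 }
' file
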
5

HTML-XML-utils

You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes a number of command-line tools to extract or modify the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

Here is the example with provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

Here is the final example, with the <b> tags stripped out:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0

For more examples, check the html-xml-utils documentation.

kenorb
  • 155,785
  • 88
  • 678
  • 743
4

You really should use a real HTML parser for this job, like:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

54
1
0
0

But for this you need to have Perl and the Mojolicious package installed.

(It is easy to install with:)

curl -L get.mojolicio.us | sh
clt60
  • 62,119
  • 17
  • 107
  • 194
4
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 7
    Good answers accompany code samples with an explanation for future readers. While the person asking this question may understand your answer, explaining how you arrived at it will help countless others. – Stonz2 Aug 18 '14 at 15:43
  • 1
    That's fine, but it takes about 15 seconds on average to crank out an answer and a few minutes to document it, so I have time to do the former but not the latter for every question, especially the ones that IMHO are self-evident. If anyone has questions I'm happy to answer them. – Ed Morton Aug 18 '14 at 16:35
4

BSD/GNU grep/ripgrep

For simple extraction, you can use grep. For example:

  • Your example using grep:

    $ egrep -o "[0-9][^<]*" file.html
    54
    1
    0 (0/0)
    0
    

    and using ripgrep:

    $ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
    54
    1
    0 (0/0)
    0
    
  • Extracting outer html of H1:

    $ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    

Other examples:

  • Extracting the body:

    $ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' '.

  • For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is way faster since it's written in Rust.

kravietz
  • 10,667
  • 2
  • 35
  • 27
kenorb
  • 155,785
  • 88
  • 678
  • 743
2

I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0
greyfade
  • 24,948
  • 7
  • 64
  • 80
2

With xidel, a true HTML parser, and XPath:

$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0

$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0
Reino
  • 3,203
  • 1
  • 13
  • 21
1

ex/vim

For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select/delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

  • Use the ex in-place editor to substitute on all lines (%) via: ex +"%s/pattern/replace/g".

    The substitution pattern consists of 3 parts:

    • Select from the beginning of the line till > (^[^>].*>) for removal, right before the 2nd part.
    • Select our main part till < (\([^<]\+\)).
    • Select everything else after < for removal (<.*).
    • We replace the whole matching line with \1, which refers to the pattern inside the escaped parentheses \(...\).
  • After substitution, we remove any lines still containing letters by using the global command: g/[a-zA-Z]/d.

  • Finally, print the current buffer on the screen with +%p.
  • Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").

Once tested, to replace the file in-place, change -scq! to -scwq.


Here is another simple example, which removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, it's not advisable to use regex for parsing HTML, so as a long-term approach you should use an appropriate language (such as Python, Perl, or PHP DOM).


kenorb
  • 155,785
  • 88
  • 678
  • 743
1

What about:

lynx -dump index.html
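
lynx renders the table as plain text, so the numbers can then be picked out with ordinary text tools. A rough sketch, assuming the row renders on a single line as Total 54 1 0 (0/0) 0 (the exact layout depends on the lynx version and terminal width):

$ lynx -dump index.html | awk '/Total/ { print $2; print $3; print $4; print $6 }'
54
1
0
0
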
Jean
  • 11
  • 1
0

In the past I used PhantomJS, but now you can do it with similar tools that are still maintained, like Selenium WebDriver.

It makes it possible to use DOM API functions with JavaScript in a headless browser like Firefox or Chromium, and you can call the script (written in Node.js or Python, for example) from your shell script if you want to do additional processing in the shell.
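
If all you need is the DOM after JavaScript has run, rather than scripted interaction, a lighter sketch (assuming a Chromium build with headless support is installed) is to dump the rendered DOM from the shell and then feed it to any of the text tools above:

$ chromium --headless --dump-dom 'http://example.com/' > rendered.html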

baptx
  • 3,428
  • 6
  • 33
  • 42