Parse HTML Using AWK

Question

I have the following HTML structure and want to extract data from it using awk.

<body>
<div>...</div>
<div>...</div>
<div class="body-content">
    <div>...</div>
    <div class="product-list" class="container">
        <div class="w3-row" id="product-list-row">
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product A</div>
                    <div class="product-price">100,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product B</div>
                    <div class="product-price">200,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product C</div>
                    <div class="product-price">300,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product D</div>
                    <div class="product-price">400,56</div>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

The result I want to have is as follows.

I was experimenting with the following awk script (I know it makes no sense to select product-price twice; I was about to modify this script):

awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/  { found=1 }'

but it gives me the result

100,56                </div>
200,56                </div>
300,56                </div>
400,56                </div>

I have never used awk before, so I can't figure out what is wrong here or how to modify the code above. How would you do this?

Can you use a tool that understands `xml` instead, e.g. `xmlstarlet`? — Ed Morton, Jun 27 '21 at 17:34
Awk is a great tool for many sorts of text searching, but it is not well-suited for hierarchical structures like HTML. You'd be much better off with a tool designed for the job. @Ed Morton's suggestion `xmlstarlet` is a fine choice for use from the shell. Alternatively, if you know any scripting languages (e.g. Perl, Python, Ruby, Javascript, ..) most of them have installable libraries for HTML parsing. — Mark Reed, Jun 27 '21 at 17:55
Actually, GNU awk has an XML library too - see http://gawkextlib.sourceforge.net/xml/xml.html. — Ed Morton, Jun 27 '21 at 17:57
@EdMorton true, though last I checked installing gawk add-ons was not as straightforward as using cpanm, pip, gem, npm, etc. — Mark Reed, Jun 28 '21 at 04:37

RavinderSingh13 · Accepted Answer · 2021-06-27T18:04:10.530

score 3

With your shown samples and attempts, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: a detailed, commented version of the above. It is for explanation only; to run the code, please use the one-liner above.

awk -F"[><]" '      ##Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}     ##Remove carriage-return (control-M) characters from the line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ##Match lines that start with
                    ##whitespace followed by <div class="product-price"> up to the closing div tag.
  print $3          ##Print the 3rd field, which holds the price.
}
' Input_file        ##Name of the input file.

Changed the regex to /^[ \t]+<div[ \t]+class as per Ed's suggestion in the comments. Also, experts always recommend using xmlstarlet or another XML-aware tool if one is available on your system.
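To see why `$3` holds the price, here is a minimal sketch in Python (not awk) that mimics the field splitting done by `-F"[><]"`. The function name `third_field` is made up for illustration, and the sample line is copied from the question with a trailing carriage return added.

```python
import re

def third_field(line):
    # Drop carriage returns, like gsub(/\r/,"") in the awk one-liner.
    line = line.replace("\r", "")
    # Split on every > or <, like awk's field separator [><].
    fields = re.split(r"[><]", line)
    # awk's $3 is the third field, which is index 2 here.
    return fields[2] if len(fields) > 2 else ""

line = '                    <div class="product-price">100,56</div>\r'
# Field 1 is the leading whitespace, field 2 is the tag, field 3 is the text.
print(third_field(line))
```

The leading whitespace before `<` counts as the first field, the tag text as the second, and the text between `>` and `<` as the third. This only works while each price sits on one line in exactly this shape.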

edited Jun 27 '21 at 18:04

answered Jun 27 '21 at 17:34

@Javiator, with your shown samples, this gives me correct results. Can you please run `cat -v file` to see if you have control-M characters in your file? – RavinderSingh13 Jun 27 '21 at 17:39
@RavinderSingh13, nice catch! The file contains control M characters. – Said Savci Jun 27 '21 at 17:40
@Javiator, oh ok, please try my edited code once and let me know if this edited solution works for you? – RavinderSingh13 Jun 27 '21 at 17:41
control-Ms in the input would not cause Ravinders original script to produce no output, it'd work just fine either way since it's not doing anything with the char at the end of each line. – Ed Morton Jun 27 '21 at 17:44
@EdMorton, maybe control-M characters are somewhere in the middle of lines too and I need to use `gsub` for that? I don't remember exactly, but I have seen control-M characters in the middle of lines somewhere, and in that case it may fail. – RavinderSingh13 Jun 27 '21 at 17:46
Maybe but I've never come across that and then what else might there be in the file? – Ed Morton Jun 27 '21 at 17:48
@RavinderSingh13, unfortunately, it does not work yet. I updated my HTML snippet (I had excluded some details, because I thought they wouldn't be important). – Said Savci Jun 27 '21 at 17:49
Reading the tea leaves - control-Ms are not the problem. – Ed Morton Jun 27 '21 at 17:49
@Javiator, could you please check `awk -F"[><]" '{gsub(/\r/,"")} /^[[:space:]]+<div class="product-price">.*<\/div>/{print $3}' Input_file` once and let me know if this helps? I have updated the same in my answer too now. – RavinderSingh13 Jun 27 '21 at 17:51
@Javiator If you didn't make a mistake copy/pasting the script and your real input does look like the example you provided then my best guess is either a) that's not a blank after `div` or, more likely b) you're using an awk that doesn't understand character classes. Try changing `/^[[:space:]]+
– Ed Morton Jun 27 '21 at 17:53
I'm curious though - does the `xmlstarlet` command I posted produce the output you want? – Ed Morton Jun 27 '21 at 17:55
@EdMorton, by changing `/^[[:space:]]+
– Said Savci Jun 27 '21 at 17:56
@Javiator, In case you have python3x, along with capability to install packages then you could try my additional solution https://stackoverflow.com/a/68154121/5866580 once too, cheers. – RavinderSingh13 Jun 27 '21 at 18:26

Ed Morton · Answer 2 · 2021-06-27T17:42:41.047

score 3

The result of a quick google for xmlstarlet print div contents and then a few secs of trial and error:

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

For an explanation - ask google :-).
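For readers who want to see what that XPath expression selects, here is a rough Python equivalent using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The `SNIPPET` below is a shortened, well-formed excerpt made up from the question's markup, not the real file. Like any strict XML parser, this one rejects input that is not well-formed, for example an element that repeats the `class` attribute as the `product-list` div in the question does.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of the question's markup (an assumption).
SNIPPET = """
<body>
  <div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
  </div>
  <div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET.strip())
# Same idea as //*[@class='product-price']: any descendant element
# whose class attribute is exactly 'product-price'.
prices = [el.text for el in root.findall(".//*[@class='product-price']")]
print("\n".join(prices))
```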

edited Jun 27 '21 at 17:42

answered Jun 27 '21 at 17:37

I just installed `xmlstarlet ` and tried to test it, but unfortunately the server gives me an HTML that is not well-formed. But I'll still upvote your answer! – Said Savci Jun 27 '21 at 18:00
That's far more likely to be a problem for an awk script than an XML-aware tool. That's WHY you should use an XML-aware tool. – Ed Morton Jun 27 '21 at 18:01

RavinderSingh13 · Answer 3 · 2021-06-27T18:28:59.847

If someone is looking for a Python solution, I would suggest the BeautifulSoup library. The following was written and tested in Python 3.8. To keep it separate from my previous answer, I am adding it as another answer here.

#!/usr/bin/env python3
## Import the library.
from bs4 import BeautifulSoup
## Read the whole Input_file; the with block closes the file automatically.
with open('Input_file', 'r') as f:
    contents = f.read()
## Parse the contents with the lxml parser into the soup variable.
soup = BeautifulSoup(contents, 'lxml')
## Find only the div elements with class product-price, as the OP needs.
vals = soup.find_all("div", {"class": "product-price"})
## Print the text inside each matching tag.
for val in vals:
    print(val.text)

NOTE:

You need BeautifulSoup and lxml installed; install them with pip3 or pip, depending on your system.
Input_file is the file the program reads all your data from.
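If installing packages is not an option, a similar result can be sketched with only the standard library's `html.parser`, which does not require well-formed input. This is a minimal sketch rather than a replacement for BeautifulSoup; the names `PriceExtractor`, `extract_prices` and `SAMPLE` are made up for illustration, and `SAMPLE` is a shortened excerpt of the question's markup, including its duplicated `class` attribute.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text of every <div> whose class list has product-price."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._depth = 0   # nesting depth of divs inside a matching div
        self._buf = []    # text pieces of the current matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1          # nested div inside a match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if "product-price" in classes:
                self._depth = 1
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.prices.append("".join(self._buf).strip())

def extract_prices(html):
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices

# Shortened excerpt of the question's markup (an assumption, not the real file).
SAMPLE = """
<div class="product-list" class="container">
  <div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
  </div>
  <div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
  </div>
</div>
"""

for price in extract_prices(SAMPLE):
    print(price)
```

To run it against a file, read the file contents as in the script above and pass them to `extract_prices`.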

score 2 · Answer 4 · answered Jun 28 '21 at 07:52

How would you do this?

If possible, use a tool designed for dealing with HTML, which GNU AWK is not.

If you are allowed to install software, use hxselect. It processes standard input and understands a subset of CSS selectors, so in this case something like:

hxselect -i -c -s '\n' div.product-price < file.html

should give you the desired result (disclaimer: I am not able to test it).

score 2 · Answer 5 · answered Jul 02 '21 at 12:34

It baffles me that time and time again people try to parse HTML, not with an HTML parser, but with a tool that doesn't understand HTML at all in general and with RegEx in particular!
With an HTML parser like xidel it's as simple as:

$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'
