Using AWK/Grep/Bash to extract data from HTML

Question

I'm trying to write a Bash script to extract results from an HTML page. I managed to fetch the page content with curl, but the next step, parsing the output, is proving problematic.

The interesting content of the page looks like this:

<div class="result">
    ...
                <div class="item">
                    <div class="item_title">ITEM 1</div>
                </div>
                ...                                 
                <div class="item_desc">
                    ITEM DESCRIPTION 1
                </div>
...              
</div>
<div class="result">
    ...
                <div class="item">
                    <div class="item_title">ITEM 2</div>
                </div>
                ...                                 
                <div class="item_desc">
                    ITEM DESCRIPTION 2
                </div>
    ...              
</div>

I'd like to output something like:

ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2

I know a bit of grep, but I can't wrap my mind around making it work here. Some people also told me to use awk, which seems best suited for this kind of task.

I'd appreciate any help.

Thank you very much.

Are you only allowed to use awk and grep? Other languages, such as Python and Perl, provide very good libraries for this. — chinuy, May 19 '14 at 20:42
You'd probably do best with a tool to parse HTML properly. For example, is the `<div class="item_title">` always on the same line as the matching `</div>`? Is the `<div class="item_desc">` always on a line on its own and the matching `</div>` also on a line on its own? Is the item description always one line? — Jonathan Leffler, May 19 '14 at 20:43
@chinuy No, I can't use Python or Perl, which I'd rather use, but well. Only awk/grep/sed, whatever is in coreutils, I think. — NerdNot, May 19 '14 at 20:45
@JonathanLeffler , thanks for your comment. Yes, the HTML page is well structured, and consistently so. My problem is really extracting multiple blocks matching a pattern. — NerdNot, May 19 '14 at 20:48
[How about this?](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — gniourf_gniourf, May 19 '14 at 21:11
@gniourf_gniourf: I'm only interested in the bits I showed here. The elements (divs etc.) are well structured. There's no change in format. — NerdNot, May 19 '14 at 21:22

Jonathan Leffler · Accepted Answer · 2014-05-19T21:17:19.113

A bare minimal program to handle the HTML, loosely, with no validation, and easily confused by variations in the HTML, is:

sed.script

/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/ *<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}

The first line matches item title lines. The `s///` command captures just the part between the `<div …>` and `</div>`; the `h` copies that into the hold space (memory).

The rest of the script matches lines between the item description `<div>` and its `</div>`. The first two lines delete (ignore) the `<div>` and `</div>` lines. The `s///` removes leading spaces; the `G` appends the hold space to the pattern space after a newline; the `s///p` captures the part before the newline (the description) and the part after the newline (the title from the hold space), replaces them with the title and description separated by a semicolon, and prints the result.

Example

$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$

Note the `-n`; that means "don't print unless told to do so".

You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the `;` after the `h` is necessary with BSD `sed` and harmless but not crucial with GNU `sed`.
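As a rough sketch of the "without a script file" variant, the same commands can be passed as separate `-e` arguments, which `sed` joins with newlines. The function names `sample_html` and `extract_items` are made up for this illustration, and the sample input is a trimmed copy of the question's HTML with the elided `...` lines left out.

```shell
#!/bin/sh
# Sample input modeled on the question's HTML (elided "..." lines omitted).
# sample_html is a hypothetical helper, used only to keep this self-contained.
sample_html() {
    cat <<'EOF'
<div class="result">
                <div class="item">
                    <div class="item_title">ITEM 1</div>
                </div>
                <div class="item_desc">
                    ITEM DESCRIPTION 1
                </div>
</div>
<div class="result">
                <div class="item">
                    <div class="item_title">ITEM 2</div>
                </div>
                <div class="item_desc">
                    ITEM DESCRIPTION 2
                </div>
</div>
EOF
}

# Same commands as sed.script, one -e per script line; reads HTML on stdin.
extract_items() {
    sed -n \
        -e '/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }' \
        -e '/ *<div class="item_desc">/,/<\/div>/ {' \
        -e '/<div class="item_desc">/d' \
        -e '/<\/div>/d' \
        -e 's/^  *//' \
        -e 'G' \
        -e 's/\(.*\)\n\(.*\)/\2;\1/p' \
        -e '}'
}

sample_html | extract_items
```

In a real script the input would presumably come from the `curl` output instead of `sample_html`.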

Modification

There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:

/ *<div class="item_title">\(.*\)<\/div>/

could be revised to:

/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/

to deal with arbitrary sequences of white space before, in the middle, and after the `<div>` components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
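One possible way to get the "multi-line description printed once" behaviour is to accumulate the description lines in the hold space with `H` and print only when the closing `</div>` is reached. This is a sketch, not part of the original script; the function name `join_descriptions` and the sample input are invented for the illustration, and it assumes the same one-tag-per-line layout as before.

```shell
#!/bin/sh
# Sketch: emit "title;description" once per item, joining a multi-line
# description with single spaces. Reads HTML on stdin.
# Title rule: strip the tags and leading space, then start the hold space.
# Description range: append each trimmed line to the hold space (H); at the
# closing </div>, fetch it (g), turn the first newline into ";" and the
# remaining newlines into spaces, then print.
join_descriptions() {
    sed -n '
/<div class="item_title">\(.*\)<\/div>/ {
    s//\1/
    s/^[[:space:]]*//
    h
}
/<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/ {
        g
        s/\n/;/
        s/\n/ /g
        p
        d
    }
    s/^[[:space:]]*//
    s/[[:space:]]*$//
    H
}
'
}

# Hypothetical input with a two-line description.
printf '%s\n' \
    '<div class="item_title">ITEM 1</div>' \
    '<div class="item_desc">' \
    '    FIRST LINE' \
    '    SECOND LINE' \
    '</div>' | join_descriptions
```

With this input the two description lines come out as a single `ITEM 1;FIRST LINE SECOND LINE` record.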

You could also wrap the whole construct in the file inside:

/^<div class="result">$/,/^<\/div>$/ {
    …script as before…
}

And you could repeat that idea so that the item title is only picked inside `<div class="item">` and `</div>`, etc.
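Put together, the wrapped form might look like the sketch below. It assumes the outer `<div class="result">` and its closing `</div>` start in column one with no trailing white space, as in the question's sample. The function name `extract_results_only` and the `sidebar` block in the test input are hypothetical, added only to show that matches outside a result block are ignored.

```shell
#!/bin/sh
# Sketch: the original script nested inside the outer "result" range, so
# titles and descriptions outside a result block are never considered.
extract_results_only() {
    sed -n '
/^<div class="result">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
'
}

# Hypothetical input: a stray block outside any result, then one real result.
printf '%s\n' \
    '<div class="sidebar">' \
    '    <div class="item_title">STRAY</div>' \
    '    <div class="item_desc">' \
    '        STRAY DESCRIPTION' \
    '    </div>' \
    '</div>' \
    '<div class="result">' \
    '    <div class="item_title">ITEM 1</div>' \
    '    <div class="item_desc">' \
    '        ITEM DESCRIPTION 1' \
    '    </div>' \
    '</div>' | extract_results_only
```

Only the item inside the result block is reported; the stray title and description are skipped.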

Thank you VERY MUCH for your answer, it does the job perfectly. Thank you again. — NerdNot, May 19 '14 at 21:20

Ed Morton · Answer 2 · 2014-05-19T23:47:21.570

Just use awk:

awk -F '<[^>]+>' '
    found { sub(/^[[:space:]]*/,";"); print title $0; found=0 }
    /<div class="item_title">/ { title=$2 }
    /<div class="item_desc">/  { found=1 }
' file
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
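To see why `title=$2` picks up the title text, it may help to look at how `-F '<[^>]+>'` splits a line: every HTML tag acts as a field separator, so the text between the opening and closing tags lands in the second field. The helper name `show_fields` below is made up for this illustration.

```shell
#!/bin/sh
# Print each field of every input line when HTML tags are the separators.
show_fields() {
    awk -F '<[^>]+>' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

# Field 1 is the leading white space, field 2 the text between the tags.
printf '%s\n' '    <div class="item_title">ITEM 1</div>' | show_fields
```

The description line contains no tags at all, which is why the `found` rule works on `$0` rather than on a numbered field.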

edited May 19 '14 at 23:47

answered May 19 '14 at 23:41

Ed Morton
