Using bash in order to extract data from a HTML forum list

Question

I'm looking to create a quick script, but I've run into some issues.

<li type="square"> Y </li>

I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one it might be "Dave", and in the other "Chris". So I'm trying to get the bash script to find

<li type="square"> </li>

and tell me what is in between the two. The general formatting of the file is very messy:

<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>   
<br/><br/><li type="square">Chris</li><more html stuff><br/>

I've been unable to come up with anything that works for parsing the file, and would really appreciate someone to give me a push in the right direction.

EDIT -

<div class="post">
                    <hr class="hrcolor" width="100%" size="1" />
                    <div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
                </div>

is the block of code that I'm looking to extract the names from. The "-" symbol is something added to the list items to narrow the scope of the match, so that I get just that list. The problem I'm having is that:

awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt

only outputs the first list item, and not the rest.

Don't parse HTML with regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — tripleee, Apr 21 '13 at 05:52

score 2 · Answer 1 · answered Apr 21 '13 at 08:15


You generally should not use regex to parse HTML files.

Instead you can use my Xidel to perform pattern matching on it:

xidel 4287022.html -e '<li type="square">{.}</li>*'

Or with traditional XPath:

xidel 4287022.html -e '//li[@type="square"]'


BeniBela


Xidel looks like a great utility; unfortunately, as of 26 April 2013, there's no prebuilt `OS X` binary, but it can be compiled from source; see [here](https://www.evernote.com/shard/s69/sh/ff1e78f3-a369-4855-b18f-6184ce789c45/f3511927d0fb356ce883835f2eb712e0) – mklement0 Apr 26 '13 at 16:59

Jack · Answer 2 · 2013-04-21T23:44:18.327

1

You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.

edited Apr 21 '13 at 23:44

answered Apr 21 '13 at 03:49

Jack


+1 for the best approach short of using an HTML parser. Here's the full solution that extracts just the element *values*, excluding the initial "-" (i.e., in the OP's example, `dave`, `chris`, …): `grep -Eo "<li type=\"square\">-[^<]+" t.html | grep -Eo '>[^>]+' | cut -c 3-` Also note that there's no point in enclosing the `\w+` in the regex in `()`, because `grep -Eo` doesn't support capturing subgroups - at least in GNU grep 2.10 (Ubuntu 12.04) and BSD grep 2.5.1 (OS X 10.8.3).

mklement0

Apr 22 '13 at 03:13
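
For anyone who wants to try the grep approach from this answer and comment end to end, here is a minimal sketch. The temporary sample file and the `names` variable are made up for illustration, and it assumes a grep that supports `-o` (GNU and BSD grep do; it is not in POSIX). It swaps the second `grep` and the `cut` for a single `sed`, which is a variation and not what the comment itself proposes.

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the list line from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
EOF

# Step 1: grep -o prints every match on its own line, so several <li>
#         items sitting on one input line are all reported.
# Step 2: sed strips the opening tag and then the optional leading "-".
names=$(grep -Eo '<li type="square">-?[^<]+' "$sample" |
        sed -e 's/^<li type="square">//' -e 's/^-//')

printf '%s\n' "$names"
rm -f "$sample"
```

Because the tag is matched literally, this still breaks if the attribute order or quoting changes, which is the usual caveat with regex on HTML.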

Zombo · Answer 3 · 2013-04-21T05:23:07.810

0

awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html

This treats the HTML page as a table in which runs of HTML tags, rather than runs of whitespace, act as the field separator. The first field in this case is the whitespace at the beginning of the line. Fields two through five are the names, so we print those.

Result

-dave -chris -sarah -amber
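
The command above hard-codes four fields. A sketch of the same tag-run field separator that loops over however many fields a line has, and keeps only those starting with the "-" marker, might look like this. The sample file and the `names` variable are assumptions for illustration, and it relies on an awk that treats a multi-character `-F` value as a regular expression, as gawk, mawk and the BSD awk do.

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the block from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="post">
                    <hr class="hrcolor" width="100%" size="1" />
                    <div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
                </div>
EOF

# Runs of tags are the field separator; visit every field on every line
# and print the ones that begin with "-", dropping that marker.
names=$(awk -F'(<[^>]*>)+' '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^-/) print substr($i, 2)
}' "$sample")

printf '%s\n' "$names"
rm -f "$sample"
```

This prints one name per line, which is easier to redirect into output.txt than a single space-separated row.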

edited Apr 21 '13 at 05:23

answered Apr 21 '13 at 03:50

Zombo


score 0 · Answer 4 · answered Apr 21 '13 at 04:59


Using sed:

sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html


perreal


Neat, but this will only work if each `<li>` element is on its own line.

mklement0

Apr 21 '13 at 05:30
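
One possible way around the one-element-per-line limitation is to break the input at every `<` first, so each tag starts its own line before sed sees it. The sketch below assumes that approach; the sample file and the `names` variable are hypothetical, and the sed expression is a variation on the one in the answer, not the answerer's own command.

```shell
#!/usr/bin/env bash
# Hypothetical sample file: all list items on a single line, as in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
EOF

# tr turns every "<" into a newline, so each opening tag plus the text
# that follows it lands on its own line, e.g.  li type="square">-dave
# sed then keeps only those lines and strips the tag and the optional "-".
names=$(tr '<' '\n' < "$sample" |
        sed -n 's/^li type="square"> *-\{0,1\}\(.*\)$/\1/p')

printf '%s\n' "$names"
rm -f "$sample"
```

Only POSIX features of tr and sed are used here, so it should behave the same with the GNU and BSD tools.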
