
I'm looking to create a quick script, but I've run into some issues.

<li type="square"> Y </li>

I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one file it might be "Dave", and in another "Chris". So I'm trying to get the bash script to find

<li type="square"> </li>

and tell me what is in between the two tags. The general formatting of the file is very messy:

<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>   
<br/><br/><li type="square">Chris</li><more html stuff><br/>

I've been unable to come up with anything that works for parsing the file, and would really appreciate a push in the right direction.

EDIT -

<div class="post">
                    <hr class="hrcolor" width="100%" size="1" />
                    <div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
                </div>

is the block of code that I'm looking to extract the names from. The "-" symbol is something added onto the list items to narrow the match, so I get just that list. The problem I'm having is that:

awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt

only outputs the first list item, and not the rest.
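
A likely reason, sketched here rather than confirmed against the real page: with that FS, all four names sit on one input line, so awk splits them into fields $2, $3, $4 and $5, and `print $2` shows only the first. Looping over every field from 2 to NF prints them all. The function name `extract_names` is hypothetical, and the `sub()` call is an added step to drop the closing tags that stay attached to the last field.

```shell
#!/bin/sh
# extract_names FILE: print every name that follows a "tag(s) then -" separator.
# Assumption: the names are marked with a leading "-" as in the question.
extract_names() {
    awk -F'(<[^>]*>)+-' '{
        # Field 1 is whatever precedes the first separator, so start at 2.
        for (i = 2; i <= NF; i++) {
            name = $i
            sub(/<.*/, "", name)   # strip trailing tags left on the last field
            print name
        }
    }' "$1"
}
```

It could be called as `extract_names 4287022.html > output.txt`. Lines without a matching separator have only one field, so they print nothing.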

4 Answers


You generally should not use regex to parse html files.

Instead you can use my Xidel to perform pattern matching on it:

xidel 4287022.html -e '<li type="square">{.}</li>*'

Or with traditional XPath:

xidel 4287022.html -e '//li[@type="square"]'
BeniBela
  • Xidel looks like a great utility; unfortunately, as of 26 April 2013, there's no prebuilt `OS X` binary, but it can be compiled from source; see [here](https://www.evernote.com/shard/s69/sh/ff1e78f3-a369-4855-b18f-6184ce789c45/f3511927d0fb356ce883835f2eb712e0) – mklement0 Apr 26 '13 at 16:59

You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.
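
As a minimal sketch of how this could be extended to print only the names, the match can be piped through sed to strip the surrounding tags and the optional leading "-". This assumes a grep that supports `-o` (GNU and BSD grep do; it is not in POSIX) and a sed that supports `-E`. The function name `list_names` is hypothetical, and `[^<]+` is used in place of `\w+` so that names containing spaces or punctuation still match.

```shell
#!/bin/sh
# list_names FILE: print the text of every <li type="square"> item, one per line.
list_names() {
    # -o prints each match on its own line, even when several share one input line.
    grep -Eo '<li type="square">-?[^<]+</li>' "$1" |
        # Remove the opening tag (plus optional "-") and the closing tag.
        sed -E 's/^<li type="square">-?//; s/<\/li>$//'
}
```

For example, `list_names 4287022.html > output.txt` would write one name per line. Like any regex approach, this depends on the tag being written exactly as `<li type="square">`.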

Jack
  • +1 for the best approach short of using an HTML parser. Here's the full solution that extracts just the element *values*, excluding the initial "-" (i.e., in the OP's example, `dave`, `chris`, …): `grep -Eo "<li type=\"square\">-[^<]+" t.html | grep -Eo '>[^>]+' | cut -c 3-` Also note that there's no point in enclosing the `\w+` in the regex in `()`, because `grep -Eo` doesn't support capturing subgroups - at least in GNU grep 2.10 (Ubuntu 12.04) and BSD grep 2.5.1 (OS X 10.8.3). – mklement0 Apr 22 '13 at 03:13