0

I have a bash script:

v1='value="'
v2='" type'

do_parse_html_file() {
   sed -n "s/.*${v1}//;s/${v2}.*//p" "${_SCRIPT_PATH}/IBlockListLists.html"|egrep '^http' >${_tmp_file}
}

... which is extracting from html file only URLs. I would like to have on output:

somename URL
somename URL

--- example of the input html file is like the following:

</tr>
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="bcoepfyewziejvcqyhqo" readonly="readonly" onclick="select_text('bcoepfyewziejvcqyhqo');" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>Bluetack</td>
<td><img style="border:0;" src="I-BlockList%20%7C%20Lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="cslpybexmxyuacbyuvib" readonly="readonly" onclick="select_text('cslpybexmxyuacbyuvib');" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>

--- result should be like following:

iana-reserved http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&fileformat=p2p&archiveformat=gz iana-private http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&fileformat=p2p&archiveformat=gz

---Is it possible to have it by sed on one line command ? If so, please help.

the first part of the list - the "somename" is always first than following by the URL sitting on the next /does not have to be second/ line.

>somename   ... is delimited by   'href="URL">'   and   '</a>'       on one line           
>URL ... is always delimited by   'value="'       and   '" type'     on any following line 

thank you,
kind regards.
M.

masuch
  • 345
  • 1
  • 2
  • 8

3 Answers3

2

With my cli html parser Xidel it is a single line:

xidel "${_SCRIPT_PATH}/IBlockListLists.html" -e '//a/concat(., " ", @href)'
BeniBela
  • 16,412
  • 4
  • 45
  • 52
1

is not the right tool to do this.

I can show you some scripts to do it in or (ruby, java, php too) with a HTML parser. These are the right tools for this job.

This is the question maybe the most discussed on this web site, see this excellent post

One of that guys that makes this web site, wrote this too

Community
  • 1
  • 1
Gilles Quénot
  • 173,512
  • 41
  • 224
  • 223
0

Use a parser. There are many of them, here an example using HTML::TokeParser.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TokeParser;

my $p = HTML::TokeParser->new( shift ) || die;

while ( my $tag = $p->get_tag( 'a' ) ) { 
    printf qq|%s %s\n|, $p->get_text, $tag->[1]{href};
}

Run it like:

perl-5.14.2 script.pl htmlfile

That yields:

iana-reserved http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo
iana-private http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib
Birei
  • 35,723
  • 2
  • 77
  • 82