0

What command should I use to extract the text from the following HTML, which sits in a file named "test.html" containing: "<span id="imAnID">extractme</span>"?

The real file will be larger, so I need to point grep or sed at an ID and extract only the text from the tag that has this ID. Assuming I run the terminal from the directory where the file resides, I am doing this:

cat test.html | sed -n 's/.*<span id="imAnID">\(.*\)<\/span>.*/\1/p'

What am I doing wrong? I get empty output. I'm not opposed to using grep for this if it's easier.

Andy Lester
Meh

4 Answers

0

You can try doing it with awk instead:

  #!/bin/bash

  start_tag="span id=\"imAnID\""
  end_tag="/span"

  awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{ i=1; while (i<=NF) { if ($(i)==taga && $(i+2)==tagb) { print $(i+1) }; i++} }'

Use this by:

$ ./script < infile > outfile
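
To see why comparing $(i), $(i+1) and $(i+2) works, it can help to print the fields awk produces when it splits a line on < and >. This is only a sketch, assuming a one-line sample like the one in the question; show_fields is a made-up helper name, not part of the script above.

```shell
# Hypothetical helper: print every field awk sees when the
# input line is split on '<' and '>' (same -F as the script above).
show_fields() {
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

# One-line sample taken from the question.
printf '%s\n' '<span id="imAnID">extractme</span>' | show_fields
```

For this sample, fields 2, 3 and 4 hold the opening tag, the text and the closing tag, which is exactly the pattern the while loop tests for. If the opening tag carries extra attributes or the span breaks across lines, the equality test on the opening tag no longer matches.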
sampson-chen
  • Found a good example of when the script does not work. Could you test this and let me know whether it works? That would be superb. I think it may be the way it is structured. I used a URL from Amazon as an example; here is the script: `#!/bin/bash wget -q -O ama.html "http://www.amazon.co.uk/Asus-GTX660-TI-DC2OG-2GD5-Borderlands-PCI-Express/dp/B008X36NHA/ref=sr_1_1?ie=UTF8&qid=1351617700&sr=8-1" start_tag="span id=\"priceLarge\"" end_tag="/span" awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{ i=1; while (i<=NF) { if ($(i)==taga && $(i+2)==tagb) { print $(i+1) }; i++} }' ama.html` – Meh Oct 30 '12 at 19:26
  • It should just return the price in pounds for the graphics card. Or at least that is the intention. – Meh Oct 30 '12 at 19:26
  • The problem could also be the encoding: the wget you're using downloads the Amazon page in ISO-8859-1, while my terminal uses UTF-8 locales and sed expects UTF-8 input. You need to recode it; then at least my example works fine. Just note that the price is not in span but in ... – Kamil Šrot Oct 30 '12 at 19:48
  • @Capt.Morgan, there's no rule against enhancing your original question. Why not just restructure your question with sample data that covers your expected cases, the required output, and the code you've tried so far? Good luck. – shellter Oct 30 '12 at 23:06
0

Using grep -o:

echo "<span id="imAnID" hello>extractme</span> <span id='imAnID'>extractmetoo</span>" | grep -oE 'id=.?imAnID[^<>]*>[^<>]+' | cut -d'>' -f2

This prints:

#=>extractme
#=>extractmetoo

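This works as long as the span element carrying the desired id attribute comes immediately before the text to extract, on the same line. Note that in the echo above the shell strips the inner double quotes, so the first span reaches grep as id=imAnID; the `.?` in the pattern allows for a missing, single, or double quote.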
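
The same idea can be wrapped in a small function that reads HTML on stdin. This is a sketch under assumptions: extract_by_id is a hypothetical name, the id value is assumed to contain no regex metacharacters, and the tag and its text are assumed to sit on one line.

```shell
# Hypothetical helper: print the text that directly follows the
# tag whose id attribute is $1. Reads HTML on stdin.
# The id may be double-quoted, single-quoted, or unquoted.
extract_by_id() {
  grep -oE "id=[\"']?$1[\"']?[^<>]*>[^<>]+" | cut -d'>' -f2
}

# Sample from the question, passed with its quotes intact.
printf '%s\n' '<span id="imAnID">extractme</span>' | extract_by_id imAnID
```

With the file from the question this would be called as `extract_by_id imAnID < test.html`. The pattern is loose on purpose: it would also match an attribute such as data-id, or an id that merely starts with the given value, so it is only suitable when the ids in the page are known.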

Nik O'Lai
0

It is awkward to use awk, sed, or grep for this since these tools are line-based (they process one line at a time). Is it guaranteed that the span you are trying to extract is all on the same line? Is there any possibility of other tags being used within the span (e.g. em tags)? If the span can break across lines or contain other tags, this sounds like a job for perl.
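
To illustrate the line-based limitation while staying with the shell tools from the question rather than perl, one workaround is to join all lines before sed sees them. This is a rough sketch, assuming the span holds plain text only (no nested tags) and the id appears once; extract_span_text is a made-up name.

```shell
# Hypothetical helper: join stdin into a single line, then apply a
# sed expression like the one in the question. Newlines inside the
# span become spaces. [^<]* stops at the first '<', so nested tags
# inside the span are NOT handled.
extract_span_text() {
  { tr '\n' ' '; echo; } |
    sed -n 's/.*<span id="imAnID">\([^<]*\)<\/span>.*/\1/p'
}

# A span broken across two lines, which line-by-line sed would miss.
printf '%s\n' '<p><span id="imAnID">extract' 'me</span></p>' | extract_span_text
```

Joining the whole file into one line gets slow and fragile on large pages, which is part of why a real parser is usually the better tool.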

djhaskin987
0

awk, sed and grep are line-oriented tools. XML and HTML are based on tags. The two don't combine well, though you can get by with awk, sed and grep on XML and HTML by running the document through a pretty-printer first, so that tags land on predictable lines, and only then applying your line-oriented tools.
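
As a rough illustration of reformatting before using line-oriented tools, the sketch below starts a new line at every '<' so that each line holds one tag plus the text following it. It is a crude stand-in, not a real HTML pretty-printer, and one_tag_per_line is a hypothetical name.

```shell
# Hypothetical helper: flatten stdin to one line, then break it at
# every '<'. Each output line is "tagname attrs>text after the tag".
one_tag_per_line() {
  tr '\n' ' ' | tr '<' '\n'
}

# After splitting, the wanted line begins with the opening tag,
# so a simple anchored sed substitution is enough.
printf '%s\n' '<div><span id="imAnID">extractme</span><em>x</em></div>' |
  one_tag_per_line |
  sed -n 's/^span id="imAnID">//p'
```

This still assumes the attribute is written exactly as id="imAnID" with nothing else in the tag, and it would be confused by a literal '<' inside scripts or comments.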

There's a program called xmlgawk that is supposed to be quite gawk-like, while still working on XML.

I personally prefer to do this sort of thing in Python using the lxml module, so that the XML/HTML is fully parsed without the code getting too wordy.

user1277476