Extract text from between HTML tags with a specific id using sed or grep

Question

What command should I use to extract the text from inside the following HTML, which sits in a file named "test.html": <span id="imAnID">extractme</span>?

The real file will be larger, so I need to point grep or sed at an id and have it extract only the text of the tag that has this id. Running the terminal from the directory where the file resides, I am doing this:

cat test.html | sed -n 's/.*<span id="imAnID">\(.*\)<\/span>.*/\1/p'

What am I doing wrong? I get an empty output... Not opposed to using grep for this if it's easier.

Yes, but the final file has other HTML code in it, and at that point the above command yields nothing... — Meh, Oct 30 '12 at 19:06
Just a shot in the dark, but maybe you're trying to match a regexp across multiple lines? Try prepending N; to your sed script, like 'N;s/.* — Kamil Šrot, Oct 30 '12 at 19:33

score 0 · Answer 1 · answered Oct 30 '12 at 19:09

You can try doing it with awk instead:

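  #!/bin/bash
  # Reads HTML on stdin and prints the text that sits directly between
  # <$start_tag> and <$end_tag> when both are on the same line.

  start_tag="span id=\"imAnID\""
  end_tag="/span"

  # Split each line on '<' and '>' so tags and text become separate fields.
  awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{
    for (i = 1; i <= NF; i++)
      if ($i == taga && $(i+2) == tagb) print $(i+1)
  }'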

Run it like this:

$ ./script < infile > outfile
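
As a quick check of how the field splitting behaves, here is a minimal sketch that wraps the same awk loop in a shell function and feeds it two sample lines. The function name extract_exact and the sample HTML are made up for illustration. The second call shows a limitation: the opening tag is compared as an exact string, so any extra attribute prevents a match.

```shell
#!/bin/bash
# extract_exact is a hypothetical wrapper around the awk loop.
# $1 = opening tag text without angle brackets, $2 = closing tag text.
extract_exact() {
  awk -F'[<>]' -v taga="$1" -v tagb="$2" '{
    for (i = 1; i <= NF; i++)
      if ($i == taga && $(i+2) == tagb) print $(i+1)
  }'
}

# Other tags on the same line are fine: each tag is its own field.
printf '%s\n' '<p>intro</p><span id="imAnID">extractme</span>' \
  | extract_exact 'span id="imAnID"' '/span'
# prints: extractme

# An extra attribute changes the field, so the exact comparison fails.
printf '%s\n' '<span id="imAnID" class="x">extractme</span>' \
  | extract_exact 'span id="imAnID"' '/span'
# prints nothing
```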

sampson-chen

I found a good example of when the script does not work. Could you test it and let me know whether it works for you? That would be superb. I think it may be the way the page is structured. I used a URL from Amazon as an example; here is the script: `#!/bin/bash wget -q -O ama.html "http://www.amazon.co.uk/Asus-GTX660-TI-DC2OG-2GD5-Borderlands-PCI-Express/dp/B008X36NHA/ref=sr_1_1?ie=UTF8&qid=1351617700&sr=8-1" start_tag="span id=\"priceLarge\"" end_tag="/span" awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{ i=1; while (i<=NF) { if ($(i)==taga && $(i+2)==tagb) { print $(i+1) }; i++} }' ama.html` – Meh Oct 30 '12 at 19:26
It should just return the price in pounds for the graphics card. Or at least that is the intention. – Meh Oct 30 '12 at 19:26
The problem could also be the encoding: the wget you're using downloads the Amazon page as ISO-8859-1, while my terminal uses a UTF-8 locale and sed expects UTF-8 input. You need to recode it; then at least my example works fine. The price is just not in a span but in ... – Kamil Šrot Oct 30 '12 at 19:48
@Capt.Morgan, there's no rule against enhancing your original question. Why not restructure your question with sample data that covers your expected cases, the required output, and the code you've tried so far? Good luck. – shellter Oct 30 '12 at 23:06

Nik O'Lai · Answer 2 · 2014-10-20T15:01:20.357

0

Using grep -o:

echo "<span id=\"imAnID\" hello>extractme</span> <span id='imAnID'>extractmetoo</span>" | grep -oE 'id=.?imAnID[^<>]*>[^<>]+' | cut -d'>' -f2

This prints:

#=>extractme
#=>extractmetoo

This works as long as the text to extract comes immediately after the opening tag that carries the desired id attribute, with no other tag in between.
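
To make that condition concrete, here is a small sketch that puts the same grep and cut pipeline behind a function taking the id as an argument. The name text_by_id and the sample input are assumptions for illustration only. Note that the pattern matches any element carrying the id, not only span, and that it would also match a longer id that merely starts with the given string.

```shell
#!/bin/bash
# text_by_id is a hypothetical wrapper; $1 is the id, input comes from stdin.
text_by_id() {
  grep -oE "id=.?${1}[^<>]*>[^<>]+" | cut -d'>' -f2
}

text_by_id imAnID <<'EOF'
<span id="imAnID" hello>extractme</span> <span id='imAnID'>extractmetoo</span>
<span id="imAnID"><em>nested</em></span>
EOF
# prints:
# extractme
# extractmetoo
# The third span yields nothing because a tag, not text, follows its opening tag.
```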

edited Oct 20 '14 at 15:01

answered Oct 30 '12 at 21:43

Nik O'Lai

OK, and how would one use sed to replace, rather than extract, the text inside a tag with a given id? – dani 'SO learn value newbies' Dec 01 '21 at 19:55

score 0 · Answer 3 · answered Oct 30 '12 at 21:46

awk, sed, and grep are awkward for this because they process their input one line at a time. Is the span you are trying to extract guaranteed to sit entirely on one line? Could other tags (e.g. em tags) appear inside the span? If you cannot rule out either case, this sounds like a job for perl.

score 0 · Answer 4 · answered Oct 30 '12 at 22:35

awk, sed and grep are line-oriented tools. XML and HTML are based on tags. The two don't combine that well, though you can get by with awk, sed and grep on XML and HTML by using a pretty formatter on the XML or HTML before resorting to your line-oriented tools.
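
The idea of reformatting first can be sketched with standard tools only. The function below is a crude stand-in for a real pretty formatter, not a replacement for one: it joins all lines and then starts a new line before every "<", so each tag begins its own line. The helper names and the sample input are assumptions for illustration.

```shell
#!/bin/bash
# Crude stand-in for a pretty formatter (hypothetical helper name):
# join all input lines, then start a new line before every "<".
one_tag_per_line() {
  tr '\n' ' ' | sed 's/</\
</g'
}

# After normalizing, the opening tag and its text share one line, so a
# line-oriented sed can pick the text out and trim surrounding spaces.
extract_normalized() {
  one_tag_per_line |
    sed -n 's/^<span id="imAnID">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p'
}

extract_normalized <<'EOF'
<p>Price:
<span id="imAnID">
extractme
</span></p>
EOF
# prints: extractme
```

Line breaks inside the text are turned into spaces by this approach, which is usually acceptable for short values such as a price.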

There's a program called xmlgawk that is supposed to be quite gawk-like, while still working on XML.

I personally prefer to do this sort of thing in Python using the lxml module, so that the XML/HTML is properly parsed without the code getting too wordy.
