Extract value between $ and < in line of text using sed or alternative function

Question

I'm trying to extract the price from this line:

<div class="bpi-value bpiUSD">$634.17</div>

I would like to output:

634.17

I've tried:

sed -n "/$/,/</p"

I hoped this would extract everything between the $ and the <, but it isn't working. I suspect the dollar sign is being interpreted as a variable or something else. What would be the best way of doing this?

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Oct 11 '16 at 16:42
with grep and pcre `echo '<div class="bpi-value bpiUSD">$634.17</div>' | grep -oP '\$\K[^<]+'` — Sundeep, Oct 12 '16 at 09:52

Charles Duffy · Answer 1 · 2016-10-11T17:08:47.833

The Right Way to extract content from markup languages is using syntax-aware tools:

read -r var < <(xmlstarlet sel -t -m '//div[@class="bpi-value bpiUSD"]' -v . <in.xhtml)
var=${var#'$'} # strip leading $

However, if you must, and you're processing only a single line, use bash's built-in string manipulation primitives rather than paying the startup cost of an external tool such as sed:

line='<div class="bpi-value bpiUSD">$634.17</div>'
var=${line#*$}   # delete everything up to and including the first $
var=${var%%'<'*} # delete everything from the first remaining < onward
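
As a rough sketch, the two expansions above could be wrapped in a small function. The name `extract_price` and the no-dollar-sign check are my own additions for illustration, not part of the original answer:

```shell
# extract_price: hypothetical helper name, not part of the answer above.
# Prints the text between the first $ and the next < in its argument.
extract_price() {
  case $1 in
    *'$'*) ;;           # the line contains a dollar sign, carry on
    *) return 1 ;;      # no dollar sign: nothing to extract
  esac
  # price is left global here; in bash you could declare it with `local`.
  price=${1#*'$'}       # delete everything up to and including the first $
  price=${price%%'<'*}  # delete everything from the first remaining < onward
  printf '%s\n' "$price"
}

extract_price '<div class="bpi-value bpiUSD">$634.17</div>'   # prints 634.17
```

Quoting the dollar sign as `'$'` inside the expansion is optional here, but it makes it obvious that a literal character is meant.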

See also:

The bash-hackers page on parameter expansion (the specific string-manipulation syntax used above).
The Wooledge BashGuide on parameter expansion

score 1 · Accepted Answer · answered Oct 11 '16 at 16:53

sed works with regular expressions, where an unescaped '$' means "end of line". The shortest sed command that will work (assuming your lines are well behaved) is:

$ echo '<div class="bpi-value bpiUSD">$634.17</div>' | sed 's/.*\$\(.*\)<.*/\1/'
634.17

b-jazz

Is the backslash in `\$` needed? – Walter A Oct 11 '16 at 21:52
Yes, you have to escape the '$' or else it is treated like the end of line. – b-jazz Oct 13 '16 at 04:09
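
A quick way to see the difference for yourself is to run both forms on a throwaway string. This is only an illustration with a made-up input (`a$b`), using a pattern that consists of nothing but the dollar sign:

```shell
# Unescaped $ as the whole pattern anchors at end of line, so X is appended.
printf '%s\n' 'a$b' | sed 's/$/X/'     # prints a$bX

# Escaped \$ matches the literal dollar sign, so it is replaced.
printf '%s\n' 'a$b' | sed 's/\$/X/'    # prints aXb
```

Single quotes around the sed script matter too: inside double quotes the shell gets a chance to interpret `$` before sed ever sees it.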

score 0 · Answer 3 · answered Oct 11 '16 at 16:44

I agree with Biffen. However, if your lines are fixed-format,

sed 's/^[^$]\+\(\$[0-9.]\{1,\}\).*$/\1/' <input filename>

should do it. It skips to the $ (`\$` in sed), keeps the $ followed by digits or periods (`\$[0-9.]\{1,\}`), and then clears out to the end. Tested on GNU sed 4.2.2 in bash.

(fixed) the first version of this answer didn't have enough backslashes.

cxw

Being able to access ERE features in BRE by adding backslashes is a GNU extension. Might want to stick to pure BRE if portability is a goal, or use `sed -r` (GNU) / `sed -E` (BSD) and use ERE explicitly. – Charles Duffy Oct 11 '16 at 17:02
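
For illustration, here is one way the same substitution might be written without the GNU-only `\+`, once in plain BRE and once in ERE. The function names `price_bre` and `price_ere` are hypothetical, and like the answer above both variants keep the leading $:

```shell
# Both helpers read lines on stdin; the names are made up for this sketch.

# Plain BRE: "one or more" is spelled out as X followed by X*.
price_bre() {
  sed 's/^[^$][^$]*\(\$[0-9.][0-9.]*\).*$/\1/'
}

# ERE via -E: + and unescaped ( ) are available directly.
price_ere() {
  sed -E 's/^[^$]+(\$[0-9.]+).*$/\1/'
}

line='<div class="bpi-value bpiUSD">$634.17</div>'
printf '%s\n' "$line" | price_bre   # prints $634.17
printf '%s\n' "$line" | price_ere   # prints $634.17
```

Recent GNU sed versions accept `-E` as well as `-r`, so the ERE form should not need a per-platform switch there; on older or less common sed implementations that is an assumption worth checking.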
