1

I'm trying to extract the price from this line:

<div class="bpi-value bpiUSD">$634.17</div>

I would like to output:

634.17

I've tried:

sed -n "/$/,/</p"

I was hoping this would extract everything between the $ and the <, but it isn't working. I suspect the dollar sign is being interpreted as a variable or as something else special. What would be the best way of doing this?

treetop

3 Answers

3

The Right Way to extract content from markup languages is using syntax-aware tools:

read -r var < <(xmlstarlet sel -t -m '//div[@class="bpi-value bpiUSD"]' -v . <in.xhtml)
var=${var#'$'} # strip leading $

However, if you must, and you're processing only a single line, use bash's built-in string manipulation (parameter expansion) rather than paying the startup cost of an external tool such as sed:

line='<div class="bpi-value bpiUSD">$634.17</div>'
var=${line#*'$'} # delete everything up to and including the first $
var=${var%%'<'*} # delete the first remaining < and everything after it
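
As a minimal sketch, the two expansions above can be wrapped in a function so they can be reused on several lines. The name extract_price is hypothetical and not part of the original answer, and the sketch assumes each input line contains exactly one $ followed by the price and then a <:

```shell
# extract_price is a hypothetical helper name used only for illustration.
extract_price() {
  local line price
  line=$1
  price=${line#*'$'}     # drop everything up to and including the first $
  price=${price%%'<'*}   # drop the first remaining < and everything after it
  printf '%s\n' "$price"
}

extract_price '<div class="bpi-value bpiUSD">$634.17</div>'
```

No external process is started here, which is the point of preferring parameter expansion over sed for a single line.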


Charles Duffy
1

sed works with regular expressions, in which an unescaped '$' means "end of line"; to match a literal dollar sign you have to escape it as \$. The shortest sed line that will work (assuming your lines are well behaved) is:

$ echo '<div class="bpi-value bpiUSD">$634.17</div>' | sed 's/.*\$\(.*\)<.*/\1/'
634.17
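
To illustrate why the command in the question prints the whole line, here is a rough sketch comparing it with an escaped substitution. The variable names line, whole and price are just for illustration, and the second command assumes the price consists only of digits and periods:

```shell
line='<div class="bpi-value bpiUSD">$634.17</div>'

# /$/ matches the end of every line, and an address range selects whole
# lines rather than parts of a line, so the input comes back unchanged.
whole=$(printf '%s\n' "$line" | sed -n '/$/,/</p')

# Escaping the $ makes it literal; s///p prints only the captured price.
price=$(printf '%s\n' "$line" | sed -n 's/.*\$\([0-9.]*\)<.*/\1/p')

printf '%s\n' "$whole" "$price"
```

With -n and the p flag, lines that contain no price are not printed at all, which may or may not be what you want.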
b-jazz
0

I agree with Biffen. However, if your lines are fixed-format,

sed 's/^[^$]\+\(\$[0-9.]\{1,\}\).*$/\1/' <input filename>

should do it. It skips to the $ (\$ in sed), keeps the $ followed by digits or periods (\(\$[0-9.]\{1,\}\)), and then clears out to the end. Tested on GNU sed 4.2.2 in bash.

(Fixed: the first version of this answer didn't have enough backslashes.)

cxw
  • 1
    Being able to access ERE features in BRE by adding backslashes is a GNU extension. Might want to stick to pure BRE if portability is a goal, or use `sed -r` (GNU) / `sed -E` (BSD) and use ERE explicitly. – Charles Duffy Oct 11 '16 at 17:02
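
A sketch of what the portability advice in that comment could look like in practice. Two assumptions that differ from the answer's command: the \$ is moved outside the capture group so the output is 634.17 without the dollar sign (as the question asked), and the variable names line, bre and ere are made up for illustration:

```shell
line='<div class="bpi-value bpiUSD">$634.17</div>'

# Pure BRE: \{1,\} is a standard interval expression, used here in place
# of the GNU-only \+ for "one or more".
bre=$(printf '%s\n' "$line" | sed 's/^[^$]\{1,\}\$\([0-9.]\{1,\}\).*$/\1/')

# ERE via -E: + and ( ) need no backslashes.
ere=$(printf '%s\n' "$line" | sed -E 's/^[^$]+\$([0-9.]+).*$/\1/')

printf '%s\n' "$bre" "$ere"
```

Inside a bracket expression such as [^$] the dollar sign is always literal, so it needs no backslash there in either dialect.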