2

I am trying to print multiple patterns with sed.

Here's a typical string to process:

(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>

and I would like: (1.15)

For this, I tried:

sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'

but I get (1.)15</span>)</td></tr>

Can anyone see what's wrong?

Thanks

4 Answers

1

Unless you are Chuck Norris, don't use regex to parse HTML; instead, use a tool that supports XPath, such as xmllint. In 2014, it's a solved problem:

xmllint --html --xpath '//span[@class="arabic"]/text()' file_or_URL

See the famous question "RegEx match open tags except XHTML self-contained tags".

xmllint comes from the libxml2-utils package (on Debian and derivatives).

Gilles Quénot
0

If the data is always in the same position, awk may be a simpler solution than sed:

awk -F"[<>]" '{print "("$3"."$7")"}' file
(1.15)
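
To find out which field numbers to use, one option is to list every field together with its index. This is only a small sketch that assumes the sample line from the question; the variable names `line` and `fields` are made up for illustration.

```shell
line='(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'

# Split the line on < or > and print each field with its number.
fields=$(printf '%s\n' "$line" | awk -F'[<>]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"
```

With this input, the digits show up in fields 3 and 7, which is why the command above prints $3 and $7.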
Jotne
0

The reason you are getting (1.)15</span>)</td></tr> as your output:

sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
                                          ^^

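The two characters "> need to be placed before the second \([0-9]*\), because in your line "> comes immediately before the digits 15. As written, the .* after the first group is greedy and consumes everything up to the last ">, so the second \([0-9]*\) matches an empty string and the match ends at that ">, leaving 15</span>)</td></tr> untouched. With "> moved in front of the second group, sed can capture the digits.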

The correct sed command

sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
                              ^^    

The corrected command line:

echo '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'|sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'

Result of the command line above:

(1.15)
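
To see what each group actually captures, it can help to print \1 and \2 in brackets for both expressions. This is a minimal sketch assuming the sample line from the question; the brackets and the variable names are only there for display and are not part of the original commands.

```shell
line='(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'

# Original expression: the second group captures nothing, and the match
# stops at the last ">, so the text after it is left in place.
orig_out=$(printf '%s\n' "$line" | sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/[\1][\2]/')

# Corrected expression: "> sits directly in front of the second group.
fixed_out=$(printf '%s\n' "$line" | sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/[\1][\2]/')

printf '%s\n%s\n' "$orig_out" "$fixed_out"
```

The first line of output shows an empty second pair of brackets followed by the leftover 15</span>)</td></tr>; the second shows both numbers captured.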
repzero
  • Thanks for your help. Does ".*" mean all characters, including the > and " characters? In other words, does it include special characters? –  Dec 26 '14 at 17:08
  • .* matches any characters, including > and ", up to the pattern that follows. For example, .*"> matches all characters up to the pattern ">. – repzero Dec 26 '14 at 17:19
  • OK, so from what I have understood, I have to indicate the characters preceding the pattern, i.e. in my example, the characters "> ? –  Dec 27 '14 at 10:03
  • If all your lines are the same, you can type out each character up to ">. In this case I use .*"> because I don't know which characters come before the pattern ">, but I do know that "> appears on every line. As a result, I match all characters up to ">. – repzero Dec 27 '14 at 11:19
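
The behaviour of .* discussed in these comments can be checked on a short string that contains "> twice. This is only an illustrative sketch: the sample string is made up, and the [^"]* variant is shown as one possible alternative, not something proposed in the answer.

```shell
sample='a">b">c'

# .* is greedy: it runs up to the LAST "> on the line.
greedy=$(printf '%s\n' "$sample" | sed 's/.*">/X/')

# [^"]* cannot cross a double quote, so it stops at the FIRST ">.
bounded=$(printf '%s\n' "$sample" | sed 's/^[^"]*">/X/')

printf '%s\n%s\n' "$greedy" "$bounded"
```

Here the greedy form leaves only c after the X, while the bounded form leaves b">c.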
-1
$ lynx -dump -nomargins file.htm
(1.15)
Zombo