Using a grep regex to pull out text from multiple lines of text in a file

Question

I have a chunk of text in a file:

<tr bgcolor="#F9F9F9">
     <td align="left">8/7/2012 11:23:42 AM</td>
     <td align="left"><em>Here is the text I want to parse out</em></td>
     <td class="ra">9.00</td>
     <td class="ra">297.00</td>
     <td class="ra">0.00</td>
     <td class="ra">0.00</td>
     <td class="ra">$0.00</td>
     <td class="ra">$0.50</td>
     <td class="ra"></td>
 </tr>

Using grep, I would like to end up with this result:

Here is the text I want to parse out

This is the code I have so far:

cat file.txt | grep -m 1 -oP '<em>[^</em>]*'

but that does not work. Thanks for your help!

`cat file.txt | grep ...` can be simplified to `grep ... file.txt`. — Asaph, Aug 07 '12 at 17:13
Do note that while what you want to do is possible (as demonstrated in answers below), [regex is generally not the right tool to parse XML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). For more robust solutions, use tools such as [xmlstarlet](http://xmlstar.sourceforge.net/) or a language that gives you access to a proper XML parser. — Shawn Chin, Aug 07 '12 at 17:20

Lev Levitsky · Accepted Answer · 2012-08-07T18:20:40.543

4

A correct regex would be (?<=<em>).*?(?=</em>).

So, try:

grep -m 1 -oP '(?<=<em>).*?(?=</em>)' file.txt
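
To see why the attempt in the question fails and what the lookarounds change, here is a minimal sketch. It assumes GNU grep built with PCRE support (the `-P` option); the file name `sample.txt` and its shortened contents are made up for illustration. The key point is that `[^</em>]` is a character class: it rejects each of the characters `<`, `/`, `e`, `m` and `>` individually, not the string `</em>`.

```shell
# Hypothetical sample file (assumption: a shortened version of the row in the question).
cat > sample.txt <<'EOF'
<tr bgcolor="#F9F9F9">
     <td align="left">8/7/2012 11:23:42 AM</td>
     <td align="left"><em>Here is the text I want to parse out</em></td>
</tr>
EOF

# Original attempt: the negated character class stops at the first 'e' of "Here",
# and the opening <em> tag is part of the match.
attempt=$(grep -m 1 -oP '<em>[^</em>]*' sample.txt)

# Lookbehind/lookahead: <em> and </em> must surround the match
# but are not included in what -o prints.
fixed=$(grep -m 1 -oP '(?<=<em>).*?(?=</em>)' sample.txt)

echo "attempt: $attempt"
echo "fixed:   $fixed"
```

With this input, the first command should print only `<em>H`, while the second prints the text between the tags.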

edited Aug 07 '12 at 18:20

answered Aug 07 '12 at 17:17


That gives me this: Here is the text I want to parse out 9.00 297.00 0.00 0.00 $0.00 $0.50 – Greg Alexander Aug 07 '12 at 18:13
OK, so what it is doing is going to the last `</em>`, which is in another text block below it... should have mentioned that, so I need the end to be the first occurrence of `</em>`... make sense? – Greg Alexander Aug 07 '12 at 18:19
@GregAlexander That could happen if the XML was all in one line, rather than nicely formatted as you show. Try to add a `?` after `*` as I did in the edit. – Lev Levitsky Aug 07 '12 at 18:22
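
The difference the `?` makes can be sketched on a single-line input. This again assumes GNU grep with `-P`; the file `oneline.txt` and its contents are hypothetical. Note that `-m 1` limits the number of matching lines, not the number of matches per line, so the sketch uses `head -n 1` to keep only the first match.

```shell
# Hypothetical one-line input containing two <em> elements (assumption for illustration).
printf '%s\n' '<td><em>first</em></td><td>9.00</td><td><em>second</em></td>' > oneline.txt

# Greedy: .* extends as far as possible, so the match runs to the last </em> on the line.
greedy=$(grep -m 1 -oP '(?<=<em>).*(?=</em>)' oneline.txt)

# Lazy: .*? stops at the first </em>. -o still prints every match on the line,
# so head keeps only the first one.
lazy=$(grep -m 1 -oP '(?<=<em>).*?(?=</em>)' oneline.txt | head -n 1)

echo "greedy: $greedy"
echo "lazy:   $lazy"
```

Here the greedy pattern swallows everything between the first `<em>` and the last `</em>`, while the lazy pattern yields just `first`.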
