3

I want to write a bash script that finds a pattern spanning multiple lines in an HTML file.

File for regex:

<td class="content">
  some content
</td>
<td class="time">
  13.05.2013  17:51
</td>
<td class="author">
  A Name
</td>

Now I want to extract the content of the <td> tag with class="time".

So in principle the following regex:

<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>

grep does not seem to be the right command, because:

  1. It returns either the whole matching line or, with -o, the whole match, never just the part inside the round brackets (...).
  2. It only looks for a pattern within a single line.

So how can I get just the string 13.05.2013 17:51?

Sven Richter
    googled 'command line xml parser' and found http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash – Mike Makuch Sep 15 '13 at 02:08

3 Answers

2

It's not quite there: it prints a leading blank line, because the s command empties the opening-tag line and p still prints it. But maybe something like this?

$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file 

13.05.2013  17:51
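
One possible way to avoid the blank line, sketched under the assumption that the opening and closing tags sit on their own lines as in the question: delete both tag lines inside the range instead of emptying one of them, and trim the indentation before printing. The names sample and time_cell are only placeholders for this illustration.

```shell
# Sample input copied from the question; the temp file is only for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="content">
  some content
</td>
<td class="time">
  13.05.2013  17:51
</td>
<td class="author">
  A Name
</td>
EOF

# Inside the range from the opening tag to the next closing tag:
# drop both tag lines, strip leading whitespace, print what is left.
time_cell=$(sed -n '/<td class="time">/,/<\/td>/{
  /<td class="time">/d
  /<\/td>/d
  s/^[[:space:]]*//
  p
}' "$sample")

echo "$time_cell"
rm -f "$sample"
```

This still depends on the line layout shown in the question; a cell written on a single line would be deleted together with its tag.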

Inspired by https://stackoverflow.com/a/13023643/1076493

Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493

$ perl -0777 -ne 'print "$1\n" while /<td class="time">\n  (.*?)\n<\/td>/gs' regex.txt 
13.05.2013  17:51
Community
timss
0

How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:

$ sed -n '/<td *class="time">/{n;p}' test
  13.05.2013  17:51

You could add something to cover the case where the value is on the same line as the tag. Alternatively, pre-process the file to strip all the newlines and maybe collapse the whitespace too (tr is simpler than sed for this, because sed works line by line), and then go from there.
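
As a rough sketch of that pre-processing idea: tr can squeeze every run of whitespace, newlines included, into one space, after which an ordinary single-line sed substitution can pull out the cell. This assumes there is only one class="time" cell in the file and no nested markup inside it; the names sample, flat and flat_time are made up for the example. Note that the double space inside the date is collapsed as well.

```shell
# Sample input copied from the question; the temp file is only for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="content">
  some content
</td>
<td class="time">
  13.05.2013  17:51
</td>
<td class="author">
  A Name
</td>
EOF

# Turn every run of whitespace (newlines included) into a single space.
flat=$(tr -s '[:space:]' ' ' < "$sample")

# Capture the text between the opening tag and the next closing tag,
# leaving out the spaces around it.
flat_time=$(printf '%s\n' "$flat" |
  sed -n 's/.*<td class="time"> *\([^<]*[^< ]\) *<\/td>.*/\1/p')

echo "$flat_time"
rm -f "$sample"
```

Because the whole file ends up on one line, the same command also handles a cell whose tags and value were on a single line to begin with.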

However, if it's an HTML file from somewhere else and you can't be sure of the format I'd consider using some other scripting language that has a library to parse XML, otherwise any solution is liable to break when the format changes.

Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html

SpaceDog
0

Try:

awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file

or

awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file
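
For readers who want to see what the one-liners are doing, here is a longer sketch of the same idea with comments. It assumes the tag is written exactly as td class="time" with no other attributes; the names sample, time_cell and text are placeholders for this illustration.

```shell
# Sample input copied from the question; the temp file is only for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="content">
  some content
</td>
<td class="time">
  13.05.2013  17:51
</td>
<td class="author">
  A Name
</td>
EOF

# RS='<' makes every tag start a new record, and FS='>' splits each record
# into the tag text ($1) and whatever follows the tag ($2).
time_cell=$(awk -v RS='<' -v FS='>' '
  $1 == "td class=\"time\"" {
    text = $2
    # Trim the newlines and indentation around the value.
    gsub(/^[ \t\n]+|[ \t\n]+$/, "", text)
    print text
  }' "$sample")

echo "$time_cell"
rm -f "$sample"
```

The command substitution is also one way to get the value into a variable for the rest of a bash script.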
Scrutinizer