I have 1000 files like this:

text1.txt

<span class="store-time">OPEN SINCE <em>Aug 9, 2010</em></span>

text2.txt

<span class="store-time">OPEN SINCE <em>Aug 9, 2012</em></span>

I want to extract all the dates from the 1000 files, each one on a new line, like this:

Aug 9, 2010
Aug 9, 2012
...
H.Otmane

2 Answers

If you are certain that all your files have exactly this format, you can use a simple sed expression:

sed -E -e 's/^<span class="store-time">OPEN SINCE <em>([A-Z][a-z]+ *[0-9]+, *[0-9]+)<\/em><\/span>/\1/' 

It matches the opening tags at the start of the line, then something that looks like a date (a capitalized word, a space, a number, a comma, and another number), then the closing tags, and replaces the whole match with just the date.
Concatenate all your files with cat, pipe the result into sed, and you get the list of dates.
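
As a minimal sketch of that pipeline, the snippet below creates two sample files like the ones in the question and runs them through the sed expression. The temporary directory and the variable names workdir and dates are made up for illustration. The -n option and the trailing p flag are a defensive variation on the command: they print only the lines where the substitution succeeded, so any non-matching lines are dropped.

```shell
# Two sample files modeled on the question, in a throwaway directory
# (workdir is a hypothetical name used only for this sketch).
workdir=$(mktemp -d)
printf '%s\n' '<span class="store-time">OPEN SINCE <em>Aug 9, 2010</em></span>' > "$workdir/text1.txt"
printf '%s\n' '<span class="store-time">OPEN SINCE <em>Aug 9, 2012</em></span>' > "$workdir/text2.txt"

# Concatenate every file and keep only the captured date.
# -n together with the p flag prints only lines where the substitution matched.
dates=$(cat "$workdir"/*.txt |
  sed -n -E 's/^<span class="store-time">OPEN SINCE <em>([A-Z][a-z]+ *[0-9]+, *[0-9]+)<\/em><\/span>/\1/p')

printf '%s\n' "$dates"
```

With the two sample files this prints Aug 9, 2010 and Aug 9, 2012 on separate lines. For the real data, replace the glob with the path to your own files.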

But as pointed out in the comments, parsing XML with regular expressions can be problematic (see for instance "RegEx match open tags except XHTML self-contained tags"). If the XML tags are spread over several lines, the script will fail to extract the information, for instance with the following data:

<span class="store-time">
OPEN SINCE <em>Aug 9, 2012</em>
</span>

To deal with such situations, there are more powerful tools, such as the xmlstarlet collection of tools or Perl modules like XML::LibXML. They can parse the markup more robustly, but they are more complex to use.
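
As a rough sketch of the xmlstarlet route, the snippet below runs an XPath query against the multi-line sample. It assumes that xmlstarlet is installed under that name (some distributions ship the binary as xml) and that each file is well-formed XML, which real HTML pages often are not. The temporary directory and the variable names are made up for illustration.

```shell
# Sample file with the tags spread over several lines
# (workdir is a hypothetical name used only for this sketch).
workdir=$(mktemp -d)
cat > "$workdir/text2.txt" <<'EOF'
<span class="store-time">
OPEN SINCE <em>Aug 9, 2012</em>
</span>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # sel -t starts a template, -v prints the value of each XPath match,
  # and -n appends a newline after the output.
  dates=$(xmlstarlet sel -t -v '//span[@class="store-time"]/em' -n "$workdir/text2.txt")
  printf '%s\n' "$dates"
else
  echo "xmlstarlet is not installed" >&2
fi
```

Because the query works on the parsed tree rather than on lines, it does not matter where the line breaks fall inside the span element.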

If you are definitely sure that all your files have the proper formatting, the sed script can solve your problem.

Alain Merigot

Well, for parsing XML, tools such as awk or sed are certainly not the first choice, because they are line-based and XML isn't.

To get your job done in awk you could use something like:

awk '/<span class="store-time">/ {gsub(/^.*<em>/,""); gsub(/<\/em>.*/,""); print}' *.txt

This command reads all the .txt files (*.txt; adjust the glob if your files are named differently) and selects the lines containing <span class="store-time">. On each such line it replaces everything from the start of the line up to and including <em> with an empty string, does the same for </em> and everything after it, and prints what is left.
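
To try the awk approach on sample data, here is a small sketch. The files are recreated from the question in a temporary directory, and the names workdir and dates are made up for illustration. The two gsub calls are written as separate statements so that each one clearly runs on its own.

```shell
# Two sample files modeled on the question, in a throwaway directory
# (workdir is a hypothetical name used only for this sketch).
workdir=$(mktemp -d)
printf '%s\n' '<span class="store-time">OPEN SINCE <em>Aug 9, 2010</em></span>' > "$workdir/text1.txt"
printf '%s\n' '<span class="store-time">OPEN SINCE <em>Aug 9, 2012</em></span>' > "$workdir/text2.txt"

dates=$(awk '/<span class="store-time">/ {
  gsub(/^.*<em>/, "")     # drop everything up to and including <em>
  gsub(/<\/em>.*/, "")    # drop </em> and everything after it
  print
}' "$workdir"/*.txt)

printf '%s\n' "$dates"
```

With the two sample files this prints Aug 9, 2010 and Aug 9, 2012 on separate lines. Like the sed version, it only works when <em> and </em> sit on the same line.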

F. Knorr