If you are certain that your files all have exactly this format, you can use a simple sed expression:
sed -E -e 's/^<span class="store-time">OPEN SINCE <em>([A-Z][a-z]+ *[0-9]+, *[0-9]+)<\/em><\/span>/\1/'
It matches the opening tags at the start of the line, then something that looks like a date (a capitalized word, an optional space, a number, a comma, and another number), then the closing tags, and it replaces the whole match with just the date.
Run cat on all your files and pipe the result into sed, and you get the list of dates.
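As a minimal sketch of that pipeline: the two sample files below are hypothetical stand-ins for the real ones. Plain sed prints lines that do not match unchanged, so this variant adds -n and the p flag so that only the extracted dates are printed. That is an addition to the expression above, not part of it.

```shell
# Hypothetical sample files standing in for the real ones.
tmpdir=$(mktemp -d)
printf '%s\n' '<span class="store-time">OPEN SINCE <em>Aug 9, 2012</em></span>' > "$tmpdir/a.html"
printf '%s\n' '<p>unrelated line</p>' \
  '<span class="store-time">OPEN SINCE <em>Mar 15, 2010</em></span>' > "$tmpdir/b.html"

# -n suppresses sed's default output; the trailing p prints only the
# lines on which the substitution actually happened.
dates=$(cat "$tmpdir"/*.html |
  sed -n -E -e 's/^<span class="store-time">OPEN SINCE <em>([A-Z][a-z]+ *[0-9]+, *[0-9]+)<\/em><\/span>/\1/p')

printf '%s\n' "$dates"
rm -rf "$tmpdir"
```

With these two files, the unrelated line is dropped and one date per matching line is printed.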
But as pointed out in the comments, parsing XML files with regular expressions can be problematic (see for instance "RegEx match open tags except XHTML self-contained tags"). If the XML tags are spread over several lines, the script will fail to extract the information, for instance with the following data:
<span class="store-time">
OPEN SINCE <em>Aug 9, 2012</em>
</span>
To deal with such situations, there are more powerful tools, such as the xmlstarlet collection of tools or a Perl module like XML::LibXML. They are able to perform more robust parsing, but they are more complex to use.
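For illustration, here is a rough sketch of the xmlstarlet route on the multi-line data shown earlier. It assumes the input is well-formed XML (real HTML pages may need cleaning up first) and that the binary is installed under the name xmlstarlet; some distributions install it as xml. The sample file is hypothetical.

```shell
# Hypothetical sample file reproducing the multi-line layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<span class="store-time">
OPEN SINCE <em>Aug 9, 2012</em>
</span>
EOF

# Assumption: the tool is available as "xmlstarlet".
if command -v xmlstarlet >/dev/null 2>&1; then
  # -t starts a template, -m matches each <em> inside the span,
  # -v . prints its text content, -n appends a newline.
  parsed=$(xmlstarlet sel -t -m '//span[@class="store-time"]/em' -v . -n "$sample")
  printf '%s\n' "$parsed"
else
  echo "xmlstarlet is not installed" >&2
fi
rm -f "$sample"
```

Because the XPath expression selects the em element itself, it does not matter how the surrounding tags are split across lines.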
If you are definitely sure that all your files have the proper formatting, the sed script can solve your problem.