I have an HTML file that contains a number of dates in dd/mm/yyyy format scattered throughout it. I am looking for a way to retrieve only specific dates from it.

input:

Released: 08/08/2019</td>
<td>06/26/2019</td>
Released: 03/09/2019</td>
<td>14/29/2019</td>

I found a way to retrieve all dates from the file:

grep -o "[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}"

output:

08/08/2019
06/26/2019
03/09/2019
14/29/2019

However, I need to filter these dates and pick only those that have this format:

<td>dd/mm/yyyy</td>

So from the above input, I need this output:

06/26/2019
14/29/2019
Inian
Bogdan
  • `grep -Po '<td>\K[0-9]{2}/[0-9]{2}/[0-9]{4}'` – oguz ismail Aug 22 '19 at 06:17
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Aug 22 '19 at 06:18
  • Hold on: Neither of those two represents a date in the specified format! It's like the 31st of February, those months don't exist. In either case, just read the manual: You can pipe the file into `grep` and tell it to exclude everything that matches a certain pattern. That said, HTML can have linebreaks in various locations, perhaps what you want is a tool that is capable of understanding HTML (or perhaps XML if that's what it is). – Ulrich Eckhardt Aug 22 '19 at 06:19
  • @oguzismail this is just what I was looking for – Bogdan Aug 22 '19 at 06:27
  • `dd/mm/yyyy` how do you get month `26` and `29`? How do you see if `02/03/2019` is `dd/mm/yyyy` or `mm/dd/yyyy`? – Jotne Aug 22 '19 at 10:47

2 Answers

I always recommend using an HTML/XML parser. If that is not possible, try GNU grep with a Perl-compatible regular expression (PCRE):

grep -Po '(?<=<td>)[0-9]{2}/[0-9]{2}/[0-9]{4}(?=</td>)' file

Output:

06/26/2019
14/29/2019
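
If `grep -P` is not available (it is a GNU extension and is not built into every grep), the same filtering can be approximated in two steps: match the date together with its surrounding tags, then strip the tags. The following is only a sketch under assumptions: the file name `sample.html` and the helper name `extract_td_dates` are made up for illustration, and `grep -o` itself is a common extension (GNU and BSD grep) rather than part of POSIX.

```shell
# Recreate the sample input from the question (the file name is an assumption).
cat > sample.html <<'EOF'
Released: 08/08/2019</td>
<td>06/26/2019</td>
Released: 03/09/2019</td>
<td>14/29/2019</td>
EOF

# Hypothetical helper:
# step 1 keeps only dates wrapped directly in <td>...</td>,
# step 2 strips the surrounding tags and leaves the bare date.
extract_td_dates() {
  grep -o '<td>[0-9]\{2\}/[0-9]\{2\}/[0-9]\{4\}</td>' "$1" | sed 's/<[^>]*>//g'
}

extract_td_dates sample.html
```

Like the PCRE version, this only checks the shape of the text, not whether the date is valid, and it will miss cells whose tags carry attributes or span several lines.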
Cyrus

This GNU awk command may work:

awk -F"</?td>" '/^<td>/{print $2}' file

Output:

06/26/2019
14/29/2019
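
The command above relies on `<td>` being the first thing on the line and prints whatever the cell contains. If several cells share one line, or a cell holds something other than a date, a loop over `match()` may be more robust. This is a rough sketch, not a replacement for a real HTML parser; the file name `cells.html` and the function name `td_dates` are invented for the example, and the digits are spelled out one by one because some awk implementations do not support `{n}` intervals.

```shell
# Hypothetical input with several cells on one line.
cat > cells.html <<'EOF'
<tr><td>06/26/2019</td><td>n/a</td><td>14/29/2019</td></tr>
Released: 08/08/2019</td>
EOF

# Hypothetical helper: print every NN/NN/NNNN that sits directly inside <td>...</td>.
td_dates() {
  awk '{
    line = $0
    # Repeatedly find the next <td>NN/NN/NNNN</td> in the rest of the line.
    while (match(line, /<td>[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]<\/td>/)) {
      # Skip the 4-character "<td>" prefix and take the 10-character date.
      print substr(line, RSTART + 4, 10)
      # Continue scanning after the cell that was just matched.
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$1"
}

td_dates cells.html
```

With this input the function prints the two dates found inside `<td>` cells and skips both the `n/a` cell and the `Released:` line.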
Jotne