I have a dataframe similar to the following one:

import pandas as pd

df = pd.DataFrame({'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']})

I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).

I tried many convoluted solutions but I am not able to get what I need.

The result should look like this:

["12-13 December 2018",
 "11-14 October 2019",
 "10 January 2011"]

Can anyone help me?

Thanks!

Rollo99
  • So is the text guaranteed to always be `number[-number] MonthName YYYY`? –  Feb 01 '22 at 16:01
  • Given your example, `[0-9-]+ \w+ \d{4}` should work (without any check on the validity of the month names or numbers) – mozway Feb 01 '22 at 16:01
  • @SembeiNorimaki Yes the format is always that one – Rollo99 Feb 01 '22 at 16:03
  • @mozway wouldn't that regex give a false positive in the text "10 times 1000 was a big number in 10 January 2000" –  Feb 01 '22 at 16:05
  • Then just use a regex as mozway suggested. You can enhance the pattern by creating an OR-connected group of month names – wolfstter Feb 01 '22 at 16:06
  • @SembeiNorimaki yes, of course there are plenty of possible false positives (thus my remark), but you could hardcode the month names in the regex. Actually the best would be to use a parser – mozway Feb 01 '22 at 16:07
  • See https://stackoverflow.com/questions/2655476/regex-to-match-month-name-followed-by-year for info about enhancing your regex with month names (and their abbreviations) – wolfstter Feb 01 '22 at 16:08

2 Answers

If "keep 2/3 objects before a four-digit number (i.e. the year)" is a reliable rule, then you could use the following:

import re

data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}

date_strings = []
for string in data['Text']:     # loop through each string
    words = string.split()      # split the string on whitespace
    for index, word in enumerate(words):
        if re.search(r'\d{4}', word):     # the 'word' contains 4 consecutive digits
            # extract that word and the preceding 2 (fewer if it is near the start)
            date_strings.append(' '.join(words[max(index - 2, 0):index + 1]))
            break

print(date_strings)

To get:

['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']

Some assumptions:

  • the dates are always 3 'words' long
  • the years are always at the end of the dates
  • as pointed out in the comments, the only 4-digit number in the text is the year
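
Note that the output above still carries the punctuation attached to the year ('2019,' and '2011.'). A minimal sketch of a variant that strips it, under the same assumptions as above; the helper name `extract_date` and the set of stripped characters are my own choices for illustration, not part of the original code:

```python
import re

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]

def extract_date(text):
    """Return the first 'word' containing 4 digits plus the two words before it, or None."""
    words = text.split()
    for index, word in enumerate(words):
        if re.search(r'\d{4}', word):
            start = max(index - 2, 0)   # avoid a negative slice start near the beginning
            # strip punctuation such as ',' or '.' stuck to either end of the date
            return ' '.join(words[start:index + 1]).strip('.,;:!?')
    return None   # no 4-digit 'word' found

date_strings = [extract_date(text) for text in texts]
print(date_strings)
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```

Returning None when no year is found keeps the result list aligned with the input, which may be convenient if the values go back into a dataframe column.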
PangolinPaws
  • Can be enhanced using month names: https://stackoverflow.com/questions/2655476/regex-to-match-month-name-followed-by-year – wolfstter Feb 01 '22 at 16:09

Here is a potential solution using a regex:

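from calendar import month_name
# 'January|February|...|December' (month_name[0] is an empty string, so skip it)
months = '|'.join(list(month_name)[1:])
# capture "<day or day range> <month name> <year>"; [0] selects the single capture group
df['Text'].str.extract(r'([0-9-]+ (?:%s) \d{4})' % months)[0]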

Alternative regex, which only accepts a single day or a day range before the month: r'((?:\d+-)?\d+ (?:%s) \d{4})' % months

output:

0    12-13 December 2018
1     11-14 October 2019
2        10 January 2011
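
To illustrate why the month names are worth hardcoding, here is a small sketch that runs the looser pattern from the comments and the alternative regex above on the counterexample raised there. It uses plain `re` rather than pandas so it runs on its own; the variable names are illustrative, and it assumes `calendar.month_name` yields English names (it is locale-dependent):

```python
import re
from calendar import month_name

# assumes an English locale: 'January|February|...|December'
months = '|'.join(list(month_name)[1:])

loose = re.compile(r'[0-9-]+ \w+ \d{4}')                      # any word between day and year
strict = re.compile(r'(?:\d+-)?\d+ (?:%s) \d{4}' % months)    # only real month names

text = '10 times 1000 was a big number in 10 January 2000'

loose_match = loose.search(text).group(0)
strict_match = strict.search(text).group(0)

print(loose_match)    # 10 times 1000  (false positive)
print(strict_match)   # 10 January 2000
```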
mozway