1

I have a column in Pandas dataframe 'Comment Text' which contains the dates in this format(showing here, only first observation):

7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56

Adding Dataframe sample:

df = pd.DataFrame({'Common Text':['7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56', 'Adams Gill Christ  4 Oct 2017    02:52 PM', '4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34', '5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31', 'the call was made on 20 Jun 2017\nbut call not found on system', 'Call made on 7/03/2018 8:22:25 AM', 'Review is during 30 May to 1 March 2018']})

But when I did something like this:

import datefinder
FD = datefinder.find_dates(df['Comment Text'][0])

for dates in FD:
    print(dates)

I got the following result:

2018-07-09 11:59:37
2019-06-20 12:01:33
2019-06-20 00:01:56

Which is not correct as I was expecting only 2018-07-09 to return as result.

loveR
  • 489
  • 4
  • 12

1 Answers1

1

If I understand you correctly and your data always has the two date structures you've shown as example. You can use regex.

# Make example data
df = pd.DataFrame({'Common Text':['7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56', 
                                  'Adams Gill Christ  4 Oct 2017    02:52 PM', 
                                  '4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34', 
                                  '5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31', 
                                  'the call was made on 20 Jun 2017\nbut call not found on system']})

                                         Common Text
0         7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56
1          Adams Gill Christ  4 Oct 2017    02:52 PM
2          4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34
3        5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31
4  the call was made on 20 Jun 2017\nbut call not...

Use str.extract.

s = df['Common Text'].str.extract('(.+?(?=\s\d{1,2}:\d{2}:\d{2}))|(\d{1,2}\s[A-Za-z]{3}\s\d{4})')
df['Date'] = s[0].fillna(s[1])



                                         Common Text         Date
0         7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56    7/09/2018
1          Adams Gill Christ  4 Oct 2017    02:52 PM   4 Oct 2017
2          4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34    4/08/2017
3        5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31    5/07/2018
4  the call was made on 20 Jun 2017\nbut call not...  20 Jun 2017

Explanation:

  • (.+?(?=\s\d{1,2}:\d{2}:\d{2})): Extract everything before the pattern of time, which is 99:99:99
  • (\d{1,2}\s[A-Za-z]{3}\s\d{4}): Extract the pattern: one or two numbers, space, 3 letters, space, 4 numbers
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Then please extend your example data more with other examples. – Erfan Jun 20 '19 at 13:34
  • See edit, it captures the two patterns of your date. @loveR – Erfan Jun 20 '19 at 14:47
  • thanks. nice work. but I was wondering why it returned `NaN` for `Review Period from 29/10/2019` – loveR Jun 20 '19 at 15:15
  • I dont see `29/10/2019` in your example data? @loveR – Erfan Jun 20 '19 at 15:29
  • I tried this... `str.extract('((?:(?:[0-2]?\\d{1})|(?:[3][01]{1}))[-:\\/.](?:[0]?[1-9]|[1][012])[-:\\/.](?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])|(.+?(?=\s\d{1,2}:\d{2}:\d{2}))|(\d{1,2}\s[A-Za-z]{3}\s\d{4})')` – loveR Jun 20 '19 at 15:35
  • Please edit your example data so it includes the problem which my regular expression doesnt capture @loveR – Erfan Jun 20 '19 at 15:37
  • In the last line, what do you expect to extract here, `1 march 2018`? – Erfan Jun 20 '19 at 16:25