I'm trying to find and extract the dates and times from a column that contains free-text sentences. Example data is below.

import pandas as pd

data = {'Id': ['001', '002', ...],
        'Description': ['THERE IS AN INTERUPTION/FAILURE @ 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                        'today is 23/3/2013 @10:AM we have', ...],
        # ....
        }

df = pd.DataFrame(data, columns=['Id', 'Description'])

I tried the datefinder library as shown below, but it returns today's date, which is wrong.

import datefinder as dtf

# find_dates returns a generator of datetime objects found in the text.
findDate = dtf.find_dates(df['Description'][0])
for dates in findDate:
    print(dates)

Does anyone know the best way to extract them and automatically put them into a new column? Or does anyone know of a library that can calculate the duration between times and dates found in a text string? Thank you.

Anurag Dabas
Aqilah

1 Answer

You have two separate problems here.

  1. You want to know how to apply a function to a DataFrame column.
  2. You want a function that extracts a pattern from free text.

Here is how to apply a function to a Series (selecting a single column, as I did, gives you a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!

def do_something(x):
    # Placeholder: replace with your own logic and return the new value.
    return x

df['new_text_column'] = df['original_text_column'].apply(do_something)

And here is one way to extract patterns from a string using regexes. Read the regex docs (or follow a course) and play around with RegExr to become an omniscient god (that is, if you also use a command line on Linux along with your regex knowledge).

Modified from: How to extract the substring between two markers?

import re

text = 'gfgfdAAA1234ZZZuijjk'
# Search for the first run of digits.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
# found: 1234
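
For the data in the question, here is a rough sketch of how the same idea could pull every date and time out of one description. The two patterns and the helper name extract_dates_and_times are my own assumptions based only on the two sample rows (dates like 27.1.2020 or 23/3/2013, times like 9.6AM or 10:AM), so expect to adjust them for other formats.

```python
import re

# Assumed date formats: day.month.year or day/month/year with a 4-digit year.
DATE_RE = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")
# Assumed time formats: 9.6AM, 10.30AM, 10:AM, 10 am (minutes are optional).
TIME_RE = re.compile(r"\b\d{1,2}(?:[.:]\d{0,2})?\s*[AP]M\b", re.IGNORECASE)


def extract_dates_and_times(text):
    """Return (dates, times) as two lists of the matched substrings."""
    # findall returns every non-overlapping match, not just the first one.
    dates = DATE_RE.findall(text)
    times = TIME_RE.findall(text)
    return dates, times


sample = (
    "THERE IS AN INTERUPTION/FAILURE @ 9.6AM ON 27.1.2020 FOR JB BRANCH. "
    "THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM"
)
dates, times = extract_dates_and_times(sample)
print(dates)  # ['27.1.2020']
print(times)  # ['9.6AM', '10.30AM', '10.45AM']
```

To fill new columns, something like df['Description'].apply(lambda s: extract_dates_and_times(s)[0]) should give the dates for each row, and index [1] the times. The results are still plain strings, so computing a duration would need a further parsing step (for example with datetime.strptime) once you have settled on how to read values such as 9.6AM.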
Florian Fasmeyer
  • Thank you for your help but I need to extract out all of the dates and times from the bunch of text. It can contain 2 dates and 2 clock time in the text. If you know any library that can automatically detect and find it, let me know, Thanks! appreciate it – Aqilah Mar 18 '21 at 01:24
  • Oh then, everything is already in "re" a.k.a. Regex. As you can see I use a ".group(0)" function to get my search result. The good news is that you can iterate over all of your groups to get everything that was found. Try to do it, if you can't, take 2 min to read the doc and if it takes any longer, there should already be (99.9% sure) an answer about "how to find multiple patterns in one search" or something like it... Good luck with your endeavours! :) – Florian Fasmeyer Mar 18 '21 at 18:58
  • Okay will try! Thank you :) – Aqilah Mar 19 '21 at 01:09