I have a column of paragraphs containing details such as dates and comments, and I need to extract the date into a separate column. A sample entry from that column looks like this:

'Story\nFAQ\nUpdates 2\nComments 35\nby Antaio Inc\nMar 11, 2019 • 3:26AM\n2 years ago\nThank you all for an amazing start!\nHi all,\nWe just want to thank you all for an awesome start! This is our first ever Indiegogo campaign and we are very grateful for your support that helped us achieve a successful campaign.\nIn the next little while, we will be dedicating our effort on production and shipping of the awesome A-Buds and A-Buds SE. We plan to ship them to you as promised in the coming month.\nWe will send out more updates as we are approaching the key production dates.\nStay tuned!\nBest regards,\nAntaio Team\nby Antaio Inc\nJan 31, 2019 • 5:15AM\nover 2 years ago\nPre-Production Update\nDear all,\nWe want to take this opportunity to thank all of you for being our early backers. You guys rock! :)\nAs you may have noticed, the A-Buds are already in production stage, which means we have already completed all development and testing, and are now working on pre-production. Not only will you receive fully tested and certified awesome A-Buds after the campaign, we are also giving you the promise to deliver them on time! We are truly excited to have these awesome true Bluetooth 5.0 earbuds in your hands. We are sure you will love them!\nSo here is a quick sneak peek:\nMore to come. Stay tuned! :)\nFrom: Antaio Team\nRead More'

Text like this appears in every row of the dataset, in a column called 'Project_Updates_Description'. I am trying to extract the first date in each entry.

The code I'm using so far is:

for i in df['Project_Updates_Description']:
    if type(i) == str:
        print(count)
        word = i.split('\n',7)
        count+=1
        if len(word) > 5:
            print(word[5])
            df['Date'] = word[5]

I have two issues. First, the date I extract from the paragraph is a string, and I need it in dd/mm/yyyy format; I tried methods like strptime, but that didn't work and the value is still stored as a string. Second, when I try to put it in the new 'Date' column, I keep getting the same date for every entry. Could someone tell me where I am going wrong?

Anurag Dabas

1 Answer

Assuming you have a dataframe with a column named 'Project_Updates_Description' that contains the example text, and you want to extract the first date and convert it to a datetime stamp, you can do the following:

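import re

import numpy as np
import pandas as pd

def findDate(txin):
    # Match a date such as 'Mar 11, 2019' at the start of a line
    schptrn = r'^\w+ \d{1,2}, \d{4}'
    # Non-string cells (e.g. NaN) contain no date to extract
    if not isinstance(txin, str):
        return np.nan
    for line in txin.split('\n'):
        # re.findall returns a list of matches, empty when the line has no date
        data = re.findall(schptrn, line)
        if data:
            # Take the first match and convert it to a Timestamp
            return pd.to_datetime(data[0])
    return np.nan

df['date'] = df['Project_Updates_Description'].apply(findDate)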
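As a side note on the loop in the question: assigning a single value to df['Date'] sets that value for every row, so each iteration overwrites the whole column, which is why all entries ended up with the same date. A vectorized alternative might look like the sketch below. It assumes the dates always start a line and use a three-letter month abbreviation, as in the sample text; the small dataframe and the 'date_ddmmyyyy' column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataset
df = pd.DataFrame({
    'Project_Updates_Description': [
        'Story\nFAQ\nUpdates 2\nComments 35\nby Antaio Inc\nMar 11, 2019 • 3:26AM\n2 years ago',
        'Story\nFAQ\nby Someone\nJan 31, 2019 • 5:15AM\nover 2 years ago',
        None,  # a row without text yields a missing date
    ]
})

# (?m) lets ^ match at the start of every line; str.extract keeps the first match per row
pattern = r'(?m)^([A-Z][a-z]{2} \d{1,2}, \d{4})'
first_date = df['Project_Updates_Description'].str.extract(pattern, expand=False)

# Parse into real datetimes; missing or unparseable values become NaT
df['date'] = pd.to_datetime(first_date, format='%b %d, %Y', errors='coerce')

# Separate string column in dd/mm/yyyy form, intended for display only
df['date_ddmmyyyy'] = df['date'].dt.strftime('%d/%m/%Y')

print(df[['date', 'date_ddmmyyyy']])
```

Keeping 'date' as a datetime column and formatting only for display means sorting and date arithmetic still work; the dd/mm/yyyy column is plain text.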
itprorh66
  • Yes, because the output of the re.findall is a list of matching phrases and you want the first one – itprorh66 May 22 '21 at 14:02
  • yes, I dropped the [0] when I copied over my solution, I have updated my answer – itprorh66 May 22 '21 at 14:13
  • hey, what I got by executing your suggested code was something like this: DatetimeIndex(['2019-03-11'], dtype='datetime64[ns]', freq=None). How do I change it to just get 2019-03-11 in my date column? – rylynn_mcbos May 28 '21 at 06:27
  • When you say you want to get 2019-03-11, what do you mean? The datetime index is giving you that date. Do you mean you want the date in the form of a string '2019-03-11'? – itprorh66 May 28 '21 at 12:55
  • yes, but i got the solution to that now. Thanks a lot for replying though :) – rylynn_mcbos May 28 '21 at 20:01
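
For readers who hit the same DatetimeIndex output: pd.to_datetime returns a DatetimeIndex when it is given a list (which is what re.findall produces) and a single Timestamp when it is given one string. The sketch below shows the difference and one possible way to get a plain string or date from the result. The commenter's own solution is not stated in the thread, so this is an assumption about what was wanted, and the sample line is taken from the question's text.

```python
import re

import pandas as pd

line = 'Mar 11, 2019 • 3:26AM'
matches = re.findall(r'^\w+ \d{1,2}, \d{4}', line)  # a list of matching strings

as_index = pd.to_datetime(matches)      # list in, DatetimeIndex out
as_stamp = pd.to_datetime(matches[0])   # single string in, Timestamp out

# Formatting a Timestamp as text
iso_text = as_stamp.strftime('%Y-%m-%d')
ddmmyyyy = as_stamp.strftime('%d/%m/%Y')

# A plain datetime.date object without the time component
plain_date = as_stamp.date()

print(type(as_index).__name__, type(as_stamp).__name__)
print(iso_text, ddmmyyyy, plain_date)
```

For a whole datetime column, the equivalent calls would be df['date'].dt.strftime(...) or df['date'].dt.date.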