2

Background:
I have the following pandas DataFrame:

[Image of the DataFrame; not available]

Objective:
Each field in the tweet column contains tweets (duh!). I am trying to do two things:

  • Delete all characters from the string before 'InSight'. So all tweets would begin 'InSight sol...'
  • Extract dates from the tweets (they appear just before 'InSight') and save these in a new column named 'Date'.

What I've tried:
I've tried things such as split_string = tweets_df.split("InSight", 1), but I can't work out how to split on a substring that is part of the text rather than on a separate delimiter.

Any advice would be greatly appreciated.

William

3 Answers

0

Try using:

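pandas.DataFrame.applymap: Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which has the same elementwise behavior.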

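# Keep everything from 'InSight' onward
new_df = df.filter(['tweet']).applymap(lambda x: x[x.find('InSight'):])
# Take the text between the first '-' and 'InSight' as the date
dates_df = df.filter(['tweet']).applymap(lambda x: x[x.find('-') + 1:x.find('InSight')])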
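
One caveat with the slicing above: str.find returns -1 when the substring is missing, so a tweet without 'InSight' would be silently cut down to its last character. A minimal sketch of a more defensive variant follows. The sample tweets and the helper name split_at_marker are made up for illustration, since the real data is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data; the real tweets are not shown in the question.
df = pd.DataFrame({
    "tweet": [
        "Mars Weather@MarsWxReport-Jul 15InSight sol 58 low -95C high -13C",
        "no marker in this tweet",
    ]
})

def split_at_marker(text, marker="InSight"):
    """Return (date_part, tweet_part); (None, text) when the marker is absent."""
    pos = text.find(marker)
    if pos == -1:
        return None, text
    prefix = text[:pos]
    # rpartition splits at the LAST hyphen before the marker;
    # if there is no hyphen, the whole prefix is returned.
    date_part = prefix.rpartition("-")[2]
    return date_part, text[pos:]

pairs = [split_at_marker(t) for t in df["tweet"]]
df["Date"] = [d for d, _ in pairs]
df["tweet"] = [t for _, t in pairs]
```

Rows without the marker keep their original text and get a missing value in 'Date', which makes them easy to find afterwards with df["Date"].isna().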
Mateo Lara
0

You need to assign the trimmed column back to the original column instead of subsetting. Also, the str.replace method doesn't have to_replace and value parameters; it has pat and repl instead. Since pandas 2.0, str.replace treats the pattern as a literal string by default, so pass regex=True for a regular expression:

Example:

df["Date"] = df["Date"].str.replace(r"\s:00", "", regex=True)

df
#   ID       Date 
#0   1  8/24/1995
#1   2   8/1/1899
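
To make the point about assigning back concrete, here is a small self-contained sketch. The input values are an assumption, chosen only so that the result matches the output shown above; the original input for this example is not given.

```python
import pandas as pd

# Hypothetical input, picked so the cleaned result matches the output above.
df = pd.DataFrame({"ID": [1, 2], "Date": ["8/24/1995 :00", "8/1/1899 :00"]})

# str.replace returns a new Series; without assignment the column is unchanged.
df["Date"].str.replace(pat=r"\s:00", repl="", regex=True)
unchanged = df["Date"].tolist()

# Assigning the result back is what actually updates the column.
df["Date"] = df["Date"].str.replace(pat=r"\s:00", repl="", regex=True)
```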
0

To keep only the text from InSight onward, use str.replace with a positive lookahead regex to strip everything before it:

df['text'] = df['tweet'].str.replace('.*(?=InSight)', '', regex=True)

To extract the date in the provided format, use str.extract with a positive lookbehind regex (a raw string avoids invalid-escape warnings for \w and \d):

df['date'] = df['tweet'].str.extract(r'(?<=-)(\w{3} \d{2})')

Output

                                               tweet            text    date
0  Mars Weather@Marsweatherreport-Jul 15InSight s...  InSight sol 58  Jul 15
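
Both steps can also be done in one pass with named capture groups, which str.extract turns into column names. This is a sketch under assumptions: the sample tweets are invented, and the pattern assumes a three-letter month followed by a one- or two-digit day directly before 'InSight'.

```python
import pandas as pd

# Hypothetical sample data modelled on the format shown in the output above.
df = pd.DataFrame({
    "tweet": [
        "Mars Weather@MarsWxReport-Jul 15InSight sol 58 low -95C",
        "Mars Weather@MarsWxReport-Jul 4InSight sol 47 low -96C",
    ]
})

# Named groups become the column names of the extracted DataFrame.
# \d{1,2} also accepts single-digit days, unlike \d{2}.
pattern = r"-(?P<Date>[A-Z][a-z]{2} \d{1,2})(?P<text>InSight.*)"
extracted = df["tweet"].str.extract(pattern)

# Attach the new columns; rows that do not match get missing values.
df = df.join(extracted)
```

Note that '.' does not match newlines by default, so for multi-line tweets the text group would stop at the first line break unless flags=re.DOTALL is passed to str.extract.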
Vishnudev Krishnadas