1

I am using an AirBnb dataset. I have a column, 'host_since'. The column contains date objects in the format of 'DD/MM/YYYY': for example, 24/09/2008. The columns' data shows the date that an individual became a host.

I want to create a new column in my dataframe that contains the the number of days since the host first joined. I am aware of the to_datetime method but cant quite understand the documentation.

note: df['host_days'] does not exist in the df. I want to create a column then assign the appropriate steps to get my outcome

def add_host_days(df):    
    df['host_days'] = df['host_since'].to_datetime()
  return df

Any ideas on how to do so? Thank you for your input.

DropKick
  • 89
  • 1
  • 8
  • 3
    [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – It_is_Chris May 28 '21 at 15:25
  • 2
    I think you need `pd.to_datetime(df['host_since'])` as per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html . To debug you could take one of the example datetimes (e.g. use `df['host_since'].head()`) and pass it into `pd.to_datetime(my_copied_dt)` to check that the format is converted correctly - check for `dayfirst` and `yearfirst` – Ian Ozsvald May 28 '21 at 15:50
  • yeah, my approach was somewhat similar, but wrapping it up within one function is something im am not good at. The to_datetime is great but the documentation is something i spent an hour or two playing with and still couldnt quite graps it. – DropKick May 28 '21 at 16:03

2 Answers2

2

You can try this:

def add_host_days(df):    
    df['host_days'] = (pd.Timestamp.now() - pd.to_datetime(df['host_since'], dayfirst=True)).dt.days

    # If you original date fields have invalid dates and would like this number of days to be in integer:
    df['host_days'] = df['host_days'].astype('Int64')   

    return df

Demo

Suppose you have a dataframe like this:

  guest_id  host_since
0    A0001  24/09/2008
1    A0002  25/09/2008
2    A0003  29/09/2008
3    A8788  20/05/2021

Then you run the code:

new_df = add_host_days(df)

Result:

print(new_df)


  guest_id  host_since  host_days
0    A0001  24/09/2008       4629
1    A0002  25/09/2008       4628
2    A0003  29/09/2008       4624
3    A8788  20/05/2021          8
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • you're my guardian angel , you special person. I am getting the days as a float however. Assuming ```.astype(int)``` will fix this – DropKick May 28 '21 at 16:00
  • 1
    @DropKick Haha, you are most welcome! :-) Nice programming to you! – SeaBean May 28 '21 at 16:01
  • @DropKick The return types of `dt.days` should be integer. You got float probably some elements in the column have NaN values. See if you have some invalid dates in your original `host_since` column. – SeaBean May 28 '21 at 16:04
  • there seems to be missing entries, empty strings to be exact – DropKick May 28 '21 at 16:07
  • @DropKick In that case you can use .astype('Int64') to cast the column to integer. – SeaBean May 28 '21 at 16:09
  • @DropKick See my edit above, with `.astype('Int64') at the end of the main line of code. – SeaBean May 28 '21 at 16:11
  • 1
    Yeah, so the problem is the same as the day before. I just added a new line within the function ``` df['host_days'] = df['host_days'].astype('Int64') ``` and it seems to work. Thanks again for your help! I only have one more function I need to create and my preprocessing is finally finished haha! I'll give it go before i post here. – DropKick May 28 '21 at 16:12
  • @DropKick Yes, this is also ok, adding the type casting in separate line – SeaBean May 28 '21 at 16:13
2

Couldn't test it as you haven't given any reproducible code but you can subtract host_since from current date and capture the days like -

def add_host_days(df):    
    df['host_days'] = df.apply(lambda x: (datetime.datetime.today() - x.host_since).days, axis=1)
    return df
SunilG
  • 347
  • 1
  • 4
  • 10