
It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:

date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...

I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.

My current (non-working, but compiling code):

import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm

df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date', data=city_data).fit()

As I vary the city value, I get R^2 = 1 results, which is wrong. I have also attempted passing `index_col=0, parse_dates=True` when defining the dataframe df, but without success.

I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!

Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:

'\xef\xbb\xbf2014-04-28'

How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).

Quetzalcoatl
  • certainly, one solution which is unpythonic and unpandas-like is of the form: datecol = london['date']; londates = []; for x in datecol: londates.append(float(x.replace('-',''))) and then regress specified london data over the londates array – Quetzalcoatl Jul 05 '14 at 19:11
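The loop in the comment above has a vectorized pandas equivalent; as a sketch (the `london` frame and its values here are made up for illustration):

```python
import pandas as pd

london = pd.DataFrame({'date': ['2014-04-28', '2014-04-29']})

# Vectorized version of the loop in the comment above: strip the
# dashes from each date string and cast the result to float in one pass.
londates = london['date'].str.replace('-', '', regex=False).astype(float)
print(londates.tolist())   # [20140428.0, 20140429.0]
```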

4 Answers


For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.

This does the trick nicely:

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Integer number of days since the earliest date in the data
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()

The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
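A minimal sketch of that day-of-year variant, using pandas' `dt.dayofyear` accessor (the column name and sample values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-04-28', '2014-12-31']),
    'sales': [1000.0, 1091.28, 1200.0],
})

# Days elapsed since Jan 1st of each date's own year (Jan 1st itself is 0),
# useful for regressing on a within-year seasonal trend.
df['day_of_year'] = df['date'].dt.dayofyear - 1
print(df['day_of_year'].tolist())   # [0, 117, 364]
```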

statsmodels also supports time series indexed by pandas dates, and you may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html

Also, a quick note: You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces between the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
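Both issues can also be handled at read time: the `'\xef\xbb\xbf'` prefix is a UTF-8 byte-order mark, which `encoding='utf-8-sig'` strips, and `skipinitialspace=True` trims the spaces after the commas. A sketch with a simulated file:

```python
import io
import pandas as pd

# Simulate the problem file: UTF-8 BOM bytes plus spaces after the header commas.
raw = b'\xef\xbb\xbfdate, city, players, sales\n2014-04-28,London,111,1091.28\n'

# encoding='utf-8-sig' strips the BOM; skipinitialspace=True trims the
# spaces after each delimiter, so the column names come out clean.
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig', skipinitialspace=True)
df['date'] = pd.to_datetime(df['date'])
print(list(df.columns))   # ['date', 'city', 'players', 'sales']
```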

Tom Q.
  • Thank you. I'm currently getting this error (which stems from your third line on 'date_delta' I believe): result[mask] = op(x[mask], y) TypeError: unsupported operand type(s) for -: 'str' and 'str' Now, if I delete this 'date_delta' setting, then it compiles but the problem is that len(result.params) = # of unique dates, instead of the number of parameters for one linear regression performance. Comments welcome. – Quetzalcoatl Jul 05 '14 at 23:05
  • if it may be of use: >> pd.version – Quetzalcoatl Jul 05 '14 at 23:05
  • 1
    Hmm, that's interesting. I have the same version of Pandas as you. It sounds like your `df['date']` column is still of type `str`. The second line `df['date'] = pd.to_datetime(df['date'])` should take care of converting the strs to date objects. Maybe it's failing because of strange characters in your input csv. I noticed you pasted that some of the values look like '\xef\xbb\xbf2014-04-28', which looks to me like some garbage characters (possibly from an imperfect UTF-8 conversion) are present. I'd recommend sanitizing the input and giving it another shot. – Tom Q. Jul 06 '14 at 01:39
  • indeed, the problem is with date entries that have weird spacing, e.g. " 20140428" instead of "20140428". The former returns as ' \xef\xbb\xbf20140428' and I can't directly convert it to float and Pandas doesn't seem able to either (which is surprising?). Is there a UTF-8 trick so Pandas will clean such entries for me? I'll try some ad hoc cleaning and will get back soon – Quetzalcoatl Jul 06 '14 at 10:25
  • my ad hoc solution employed s = ''.join(j for j in x if j.isdigit()), iterating over date entries x, followed by float(s). This and your comments led me to a working solution. I'd still be interested in more elegant ways of tackling this float conversion. Thanks again – Quetzalcoatl Jul 06 '14 at 17:01

Get date as floating-point year

I prefer a date format that can be understood without context, hence the floating-point year representation. The nice thing here is that the solution works at the numpy level, and hence should be fast.

import numpy as np
import pandas as pd

def dt64_to_float(dt64):
    """Converts numpy.datetime64 to year as float.

    Rounded to days

    Parameters
    ----------
    dt64 : np.datetime64 or np.ndarray(dtype='datetime64[X]')
        date data

    Returns
    -------
    float or np.ndarray(dtype=float)
        Year in floating point representation
    """

    year = dt64.astype('M8[Y]')
    # print('year:', year)
    days = (dt64 - year).astype('timedelta64[D]')
    # print('days:', days)
    year_next = year + np.timedelta64(1, 'Y')
    # print('year_next:', year_next)
    days_of_year = (year_next.astype('M8[D]') - year.astype('M8[D]')
                    ).astype('timedelta64[D]')
    # print('days_of_year:', days_of_year)
    dt_float = 1970 + year.astype(float) + days / (days_of_year)
    # print('dt_float:', dt_float)
    return dt_float

if __name__ == "__main__":

    dates = np.array([
        '1970-01-01', '2014-01-01', '2020-12-31', '2019-12-31', '2010-04-28'],
        dtype='datetime64[D]')

    df = pd.DataFrame({
        'date': dates,
        'number': np.arange(5)
        })

    df['date_float'] = dt64_to_float(df['date'].to_numpy())
    print('df:', df, sep='\n')
    print()

    dt64 = np.datetime64( "2011-11-11" )
    print('dt64:', dt64_to_float(dt64))

output

df:
        date  number   date_float
0 1970-01-01       0  1970.000000
1 2014-01-01       1  2014.000000
2 2020-12-31       2  2020.997268
3 2019-12-31       3  2019.997260
4 2010-04-28       4  2010.320548

dt64: 2011.8602739726027
Markus Dutschke

I'm not sure about the specifics of the statsmodels, but this post lists all the date/time conversions for python. They aren't always one-to-one, so it's a reference I used often ;-)

Wyrmwood
(df['date'] - df['date'].min()).dt.total_seconds()

If the data type of your date column is datetime64[ns], subtract a reference timestamp to get a timedelta; `dt.total_seconds()` on that timedelta then returns the elapsed time as a float number of seconds. (Note that `total_seconds` is defined on timedeltas, not on the datetime column itself.)
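A self-contained sketch of that conversion (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2014-04-28', '2014-04-29'])})

# Subtract a reference timestamp to get a timedelta column, then
# convert each timedelta to a float number of seconds.
df['seconds'] = (df['date'] - df['date'].min()).dt.total_seconds()
print(df['seconds'].tolist())   # [0.0, 86400.0]
```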

dkorsakas