
I have the following table contained in the DataFrame df:

date         val1   val2    user_id  val3      val4    val5    val6
01/01/2011  1   100 3    sterling  100     3       euro
01/02/2013  20  8        sterling  12      15      euro
01/07/2012      19  57   sterling  9       6       euro     
01/11/2014  3100    49  6        sterling  15      3       euro
21/12/2012          240  sterling  240     30      euro 
14/09/2013      21  63   sterling  34      23      euro         
01/12/2013  3200    51  20       sterling  93      56      euro

The code used to obtain the above table is:

import pandas as pd

# column names for the headerless CSV
myheaders = ['date', 'val1', 'val2', 'val3', 'val4', 'user_id', 'val5', 'val6']
df = pd.read_csv('mytest.csv', names=myheaders, header=None,
                 parse_dates=True, dayfirst=True)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# reorder the columns
df = df.loc[:, ['date', 'user_id', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6']]
df1 = df.pivot(index='date', columns='user_id')

However, I would like to know why, when I add the statement df2 = df1.resample('M') at the end of the code above, I obtain a DataFrame df2 whose fields look like:

        val1  val5
user_id date

instead of:

        val1  val2  val3  val4  val5  val6

user_id date

Thanks in advance for your help.

Space

1 Answer


You can do a resample of a groupby, provided you have a DatetimeIndex:

In [11]: df
Out[11]:
        date  val1  val2  user_id  val3  val4  val5  val6
0 2011-01-01     1   100        3     5   100     3     5
1 2013-01-02    20     8        6    12    15     3   NaN
2 2012-01-07    19    57       10     9     6     6   NaN
3 2014-01-11  3100    49        6    12    15     3   NaN
4 2012-12-21   240    30      240    30   NaN   NaN   NaN
5 2013-09-14    21    63       90    34    23     6   NaN
6 2013-01-12  3200    51       20    50    93    56   NaN

In [12]: df2 = df.set_index('date')  # now you have a DatetimeIndex

In [13]: df2
Out[13]:
            val1  val2  user_id  val3  val4  val5  val6
date
2011-01-01     1   100        3     5   100     3     5
2013-01-02    20     8        6    12    15     3   NaN
2012-01-07    19    57       10     9     6     6   NaN
2014-01-11  3100    49        6    12    15     3   NaN
2012-12-21   240    30      240    30   NaN   NaN   NaN
2013-09-14    21    63       90    34    23     6   NaN
2013-01-12  3200    51       20    50    93    56   NaN

In [14]: df2.groupby('user_id').resample('M').dropna(how='all')
Out[14]:
                    val1  val2  user_id  val3  val4  val5  val6
user_id date
3       2011-01-31     1   100        3     5   100     3     5
6       2013-01-31    20     8        6    12    15     3   NaN
        2014-01-31  3100    49        6    12    15     3   NaN
10      2012-01-31    19    57       10     9     6     6   NaN
20      2013-01-31  3200    51       20    50    93    56   NaN
90      2013-09-30    21    63       90    34    23     6   NaN
240     2012-12-31   240    30      240    30   NaN   NaN   NaN
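
On more recent pandas versions, groupby followed by resample returns a resampler object, so an explicit aggregation has to be chained on. A minimal sketch of the equivalent call, assuming the same df2 as above (df3 and value_cols are just illustrative names):

# Select the value columns so the grouping key stays out of the result,
# then aggregate each monthly bucket with mean (newer pandas prefers 'ME'
# over 'M' as the month-end alias).
value_cols = ['val1', 'val2', 'val3', 'val4', 'val5', 'val6']
df3 = (df2.groupby('user_id')[value_cols]
          .resample('M')
          .mean()
          .dropna(how='all'))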
Andy Hayden
  • I tried the above: it still provides an output with only two fields instead of 6. – Space Apr 03 '14 at 18:48
  • Maybe it is due to the fact that some fields such as 'val3' and 'val6' contain strings instead of real values. – Space Apr 03 '14 at 18:53
  • @user3102882 yeah, that could definitely break the resample (which uses mean under the hood). You'll notice my blank values are NaN; I strongly urge you to use those (float dtype rather than object). See: http://stackoverflow.com/a/17534682/1240268 – Andy Hayden Apr 03 '14 at 18:55
  • @user3102882 oh, I see what you mean... what do you expect the output to be with a resample of strings (e.g. what is the mean of a list of strings)? pandas can't resample non-numerics with a numeric agg like mean, so you have to pass it a different agg function (one which can handle strings). – Andy Hayden Apr 03 '14 at 19:01
  • How can I copy just 'date', 'user_id', 'val1', 'val2', 'val3' and 'val4' into a new dataframe (also without 'val5', 'val6')? – Space Apr 03 '14 at 19:08
  • Yeah, I believe that I also have to first delete the columns containing strings (which represent units/currencies) by doing something like df.drop([column name or list], inplace=True, axis=1), as in the sketch below. – Space Apr 03 '14 at 19:27
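
Following up on the comments above, here is a minimal sketch of one way to keep only the numeric columns before resampling. It assumes, per the comments, that 'val3' and 'val6' are the columns holding currency strings; the names value_cols, numeric_part and monthly are just for illustration:

import pandas as pd

# Keep date, user_id and the numeric value columns only; 'val3' and 'val6'
# are assumed to hold currency strings, so they are left out
# (equivalently: df.drop(['val3', 'val6'], axis=1)).
value_cols = ['val1', 'val2', 'val4', 'val5']
numeric_part = df[['date', 'user_id'] + value_cols].copy()

# Coerce blanks or stray strings to NaN so the columns end up as floats.
numeric_part[value_cols] = numeric_part[value_cols].apply(pd.to_numeric, errors='coerce')

# Same pattern as in the answer: DatetimeIndex, then groupby + resample + mean.
monthly = (numeric_part.set_index('date')
                       .groupby('user_id')[value_cols]
                       .resample('M')
                       .mean()
                       .dropna(how='all'))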