
The bug has been found: the code snippets posted as solutions below work. The problem with my results was rooted in the data source (FEC.gov). I have found it and am now moving on. Thanks a bunch for all of the time, patience, and help from the community regarding this issue!

Since solutions have been suggested that work on the snippets found on the GitHub site, I am providing the link to the original files (http://fec.gov/finance/disclosure/ftpdet.shtml#a2011_2012). I am using years 2008 to 2014, Data File: pas212.zip, Data Name: Contributions to Candidates (and other expenditures) from Committees. The code below can also be found at https://github.com/Michae108/python-coding.git. Thank you in advance for any help in resolving this issue. I've been working for three days on what should be a very simple task. I import and concatenate four "|"-separated value files, read them into a pandas DataFrame, and set the date column as a datetime index. This gives me the following output:

              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date                                                                          
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155
2007-07-20  C00346296       24K        CCM    CA    1000  C00434571  H8CA37137

Second, I want to group the index by one-month frequency and then sum [amount] by [trans_typ] and [cand_id].

Here is my code for doing that:

import numpy as np
import pandas as pd
import glob

df = pd.concat(
    (pd.read_csv(f,
                 sep='|',
                 header=None,
                 low_memory=False,
                 names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
                        '12', '13', 'date', '15', '16', '17', '18', '19', '20',
                        '21', '22'],
                 index_col=None,
                 dtype={'date': str})
     for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))

df.dropna(subset=['17'], inplace=True)
df.dropna(subset=['date'], inplace=True)
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
df2 = df1[['1', '6', '7', '10', '15', '16', '17']].copy()
df2.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount',
               'fec_id', 'cand_id']

df2['amount'] = df2['amount'].astype(float)

grouper = df2.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])

df = grouper['amount'].sum().unstack().fillna(0)
#print(df.head())

Here is my output from running the code:

trans_typ              24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
1954-07-31 S8AK00090     0    0    0    0  1000    0    0    0
1985-09-30 H8OH18088     0    0   36    0     0    0    0    0
1997-04-30 S6ND00058     0    0    0    0  1000    0    0    0

As you can see, the date column gets messed up after I run the groupby. I am certain that my dates do not go further back than 2007. I have tried to do this simple task of grouping by one-month periods and then summing [amount] by [trans_typ] and [cand_id]. It seems that it should be simple, but I have found no solution. I have read many questions on Stack Overflow and have tried different techniques to solve the problem. Does anyone have an idea on this?
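
For what it's worth, here is a quick diagnostic I can run (a sketch; `df1` is the date-indexed frame built above) to confirm the span of the parsed dates and inspect any rows falling outside 2007-2014:

# Sanity-check the span of the parsed dates
print(df1.index.min(), df1.index.max())

# Inspect any rows whose dates fall outside the expected 2007-2014 window
bad = df1[(df1.index < '2007-01-01') | (df1.index > '2014-12-31')]
print(bad.head())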

Here is a sample of my raw data if it helps:

C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||4072320091116608455
C00409409|N|Q2|P|29992447807|24K|CCM|JOHN BOCCIERI FOR CONGRESS|ALLIANCE|OH|44601|||06262009|500|C00435065|H8OH16058|D310244|424490|||4072320091116608452
C00409409|N|Q2|P|29992447807|24K|CCM|MIKE MCMAHON FOR CONGRESS|STATEN ISLAND|NY|10301|||06262009|500|C00451138|H8NY13077|D310245|424490|||4072320091116608453
C00409409|N|Q2|P|29992447808|24K|CCM|MINNICK FOR CONGRESS|BOISE|ID|83701|||06262009|500|C00441105|H8ID01090|D310243|424490|||4072320091116608454
C00409409|N|Q2|P|29992447807|24K|CCM|ADLER FOR CONGRESS|MARLTON|NJ|08053|||06262009|500|C00439067|H8NJ03156|D310247|424490|||4072320091116608451
C00435164|N|Q2|P|29992448007|24K|CCM|ALEXI FOR ILLINOIS EXPLORATORY COMMITTEE||||||06292009|1500|C00459586|S0IL00204|SB21.4124|424495|||4071620091116385529
Collective Action
  • This is too much of a pain to attempt to recreate your data. Much better if you make a small (as small as absolutely possible) example dataframe and include only code relevant to your problem. Way too much extraneous code and data above... – JohnE Mar 11 '16 at 18:09
  • @John, thank you for your suggestion. I have opened a GitHub repository with snippets of the raw data as well as my existing code. (https://github.com/Michae108/python-coding.git) – Collective Action Mar 11 '16 at 20:32
  • I tested it and with the sample it works very well. Maybe try checking the `min` of the `index` with `print df.index.min()`. – jezrael Mar 12 '16 at 09:33
  • Thanks for looking into it. I've been trying to problem-solve it for a couple of days. I did as you suggested and sure enough `.index.min()` returned a timestamp of 1954-07-31. But when I did a Ctrl+F through all of the raw data files I could not find 07311954 (the date format in the raw files is %m%d%Y). I originally thought that there was a problem with the dtype but I corrected that. I really have no clue how to go forward except for maybe just parsing one file at a time instead of all of them together. Do you have any clue as to why the dates could be getting messed up on import? – Collective Action Mar 12 '16 at 09:47
  • Function `TimeGrouper` returns the end of the month, so you have to find data in the range `07011954` to `07311954`, which get aggregated to `1954-07-31` (see the sketch below this thread). – jezrael Mar 12 '16 at 09:52
  • The saga continues: @jezrael, I just imported each raw file separately and checked the DataFrame's date column with `print(any(df.date == 7311954))`, and there was no instance of the date from 1954. I think that the dates are getting messed up with the groupby. Do you have any idea what could be the problem? – Collective Action Mar 12 '16 at 11:17
  • Did you try my [answer](http://stackoverflow.com/a/35956873/2901002)? It checks for all `1954` values in column `date`. – jezrael Mar 12 '16 at 13:20
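
To illustrate jezrael's point about month-end labels, here is a minimal sketch (using the same `TimeGrouper` as in the question): a single row dated anywhere in July 1954 gets labeled `1954-07-31` in the grouped result.

import pandas as pd

# One stray transaction in mid-July 1954...
s = pd.Series([1000], index=pd.to_datetime(['1954-07-15']))

# ...is labeled with the end of its month by TimeGrouper
print(s.groupby(pd.TimeGrouper('1M')).sum())
# 1954-07-31    1000
# Freq: M, dtype: int64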

2 Answers


UPDATE

I guess the problem, as @jezrael has already mentioned, was caused by missing dates and the following two lines:

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True) 

That's why you can first find the "problematic" rows and then sanitize them (set them to some date that makes sense for you):

import pandas as pd
import glob

def get_headers(fn):
    with open(fn, 'r') as f:
        for line in f:
            if ',' in line:
                return line.strip().split(',')


####################################################
# Data Dictionary - Contributions to Candidates from Committees
# http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml
# http://www.fec.gov/finance/disclosure/metadata/pas2_header_file.csv
#
headers_file = 'pas2_header_file.csv'

interesting_cols = ['CMTE_ID', 'TRANSACTION_TP', 'ENTITY_TP', 'STATE',
                    'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID']

#
# rename columns rules
#
rename_cols = {
  'TRANSACTION_TP':     'trans_typ',
  'TRANSACTION_DT':     'date',
  'TRANSACTION_AMT':    'amount',
  'OTHER_ID':           'fec_id',
}

#
# all columns/headers (already renamed)
#
all_cols = [rename_cols.get(col, col.lower())
            for col in get_headers(headers_file)]

#
# columns to use in read_csv() (already renamed)
#
cols = [rename_cols.get(col, col.lower())
        for col in get_headers(headers_file) if col in interesting_cols]


####################################################


df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=cols,
            header=None,
            low_memory=False,
            names=all_cols,
            index_col=None,
            parse_dates=['date'],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
         )
         for f in glob.glob('./itpas2.txt'))
     )

# print rows where 'date' is empty
print(df[pd.isnull(df.date)])

#
# sanitize NaT/empty dates in order to prevent problems with an index in future
#
df.date.fillna(pd.Timestamp('20110101'), inplace=True)

# the rest is your code almost unchanged:
grouper = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
grouper['amount'].sum().unstack().fillna(0)

Empty date rows:

          cmte_id trans_typ entity_tp state date  amount     fec_id    cand_id
52372   C00317446       24K       NaN    CA  NaT    2500  C00409219  H6CA05195
57731   C00416693       24K       IND    DC  NaT    2500  C00463836  H2NM02126
58386   C00152892       24K       NaN    DC  NaT    1000  C00359034  H0MO06073
145715  C00154641       24K       IND    DC  NaT    1000  C00257337  H2CA37023
193651  C00000992       24K       NaN    MI  NaT     500  C00390724  H4MI07103
212982  C00454074       24E       ORG    CA  NaT    1138  S2TX00312  S2TX00312
212983  C00454074       24E       ORG    CA  NaT    4764  S2TX00312  S2TX00312
212984  C00454074       24E       ORG    CA  NaT    7058  S2MO00403  S2MO00403
212985  C00454074       24E       ORG    CA  NaT    5000  S2MO00403  S2MO00403
212986  C00454074       24E       ORG    CA  NaT   50003  S8WI00158  S8WI00158
212987  C00454074       24E       ORG    CA  NaT    8830  S8WI00158  S8WI00158
212988  C00454074       24E       ORG    CA  NaT   22189  S8WI00158  S8WI00158
212989  C00454074       24E       ORG    CA  NaT   11258  S8WI00158  S8WI00158
212990  C00454074       24E       ORG    CA  NaT    5000  S8WI00158  S8WI00158
212991  C00454074       24E       ORG    CA  NaT    7743  S2MO00403  S2MO00403
212992  C00454074       24E       ORG    CA  NaT   12463  S0MI00056  S0MI00056
212993  C00454074       24E       ORG    CA  NaT    2795  S8WI00158  S8WI00158
213034  C00454074       24E       ORG    CA  NaT    6431  S2IN00083  S2IN00083
213035  C00454074       24E       ORG    CA  NaT   28015  S2TX00312  S2TX00312
213036  C00454074       24E       ORG    CA  NaT    5395  S8NE00091  S8NE00091
213037  C00454074       24E       ORG    CA  NaT   19399  S2MO00403  S2MO00403
213038  C00454074       24E       ORG    CA  NaT    2540  S2IN00083  S2IN00083
213039  C00454074       24E       ORG    FL  NaT    1500  S2IN00083  S2IN00083
213040  C00454074       24E       ORG    CA  NaT    8065  S2TX00312  S2TX00312
213041  C00454074       24E       ORG    CA  NaT   11764  S2TX00312  S2TX00312
213042  C00454074       24E       ORG    CA  NaT   61214  S2TX00312  S2TX00312
213043  C00454074       24E       ORG    CA  NaT   44634  S2MO00403  S2MO00403
213044  C00454074       24E       ORG    TN  NaT   15000  S2TX00312  S2TX00312
213045  C00454074       24E       ORG    CA  NaT    5176  S2TX00312  S2TX00312
214642  C90014358       24E       NaN    VA  NaT    2000  S6MT00097  S6MT00097
214643  C90014358       24E       NaN    VA  NaT    2000  H2MT01060  H2MT01060
214644  C90014358       24E       NaN    DC  NaT     139  S6MT00097  S6MT00097
214645  C90014358       24E       NaN    DC  NaT     139  H2MT01060  H2MT01060
214646  C90014358       24E       NaN    DC  NaT     149  S6MT00097  S6MT00097
214647  C90014358       24E       NaN    DC  NaT     149  H2MT01060  H2MT01060
216428  C00023580       24E       ORG    VA  NaT    3352  P80003338  P80003338
216445  C00023580       24E       ORG    VA  NaT     250  P80003338  P80003338
216446  C00023580       24E       ORG    VA  NaT     333  P80003338  P80003338
216447  C00023580       24E       ORG    VA  NaT    2318  P80003338  P80003338
216448  C00023580       24E       ORG    VA  NaT     583  P80003338  P80003338
216449  C00023580       24E       ORG    VA  NaT    2969  P80003338  P80003338
216450  C00023580       24E       ORG    VA  NaT   14011  P80003338  P80003338
216451  C00023580       24E       ORG    VA  NaT     383  P80003338  P80003338
216452  C00023580       24E       ORG    VA  NaT     366  P80003338  P80003338
216453  C00023580       24E       ORG    VA  NaT     984  P80003338  P80003338
216454  C00023580       24E       ORG    VA  NaT     542  P80003338  P80003338
216503  C00023580       24E       ORG    VA  NaT    3077  P80003338  P80003338
216504  C00023580       24E       ORG    VA  NaT    3002  P80003338  P80003338
216505  C00023580       24E       ORG    VA  NaT    5671  P80003338  P80003338
216506  C00023580       24E       ORG    VA  NaT    3853  P80003338  P80003338
231905  C00454074       24E       ORG    CA  NaT   26049  S4WV00084  S4WV00084
231906  C00454074       24E       ORG    NC  NaT  135991  P80003353  P80003353
231907  C00454074       24E       ORG    FL  NaT    5000  P80003353  P80003353
231908  C00454074       24E       ORG    TX  NaT   12500  P80003353  P80003353
231909  C00454074       24A       ORG    TX  NaT   12500  P80003338  P80003338
234844  C00417519       24K       NaN    NY  NaT    2500  C00272633  H2NY26080
281989  C00427203       24K       NaN    DC  NaT     500  C00412304  S6MT00162
309146  C00500785       24A       NaN   NaN  NaT       0  H4FL20023  H4FL20023
310225  C00129189       24K       NaN    MI  NaT    1000  C00347476  H0MI10071

PS: I've added some helper functions/variables (`get_headers()`, `interesting_cols`, `rename_cols`, `all_cols`, `cols`) which might help you to process different data/CSV files from fec.gov in the future.
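
As a quick illustration, `get_headers()` just returns the first comma-separated line of the header file as a list of column names; run against `pas2_header_file.csv` it should yield the official FEC field names, roughly as sketched below (abbreviated, and the exact names may change between filing cycles):

print(get_headers('pas2_header_file.csv'))
# e.g. ['CMTE_ID', ..., 'TRANSACTION_TP', 'ENTITY_TP', ..., 'STATE', ...,
#       'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID', ...]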

Original answer based on sample data

Code for the specified "cut" sample data sets:

#import numpy as np
import pandas as pd
import glob

#dtparser = lambda x: pd.datetime.fromtimestamp(int(x))

cols = ['cmte_id', 'trans_typ', 'entity_typ', 'state',
        'date', 'amount', 'fec_id', 'cand_id']

df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=[0, 5, 6, 9, 13, 14, 15, 16],
            header=None,
            low_memory=False,
            #names=cols,
            index_col=None,
            parse_dates=[13],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
            #dtype={5: np.float64}
         )
         for f in glob.glob('./itpas2**github.txt'))
     )
df.columns = cols
df.trans_typ = df.trans_typ.astype('category')
#print(df.head())
#print(df.dtypes)

a = (df.set_index('date')
       .groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
       .agg({'amount': sum})
       .reset_index())

print(a.pivot_table(index=['date', 'cand_id'],
                    columns='trans_typ',
                    values='amount',
                    fill_value=0,
                    aggfunc='sum').tail(10))

Output:

trans_typ             24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
2013-02-28 S0FL00312    0    0    0    0     0    0    0    0
           S0IA00028    0    0    0    0     0    0    0    0
           S0IL00204    0    0    0    0     0    0    0    0
           S2ND00099    0    0    0    0  1000    0    0    0
           S4ME00055    0    0    0    0     0    0    0    0
           S4SC00240    0    0    0    0  5000    0    0    0
           S6MN00267    0    0    0    0     0    0    0    0
           S6NV00028    0    0    0    0  2500    0    0    0
           S6PA00100    0    0    0    0     0    0    0    0
           S8MT00010    0    0    0    0  3500    0    0    0

PS: In your sample files there is only a single `trans_typ` value (`24K`), so there is nothing to pivot across. I have therefore manipulated the CSV files so that we have different values now. (A quick check for this is sketched below.)
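
A quick way to verify this before pivoting (a sketch, assuming the `df` built above):

# If this prints a single transaction type, the pivoted table
# would collapse to one column
print(df.trans_typ.value_counts())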

MaxU - stand with Ukraine

It was complicated. `date_parser` returned an error, so the `date` column was first read as a string in `read_csv`. Then the `date` column was converted with `to_datetime` and all `NaN` values were removed. Finally, you can use `groupby` and `unstack`:

import pandas as pd
import glob

# change the path to your own
df = pd.concat((pd.read_csv(f,
                            sep='|',
                            header=None,
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'],
                            usecols=['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            dtype={'date': str}
                           ) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)


#parse column date to datetime
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

#remove rows, where date is NaN
df = df[df['date'].notnull()]

#set column date to index
df = df.set_index('date')

g = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))

trans_typ                24A  24C   24E  24F     24K  24N  24R  24Z
date       cand_id                                                 
2001-09-30 H2HI02110       0    0     0    0    2500    0    0    0
2007-03-31 S6TN00216       0    0     0    0    2000    0    0    0
2007-10-31 H8IL21021       0    0     0    0   -1000    0    0    0
2008-03-31 S6TN00216       0    0     0    0    1000    0    0    0
2008-07-31 H2PA11098       0    0     0    0    1000    0    0    0
           H4KS03105       0    0     0    0   49664    0    0    0
           H6KS03183       0    0     0    0    1000    0    0    0
2008-10-31 H8KS02090       0    0     0    0    1000    0    0    0
           S6TN00216       0    0     0    0    1500    0    0    0
2008-12-31 H6KS01146       0    0     0    0    2000    0    0    0
2009-02-28 S6OH00163       0    0     0    0   -1000    0    0    0
2009-03-31 S2KY00012       0    0     0    0    2000    0    0    0
           S6WY00068       0    0     0    0   -2500    0    0    0
2009-06-30 S6TN00216       0    0     0    0   -1000    0    0    0
2009-08-31 S0MO00183       0    0     0    0    1000    0    0    0
2009-09-30 S0NY00410       0    0     0    0    1000    0    0    0
2009-10-31 S6OH00163       0    0     0    0   -2500    0    0    0
           S6WY00068       0    0     0    0   -1000    0    0    0
2009-11-30 H8MO09153       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -1000    0    0    0
           S6OH00163       0    0     0    0    -500    0    0    0
2009-12-31 H0MO00019       0    0     0    0     500    0    0    0
           S6TN00216       0    0     0    0   -1000    0    0    0
2010-01-31 H0CT03072       0    0     0    0     250    0    0    0
           S0MA00109       0    0     0    0    5000    0    0    0
2010-02-28 S6TN00216       0    0     0    0   -1500    0    0    0
2010-03-31 H0MO00019       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -2500    0    0    0
2010-05-31 H0MO06149       0    0     0    0     530    0    0    0
           S6OH00163       0    0     0    0   -1000    0    0    0
...                      ...  ...   ...  ...     ...  ...  ...  ...
2012-12-31 S6UT00063       0    0     0    0    5000    0    0    0
           S6VA00093       0    0     0    0   97250    0    0    0
           S6WY00068       0    0     0    0    1500    0    0    0
           S6WY00126       0    0     0    0   11000    0    0    0
           S8AK00090       0    0     0    0  132350    0    0    0
           S8CO00172       0    0     0    0   88500    0    0    0
           S8DE00079       0    0     0    0    6000    0    0    0
           S8FL00166       0    0     0    0    -932    0    0  651
           S8ID00027       0    0     0    0   13000    0    0  326
           S8ID00092       0    0     0    0    2500    0    0    0
           S8MI00158       0    0     0    0    7500    0    0    0
           S8MI00281     110    0     0    0    3000    0    0    0
           S8MN00438       0    0     0    0   65500    0    0    0
           S8MS00055       0    0     0    0   21500    0    0    0
           S8MS00196       0    0     0    0     500    0    0  650
           S8MT00010       0    0     0    0  185350    0    0    0
           S8NC00239       0    0     0    0   67000    0    0    0
           S8NE00067       0   40     0    0       0    0    0    0
           S8NE00117       0    0     0    0   13000    0    0    0
           S8NJ00392       0    0     0    0   -5000    0    0    0
           S8NM00168       0    0     0    0   -2000    0    0    0
           S8NM00184       0    0     0    0   51000    0    0    0
           S8NY00082       0    0     0    0    1000    0    0    0
           S8OR00207       0    0     0    0   23500    0    0    0
           S8VA00214       0    0   120    0   -2000    0    0    0
           S8WA00194       0    0     0    0   -4500    0    0    0
2013-10-31 P80003338  314379    0     0    0       0    0    0    0
           S8VA00214   14063    0     0    0       0    0    0    0
2013-11-30 H2NJ03183       0    0  2333    0       0    0    0    0
2014-10-31 S6PA00217       0    0     0    0    1500    0    0    0
jezrael
  • Thanks again; this also works, but, as before, I am getting the year 1954. Theoretically I should only have values between 2007 and 2014. I'm going to try now to locate all of the values outside of that range, document them, delete them, and inform the data source (FEC.gov) of the problem. – Collective Action Mar 12 '16 at 14:35
  • I tried your file in the archive `pas212.zip`, unzipped it to `itpas2.txt`, and I think there is no problem. In `groupby` the data are sorted, and the first index is not `1954` but `2001-09-30` (interesting too, but `1954` is worse, I think). Try removing all files in your directory except `itpas2.txt`, then try this solution, and I think (and hope) the result will be the same - no `1954`. – jezrael Mar 12 '16 at 14:43
  • I just did as you suggested ^ and the results were as you said. Thanks for your time and help. However, I need to delete all observations that are not in the range 2007-2014. – Collective Action Mar 12 '16 at 14:56
  • Can I help you with the deleting? I think the best solution is to remove all rows outside the range `2007` to `2014` before the `groupby` (see the sketch after this thread). – jezrael Mar 12 '16 at 15:00
  • Yes, please! I posted a comment here (http://stackoverflow.com/questions/35956712/check-if-certain-value-is-contained-in-pandas-column-python/35956886#35956886) that asked the same. If you can help, it would be great! – Collective Action Mar 12 '16 at 15:03
  • OK, only one question: can the data be between `20070101` and `20140101`, as in `(df['date'] > '2007-01-01') & (df['date'] < '2014-01-01')`? Or between `20070101` and `20141231`? – jezrael Mar 12 '16 at 15:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106101/discussion-between-michael-perdue-and-jezrael). – Collective Action Mar 12 '16 at 15:17
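
For reference, the filtering discussed above can be done with a boolean mask before the `groupby` (a sketch, assuming the inclusive range 2007-01-01 through 2014-12-31 and a `df` whose `date` column has already been converted with `to_datetime`):

# Keep only rows whose dates fall inside the expected 2007-2014 window,
# applied after to_datetime but before set_index/groupby
mask = (df['date'] >= '2007-01-01') & (df['date'] <= '2014-12-31')
df = df[mask]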