1

I am rather new to Python, and VERY new to Pandas (I am having a more difficult time learning Pandas than Python).

I am trying to transform a large dataset, and I am stuck.

  1. I upload data from a CSV that has the following structure.

    DATE       ID      CATEGORY 1      SUCCESS (0 or 1)
    1/1/2015   a1      x               0
    1/1/2015   a2      y               0
    1/1/2015   a3      z               0
    1/3/2015   a2      z               0
    1/5/2015   a1      x               0
    1/7/2015   a2      z               1
    1/9/2015   a3      y               0
    1/10/2015  a2      z               1
    1/11/2015  a3      y               0
    

My end goal is find a way to group this into form where I can get the series of categories leading up to a success flag for a specific ID, then an array of the time elapsed during from the previous row the same ID.

So a result would something like:

    {a2: {'1':((y,z,z),(0,2,4)), '2':((z),(0))}

I am not sure if Pandas' or NumPy's multidimensional arrays would be better suited for the task. I am also not sure what functions to play around with more in Pandas to accomplish this.

A point in the right direction would be greatly helpful.

1 Answers1

1

I do not 100% understand the question. I am unsure what the (0,2,4) means. Ok let's make a start.

This is a non pandas-esque way, what with all the dataframe looping.

I have your data in csv so load it as follows:

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
print df.head(len(df))

and looks like:

        DATE  ID CATEGORY  SUCCESS
0   1/1/2015  a1        x        0
1   1/1/2015  a2        y        0
2   1/1/2015  a3        z        0
3   1/3/2015  a2        z        0
4   1/5/2015  a1        x        0
5   1/7/2015  a2        z        1
6   1/9/2015  a3        y        0
7  1/10/2015  a2        z        1
8  1/11/2015  a3        y        0

And now the code:

uber_dict = {}
grouped =  df.groupby(['ID'])
for name, group in grouped:
    mini_dict = {}
    working_list = []
    i = 1
    for idx,row in group.iterrows():
        working_list.append((row['CATEGORY'],row['DATE']))
        if row['SUCCESS'] == 1:
            df = pd.DataFrame(working_list,columns= ['CATEGORY','DATE']) # let's use a bit more pandas.
            df.set_index('CATEGORY',inplace=True)
            df["DATE"] = pd.to_datetime(df["DATE"]) # my dates are string format so casting
            df['DATE'] = df['DATE'].diff()
            df['DATE'].fillna(0,inplace=True)
            df['DATE'] = df['DATE'].astype('timedelta64[D]').astype('int')
            mini_dict[str(i)] = tuple(zip(*df.to_records()))
            working_list=[]
            i = i +1

    if mini_dict.keys(): # see http://stackoverflow.com/questions/53513/best-way-to-check-if-a-list-is-empty
        uber_dict[name] = mini_dict

print uber_dict

produces

{'a2': {'1': (('y', 'z', 'z'), (0, 2, 4)), '2': (('z',), (0,))}}
Dickster
  • 2,969
  • 3
  • 23
  • 29
  • wow, that worked. To clarify, the second array is a time delta. I updated your code to give the day delta from the first instance of the unique ID, then to reset after a success. Thanks! the real data file is pretty huge, so I have to figure out how to make this fit into memory. But this is awesome, I now have a direction to keep learning toward. – Wilson Matos Sep 17 '15 at 16:51