
I am relatively new to the world of Python and am trying to use it as a backup platform for data analysis. I generally use data.table for my data analysis needs.

The issue is that when I run a group-and-aggregate operation on a big CSV file (randomized, zipped, and uploaded at http://www.filedropper.com/ddataredact_1), Python throws:

return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed

or, on other runs, I have even encountered:

File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in for i, x in enumerate(X)] File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat return _wrapfunc(a, 'repeat', repeats, axis=axis) File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc return getattr(obj, method)(*args, **kwds) MemoryError

I have spent three days trying to get this to work: I reduced the file size by 89%, added breakpoints, and debugged it, but I was not able to make any progress.

Surprisingly, when I ran the same group/aggregate operation with data.table in R, it took barely a second. Moreover, I didn't have to do any of the data type conversions suggested at https://www.dataquest.io/blog/pandas-big-data/.

I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame, Pandas: df.groupby() is too slow for big data set. Any alternatives methods?, and pandas groupby with sum() on large csv file?. It seems those threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as a duplicate.
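For completeness, the chunked approach those threads suggest would look roughly like this rough, untested sketch (finaldatapath and group_column_list are the names from my code further down; the chunk size is arbitrary, and only the summed columns are shown, since the means would also need per-chunk counts):

import pandas as pd

# Rough sketch of the chunked-aggregation idea from the linked threads (illustrative only).
sum_cols = ['ANDroid_Margin', 'Margins', 'ANDroid_App_Qty', 'Apple_Margin', 'P_Lic']
partials = []
for chunk in pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                         encoding="ISO-8859-1", chunksize=500000):
    # aggregate within each chunk first
    partials.append(chunk.groupby(group_column_list, as_index=False)[sum_cols].sum())
# then aggregate the per-chunk partial results
summed = pd.concat(partials).groupby(group_column_list, as_index=False)[sum_cols].sum()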

Here's my Python code:

import pandas as pd

finaldatapath = "..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv", low_memory=False, encoding="ISO-8859-1")

#before optimization: 353MB
ddata.info(memory_usage="deep")

#optimize file: Object-types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()

#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]  # drop the original object column
    ddata.loc[:, col] = ddata_obj[col].astype('category')  # re-add it as a category column

#release memory
del ddata_obj

#after optimization: 39MB
ddata.info(memory_usage="deep")
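As an aside, the same category conversion can be requested directly in read_csv via its dtype argument, which avoids holding the object-typed and category-typed copies in memory at the same time. A minimal sketch; the column subset here is just illustrative:

#Sketch: declare categorical dtypes up front instead of converting afterwards
#(illustrative subset of columns).
cat_cols = ["Business", "Device_Family", "Geo", "Segment", "Cust_Name"]
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    encoding="ISO-8859-1",
                    dtype={col: "category" for col in cat_cols})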


#Create a list of grouping variables:
group_column_list = [
    "Business",
    "Device_Family",
    "Geo",
    "Segment",
    "Cust_Name",
    "GID",
    "Device ID",
    "Seller",
    "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type",
    "Type",
    "Term",
    "Cust_ANDroid_Margin_Bucket",
    "Cust_Mobile_Margin_Bucket",
    # "Cust_Android_App_Bucket",
    "ANDroind_App_Cust_Y_N"
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)

ddata = ddata.reset_index(drop=True)

ddata = ddata.groupby(group_column_list).apply(ddata_agg)

The code crashes in the .groupby operation above.

Can someone please help me? Of all my StackOverflow posts, this is probably the one I have spent the most time on, trying to fix it and to learn new things about Python. However, I have reached saturation, and it frustrates me even more that R's data.table package processes this file in under 2 seconds. This post is not about the pros and cons of R and Python, but about being more productive with Python.

I am completely lost, and I'd appreciate any help.


Here's my data.table R code:

path_r = "../ddata_redact.csv"
ddata<-data.table::fread(path_r,stringsAsFactors=FALSE,data.table = TRUE, header = TRUE)

group_column_list <-c(
  "Business",
  "Device_Family",
  "Geo",
  "Segment",
  "Cust_Name",
  "GID",
  "Device ID",
  "Seller",
  "C9Phone_Margins_Flag",
  "C9Phone_Cust_Y_N",
  "ANDroid_Lic_Type",
  "Type",
  "Term",
  'Cust_ANDroid_Margin_Bucket',
  'Cust_Mobile_Margin_Bucket',
  # #                'Cust_Android_App_Bucket',
  'ANDroind_App_Cust_Y_N'
  )

ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]

I have a 4-core, 16 GB RAM, Win10 x64 machine. I can provide any details the experts need.

Adding to josemz's comment, here are two threads on agg vs. apply: What is the difference between pandas agg and apply function? and Pandas difference between apply() and aggregate() functions.

watchtower
  • If you're tied to Python, perhaps consider using `pydatatable` instead of `pandas`: https://github.com/h2oai/datatable. In benchmarks, `pandas` struggles with some memory issues that `data.table`, `pydatatable`, and `spark` have no trouble with: https://h2oai.github.io/db-benchmark/ – MichaelChirico Oct 17 '18 at 08:15
  • @MichaelChirico: Thank you. I see that `pydatatable` is not included in the Anaconda distribution, and I have seen a lot of people run into problems when they `pip install` non-Anaconda packages. I am not familiar with `spark`. I am surprised that `pandas` has such glaring issues; I hope McKinney fixes them. `data.table` is just awesome, and my respect for the `data.table` team has increased manyfold. I am not disparaging the `pandas` team, since I think they have done a great job by providing this for free, but these are critical failures. – watchtower Oct 17 '18 at 16:13
  • Adding to Josemz's comment, here are two threads on `agg` vs. `apply`: https://stackoverflow.com/questions/21828398/what-is-the-difference-between-pandas-agg-and-apply-function and https://stackoverflow.com/questions/44864655/pandas-difference-between-apply-and-aggregate-functions – watchtower Oct 25 '18 at 05:39

1 Answer


I think what you're looking for is agg instead of apply. You can pass a dict mapping columns to the functions you want to apply, so I think this would work for you:

ddata = ddata.groupby(group_column_list).agg({
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean'})
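
Note that pandas' 'sum' and 'mean' skip NaN by default, which should line up with the na.rm = TRUE calls in your R code. The grouping columns end up in the index of the result, so if you want them back as regular columns (roughly matching the data.table output), a reset_index() afterwards should do it:

# optional: bring the group keys back as ordinary columns
ddata = ddata.reset_index()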
josemz
  • Thank you so much. I am curious: do you think there is a performance bottleneck when using `agg` vs. `apply`? It sounds as if performance is an issue with `apply`... – watchtower Oct 25 '18 at 05:41
  • `apply` maps the function you give to it to every element of your column, preserving the size. So if you, for example, apply sqrt, it will calculate the sqrt of every element. I think what your code was doing was filling every row with the aggregation, instead of summarizing. – josemz Oct 25 '18 at 21:50
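
To make that contrast concrete, here is a tiny sketch on made-up toy data (the toy frame and column names are hypothetical, not from the question's dataset):

import pandas as pd

# Toy frame to contrast groupby().agg() with groupby().apply().
toy = pd.DataFrame({"grp": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

# agg reduces each group to a single row per aggregated column.
print(toy.groupby("grp").agg({"val": "sum"}))

# apply runs an arbitrary function once per group; returning a Series per group
# gives the same one-row-per-group result, but without the optimized aggregation path.
print(toy.groupby("grp").apply(lambda g: pd.Series({"val": g["val"].sum()})))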