
I am creating a count function on subsets of a Pandas DataFrame and intend to export a dictionary/spreadsheet that contains only the groupby criteria and the counting results.

In [1]: df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                           ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                           ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                           ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                           ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                           ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                           ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                           ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                           ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
                          columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
Out[1]: 
   side  sender  id    type     quantity  receive_time 
0  Buy   A       123   NEW      500       20190101-09:00:00am
1  Buy   A       124   CXL      500       20190101-09:00:01am
2  Buy   A       125   NEW      500       20190101-09:00:03am
3  Buy   A       126   REPLACE  300       20190101-09:00:10am
4  Buy   B       210   NEW      1000      20190101-09:10:00am
5  Buy   B       345   NEW      200       20190101-09:00:00am
6  Sell  C       412   NEW      100       20190101-09:00:00am
7  Sell  C       413   NEW      200       20190101-09:01:00am
8  Sell  C       414   CXL      50        20190101-09:02:00am

The count function is as below (mydf is passed in as a subset of the dataframe):

def ordercount(mydf):
    num = 0.0
    if mydf.type == 'NEW':
        num = num + mydf.quantity
    elif mydf.type == 'REPLACE':
        num = mydf.quantity
    elif mydf.type == 'CXL':
        num = num - mydf.quantity
    else:
        pass
    orderdict = dict.fromkeys([mydf.side, mydf.sender, mydf.id], num)
    return orderdict

After reading the data from csv, I group it by some criteria and also sort by time:

df = pd.read_csv('xxxxxxxxx.csv', sep='|', header=0, engine='python', names=col_names)
sorted_df = df.groupby(['side', 'sender', 'id']).apply(lambda _df: _df.sort_values(by=['time']))

Then call the previously defined function on the sorted data:

print(sorted_df.agg(ordercount))

But this keeps raising a ValueError saying there are too many lines to call.

This function-based way of counting may not be efficient, but it is the most straightforward way I can think of to match order types and count quantities accordingly. I expect the program to output a table where only side, sender, id and the counted quantity are shown. Is there any way to achieve this? Thanks.

Expected output:

   side   sender   total_order_num   trade_date 
0  Buy    A        300               20190101
1  Buy    B        1200              20190101
2  Sell   C        250               20190101

1 Answer


I believe your function is not easy to apply all at once because you are doing different operations depending on the row. This would be OK if you only had + and - as your operations, but you replace the value at some point and then continue with the other operations. Because of that, a loop might just be simpler, or you can spend some time writing a nicer function to accomplish the task.

This is what I have. All I really did was change your ordercount so that it operates directly on a subset, which you can get by simply grouping. You can either sort by time before grouping or do it inside the ordercount function. Hopefully this helps a bit.

import pandas as pd
df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])

df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date  # you do not need the time stamps


def ordercount(mydf):
    mydf_ = mydf.sort_values('receive_time')[['type', 'quantity']].copy()
    num = 0
    for val in mydf_.values:
        type_, quantity = val
        # val is going to be a list like ['NEW', 500]. All I am doing above is unpack the list into two variables.
        # You can find many resources on unpacking iterables
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity
        elif type_ == 'CXL':
            num -= quantity
        else:
            pass
    return num

mydf = df.groupby(['side', 'sender', 'receive_date']).apply(ordercount).reset_index()

Output:

   side sender receive_date     0
0   Buy      A   2019-01-01   300
1   Buy      B   2019-01-01  1200
2  Sell      C   2019-01-01   250

You can easily rename the column '0' as you wish. I am still not sure how your trade_date is defined. Will your data only have one date? What happens when you have more than one date? Are you taking the min?...
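For example, with `DataFrame.rename` (the `total_order_num` and `trade_date` names come from the question's expected output; the small frame below is a hypothetical stand-in for the groupby result):

```python
import pandas as pd

# Hypothetical: a result frame shaped like the groupby output above.
mydf = pd.DataFrame({'side': ['Buy'], 'sender': ['A'],
                     'receive_date': [pd.Timestamp('2019-01-01').date()],
                     0: [300]})

out = mydf.rename(columns={0: 'total_order_num'})
# Reformat the date to match the question's expected '20190101' style.
out['trade_date'] = pd.to_datetime(out.pop('receive_date')).dt.strftime('%Y%m%d')
```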

Edit: If you try it with this dataframe, you can see the groups with the dates working as expected.

df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am'],
                   ['Buy', 'A', 123, 'NEW', 500, '20190102-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190102-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190102-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190102-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190102-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190102-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190102-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190102-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190102-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
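
As a side note, the per-row loop can also be avoided. The sketch below (not part of the original answer; `ordercount_vec` is a hypothetical name) relies on the same rule: a REPLACE resets the running total, so the total is the signed sum of quantities starting from the last REPLACE, if there is one.

```python
import numpy as np
import pandas as pd

def ordercount_vec(g):
    # Same rule as the loop, without iterating row by row: NEW adds,
    # CXL subtracts, and a REPLACE resets the running total, so only
    # rows from the last REPLACE onward contribute.
    g = g.sort_values('receive_time')
    sign = g['type'].map({'NEW': 1, 'REPLACE': 1, 'CXL': -1})  # assumes only these types occur
    signed = (sign * g['quantity']).to_numpy()
    rep = np.flatnonzero((g['type'] == 'REPLACE').to_numpy())
    start = rep[-1] if len(rep) else 0
    return int(signed[start:].sum())
```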
Buckeye14Guy
  • I am running the program on many files. One file for each trade date and has many lines with different trade time within the day. – Siyu Zhuang Jul 16 '19 at 12:32
  • I want to see the totals for each day. So I use sort by time value for all files and then apply the same function. But it doesn't seem to work and I keep getting reindex value error – Siyu Zhuang Jul 16 '19 at 12:33
  • You should actually create an additional column with just the date. I will edit my example – Buckeye14Guy Jul 16 '19 at 13:30
  • I did create an additional column with just the date. I actually use the same code: df['receive_date'] = df['receive_time'].dt.date, and put it as part of the counting function to reformat the output. Any idea when I get the reindex error when I tried to call the function on a list of data files? – Siyu Zhuang Jul 16 '19 at 15:28
  • Can you add the new code in an edit to the original problem? – Buckeye14Guy Jul 16 '19 at 15:32
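
Regarding the multiple-files question in the comments, one way to combine per-day files before grouping is sketched below. The file names, temp directory, and headerless CSV layout are assumptions for illustration; the key point is `ignore_index=True` in `pd.concat`, since duplicate row labels from concatenated files are a common source of reindex errors.

```python
import glob
import os
import tempfile
import pandas as pd

# Hypothetical setup: one pipe-delimited, headerless CSV per trade date.
tmpdir = tempfile.mkdtemp()
files = {
    'orders_20190101.csv': "Buy|A|123|NEW|500|20190101-09:00:00am\n"
                           "Buy|A|126|REPLACE|300|20190101-09:00:10am\n",
    'orders_20190102.csv': "Sell|C|412|NEW|100|20190102-09:00:00am\n"
                           "Sell|C|414|CXL|50|20190102-09:02:00am\n",
}
for name, text in files.items():
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(text)

col_names = ['side', 'sender', 'id', 'type', 'quantity', 'receive_time']

def ordercount(g):
    # Same loop as in the answer: NEW adds, REPLACE resets, CXL subtracts.
    num = 0
    for type_, quantity in g.sort_values('receive_time')[['type', 'quantity']].values:
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity
        elif type_ == 'CXL':
            num -= quantity
    return num

# Concatenate all daily files; ignore_index=True avoids duplicate row labels.
frames = [pd.read_csv(p, sep='|', header=None, names=col_names)
          for p in sorted(glob.glob(os.path.join(tmpdir, 'orders_*.csv')))]
df = pd.concat(frames, ignore_index=True)

df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date
result = (df.groupby(['side', 'sender', 'receive_date'])
            .apply(ordercount)
            .reset_index(name='total_order_num'))
```

With the date in its own column, a single groupby over the concatenated frame produces one total per side, sender, and day.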