1

A fairly beginner-level pandas question here.

I have a DataFrame of transactions:

Customer  Date      Amount 
Angus  2009-07-18   $76.46 
Bruno  2009-07-21   $68.66 
Danno  2009-07-25   $73.52 
Chapp  2009-07-11   $56.04 
Chapp  2009-07-21   $11.30 
Frank  2009-07-07   $52.86 
Chapp  2009-07-09   $97.82 
Danno  2009-07-11   $84.98 
(etc. for thousands of lines)

I'd like to create four DataFrames from this data:

  1. For each customer, the customers name, how many transactions they've done, and the sum of the Amounts of these transactions
  2. For each customer, the date and amount of their most recent transaction.
  3. For each customer, the date and amount of their first transaction.
  4. For each customer, the date and amount of their largest (amount-wise) transaction.

Can you advise me on the appropriate code?

(Answers along the lines of "Why are you using DataFrames? You should be using ThnargLopes for this!" will be warmly received.)

Jonathan Harford
  • 361
  • 1
  • 3
  • 9

2 Answers2

0

I think a DataFrame is a great structure for your data. Whenever you're setting up for a "split-apply-combine" set of analysis steps, Pandas excels. You can write a function that assumes you only have one customer and returns a Series like you're looking for.

import pandas as pd
def trans_count(DF):
    return pd.Series({'count': len(DF),
                     'total': sum(DF['Amount'])})

Then use groupby and apply:

yourDF.groupby('Customer').apply(trans_count)

However, since each of your new DataFrames is a summary of a single customer, I would suggest writing one function that can return all of your desired results in a single Series.

untested from my phone!

Justin
  • 42,475
  • 9
  • 93
  • 111
0

OK, I've figured this out. First, we make a transaction field of ones to sum:

df["Trans"] = len(df)*[1]

We group by Customer:

cust_gp = df.groupby("Customer")

The the first one's easiest:

cust_gp.sum()

Four is also not hard:

cust_gp.max()

2 and 3 were tricky... I found a solution that seemed to work with my test data. Sort the data by Customer and Date, then aggregate by taking the first for each Customer:

df.sort(["Customer","Date"]).groupby("Customer").first()
df.sort(["Customer","Date"]).groupby("Customer").last()

...but when I ran it on my big data set, I was told that some of my Most Recent Transactions took place before the Last Transactions. Which makes no sense.

It turned out that the date field was being imported as text! So, complete solution:

df.Date = pd.to_datetime(df.Date)  # Date field should be date, not text
df = df.sort(["Customer","Date"]) 
cust_gp = df.groupby("Customer")

total_df     = cust_gp.sum()   # 1
largest_df   = cust_gp.max()   # 2
first_df     = cust_gp.first() # 3
last_df      = cust_gp.last()  # 4

I'm pleased with this, except for the "Gifts" column, which I'm sure isn't implemented in the most elegant way.

Jonathan Harford
  • 361
  • 1
  • 3
  • 9
  • Strangely similar to my first question on this site: http://stackoverflow.com/questions/9588331/simple-cross-tabulation-in-pandas – Jon Clements Aug 09 '13 at 15:25
  • Multiple aggregate functions! I did not know about that. Thanks. – Jonathan Harford Aug 12 '13 at 14:55
  • Yup - that was pretty much my reaction - knew it must be possible - just couldn't work my way through the myriad of options available to realise how... It's a fantastic library... so glad I found it – Jon Clements Aug 12 '13 at 15:24