
I am trying to group the values over the previous 10 dates for each row. I have a .csv file with columns Date, Item, Sales, Partner, and I need results like: Date, Item, sum(Sales) over the last 10 dates with sales of that particular item, so only sales of that item are grouped. The dates are not sequential (some dates have no sales at all, so I can't just subtract 10 days from the date). For example,

Date            ItemType    Sales        Partner
2014/01/01       A           $100          P2
2014/01/01       B           $60           P1
2014/01/04       A           $70           P2
2014/01/06       B           $80           P2 
2014/01/08       A           $40           P3
2014/01/09       B           $20           P2
...

For the above dataset, I need to get results like these:

 
Date           Item        Sales (in the last 10 dates)
2014/01/04       A          $170
2014/01/08       A          $210

I need to generate this automatically for each row in the file.
In addition, I need to get the counts for each partner (there are only 4 partners) for the last 10 dates as well:

Date             Item       P1     P2     P3    P4
2014/01/08       A          0      2      1     0

For the latter, I would probably need to group and then transpose. I guess I would need to use the pandas and/or numpy modules to do these groupings, but I am new to Python and have not been able to find an example.

5 Answers


You don't need anything as complicated as pandas or numpy - this is a fairly simple flow using, say, the csv module.

You could do something like this:

  • read the file in using csv reader, so you have data like

    [['2014/01/01', 'A', '$100', 'P2'], ['2014/01/01', 'B', '$60', 'P1'], ]

  • Sort this data keyed on the date

  • Use itertools.groupby to group by dates

  • Use slice notation (e.g. print([1,2,3,4][:2]) shows the first two elements; [-10:] would give the last 10) to grab the last 10 dates

  • Display or write to a new file using the csv module
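
Those steps can be sketched roughly like this (a minimal sketch; the rows are inlined from the question's sample instead of being read with csv.reader, and the sum here includes the current row, matching the expected output in the question):

```python
from itertools import groupby

# In practice these rows would come from csv.reader; inlined here
# from the question's sample data for illustration.
rows = [
    ['2014/01/01', 'A', '$100', 'P2'],
    ['2014/01/01', 'B', '$60', 'P1'],
    ['2014/01/04', 'A', '$70', 'P2'],
    ['2014/01/06', 'B', '$80', 'P2'],
    ['2014/01/08', 'A', '$40', 'P3'],
    ['2014/01/09', 'B', '$20', 'P2'],
]

# Sort by item then date ('YYYY/MM/DD' strings sort chronologically),
# group by item, then use slice notation [-10:] to keep at most the
# 10 most recent sales rows up to and including each row.
rows.sort(key=lambda r: (r[1], r[0]))
results = []
for item, group in groupby(rows, key=lambda r: r[1]):
    group = list(group)
    for i, row in enumerate(group):
        window = group[:i + 1][-10:]
        total = sum(int(r[2].lstrip('$')) for r in window)
        results.append((row[0], item, total))

for date, item, total in results:
    print(date, item, total)
```

For the sample data this produces 170 for item A on 2014/01/04 and 210 on 2014/01/08, matching the expected results in the question.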

If you have problems with any of those steps, come back and post a new question!

Wayne Werner
  • Thank you for your help. I do not understand how (print([1,2,3,4][:2])) grabs the last 10 dates. – user3519466 Apr 11 '14 at 16:48
  • That was an example of slice notation - you will have to adapt it to your use case. [This question](http://stackoverflow.com/q/509211/344286) should explain all you need to know. – Wayne Werner Apr 12 '14 at 16:43

Assuming that you read the csv into a list where each row's first element (row[0]) is the date field, you could use itertools this way:

from itertools import groupby

# read csv file in this list
csv_list = [
    ['2014/01/01', '100'],
    ['2014/01/01', '200'],
    ['2014/01/04', '70'],
    ['2014/01/08', '40']
]

# make sure the list is sorted by the date in order for grouping to work ok
csv_list.sort(key=lambda i: i[0])
result = [(date, sum(int(row[1]) for row in rows)) for date, rows in groupby(csv_list, key=lambda i: i[0])]

result will contain a list of (date, sum_of_sales) tuples, one per date.

This handles the grouping. If you want to parse the date into an actual datetime object and compare dates so you can print values in a specific order, use strptime.
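
For example, a quick strptime sketch for the 'YYYY/MM/DD' format used in the question:

```python
from datetime import datetime

# Parse the question's 'YYYY/MM/DD' date strings into datetime
# objects so rows can be sorted and compared chronologically.
dates = ['2014/01/08', '2014/01/01', '2014/01/04']
parsed = sorted(datetime.strptime(s, '%Y/%m/%d') for s in dates)
print(parsed[0].date())  # earliest date: 2014-01-01
```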

Bogdan Iulian Bursuc

I think I understand your question. This solution will take a date, and find the sum of the last N sales before that date. Have a look at it:

import csv
from itertools import groupby, islice, ifilter
from datetime import datetime


def sum_sales(date=None, filename='my_data.csv', n_days=10, items=None):
    if date is None:
        date = datetime.today()

    with open(filename) as ifile:
        reader = csv.reader(ifile, skipinitialspace=True, delimiter=' ')
        # Skip the header
        next(reader)

        # Convenience functions to use later on
        item_date = lambda row: (row[1], datetime.strptime(row[0], '%Y/%m/%d'))

        if items is None:
            filter_by = lambda row: datetime.strptime(row[0], '%Y/%m/%d') <= date
        else:
            filter_by = lambda row: datetime.strptime(row[0], '%Y/%m/%d') <= date \
                and row[1] in items

        # Loop over groups of data, sorted by ItemType and data, grouped by
        # ItemType and filtered by filter_by
        for item, group in groupby(sorted(ifilter(
                filter_by, reader), key=item_date), lambda row: row[1]):
            partners = {'P1': 0, 'P2': 0, 'P3': 0, 'P4': 0}
            data = islice(group, n_days)
            sales = 0
            for row in data:
                sales += int(row[2].replace('$', ''))
                partners[row[3]] += 1
            print '{}\t{}\t{}'.format(date.date(), item, sales)
            print '{}\t{}\t{P1}\t{P2}\t{P3}\t{P4}'.format(date.date(), item,
                                                          **partners)

Given that your sample input is saved in my_data.csv, this would be the output:

>>> sum_sales()
2014-04-10  A   210
2014-04-10  A   0   2   1   0
2014-04-10  B   160
2014-04-10  B   1   4   1   0
>>> sum_sales(datetime(year=2014, month=1, day=4))
2014-01-04  A   170
2014-01-04  A   0   2   0   0
2014-01-04  B   60
2014-01-04  B   1   2   0   0
>>> sum_sales(datetime(year=2014, month=1, day=8), items=['A'])
2014-01-08  A   210
2014-01-08  A   0   2   1   0

You may want to direct the results to a new csv file, but that shouldn't be a problem for you.

Steinar Lima
  • Thank you for your help. I tried it but I get a slightly different output in the form of rows grouped by today's (not the date in the first column) date and item and 10 rows with totals for each sales/partner increment. – user3519466 Apr 10 '14 at 18:49
  • Would you mind uploading you input file, the output for running the script and your expected output in three separate pastes at [pastebin.com](http://www.pastebin.com)? – Steinar Lima Apr 10 '14 at 18:51
  • and the expected output is: 01/09/2014 A 5 2 2 1 (the last 4 are the Partner counts for the last 10 sales) – user3519466 Apr 10 '14 at 21:32

Seeing as you are new to Python, I have made a solution which doesn't use third-party libraries, so you can learn the language itself. It involves first loading the table entries into a list of dictionaries whose values are strings and numbers. I will leave the file reading to you.

An empty entry from the main file would look like:

entries = [ {"Date":"", "ItemType":"", "Sales":int(), "Partner":""}, ... ]

I have made three functions to do exactly what you need.

  • filter_entries(entries, filter_key): returns a dictionary mapping each unique filter_key value to the list of entries that share it
  • expand_entries(entries, expand_key): returns an entry-list in which the expand_key field is removed and replaced by one 0/1 field per unique expand_key value
  • merge_entries(entries, merge_key): returns an entry-list in which all entries with the same merge_key value are combined into one


def filter_entries(entries, filter_key):
    unique_values = set(e[filter_key] for e in entries)
    filtered_entries = {}
    for filter_value in unique_values:
        filtered_entries[filter_value] = [e for e in entries if e[filter_key] == filter_value]
    return filtered_entries

def expand_entries(entries, expand_key):
    unique_values = set(e[expand_key] for e in entries)
    new_entries = []
    for entry in entries:
        new_entries.append({key: value for key, value in entry.items() if not key == expand_key})
        for new_key in unique_values:
            new_entries[-1][new_key] = 1 if entry[expand_key] == new_key else 0
    return new_entries

To merge entries, I assert that string values must be identical; otherwise it is an error. I.e. a proper merge looks like:

2014/01/01       A           100          
2014/01/01       A           60
---------------------------------     
2014/01/01       A           160   

and a bad case looks like:

2014/01/01       A           100          
2014/01/01       B           60
---------------------------------     
Value error A , B 


def merge_entries(entries, merge_key):
    unique_keys = set(e[merge_key] for e in entries)
    new_entries = []
    for key in unique_keys:
        new_entry = None
        for entry in [e for e in entries if e[merge_key] == key]:
            # copy the style of the first entry with that key,
            # zeroing the numeric fields so they can be accumulated
            if new_entry is None:
                new_entry = {k: v for k, v in entry.items()}
                for k, v in new_entry.items():
                    if not type(v) == str:
                        new_entry[k] = 0.0
            for k, v in entry.items():
                if type(v) == str:
                    if not new_entry[k] == v:
                        raise ValueError("Cannot merge different strings for key {}: {}, {}".format(k, v, new_entry[k]))
                else:
                    new_entry[k] += v
        new_entries.append(new_entry)
    return new_entries

Finally, with a few dict comprehensions, the top-level code is simple:

>>> entries = [ {"Date":"aaaa", "ItemType":"A", "Sales":10, "Partner":"P1"},
                {"Date":"aaaa", "ItemType":"A", "Sales":15, "Partner":"P2"},
                {"Date":"cccc", "ItemType":"A", "Sales":15, "Partner":"P2"},
                {"Date":"bbbb", "ItemType":"A", "Sales":15, "Partner":"P2"},
                {"Date":"bbbb", "ItemType":"B", "Sales":10, "Partner":"P3"},
                {"Date":"bbbb", "ItemType":"B", "Sales":15, "Partner":"P2"},
                {"Date":"cccc", "ItemType":"B", "Sales":10, "Partner":"P3"}]

>>> f_entries = filter_entries(entries, "ItemType")
>>> e_entries = {key:expand_entries(entries, "Partner") for key, entries in f_entries.items()}
>>> m_entries = {key:merge_entries(entries, "Date") for key, entries in e_entries.items()}

>>> for key in m_entries.keys():
        print key
        for entry in m_entries[key]:
            print entry  

A
{'Date': 'aaaa', 'P2': 1.0, 'P1': 1.0, 'ItemType': 'A', 'Sales': 25.0}
{'Date': 'cccc', 'P2': 1.0, 'P1': 0.0, 'ItemType': 'A', 'Sales': 15.0}
{'Date': 'bbbb', 'P2': 1.0, 'P1': 0.0, 'ItemType': 'A', 'Sales': 15.0}
B
{'Date': 'cccc', 'P2': 0.0, 'Sales': 10.0, 'ItemType': 'B', 'P3': 1.0}
{'Date': 'bbbb', 'P2': 1.0, 'Sales': 25.0, 'ItemType': 'B', 'P3': 1.0}

I'm sure with this form it won't be hard to write a table to a file!

flakes

I can't deny that it takes a bit of time to get used to pandas; what I can say is that it takes considerably less time to play around at an IPython console and find something that works than it would be to implement it from scratch.

What you're looking for is basically a groupby on the ItemType, a pivot on Partner, and then a rolling_sum. There are some slick ways to do this very concisely, but I often find it easier to understand what I'm doing if I simply break the data into groups, process those as I need to, and then reassemble the groups at the end.

Something like

import pandas as pd

df = pd.read_csv("sales.txt", delim_whitespace=True, parse_dates=[0])
df["Sales"] = df["Sales"].str.replace("$","").astype(float)

last_n_dates = 2

processed = []
grouped = df.groupby("ItemType")
for item, group in grouped:

    recent_sales = pd.rolling_sum(group["Sales"], last_n_dates, min_periods=1)

    partners = pd.crosstab(group.Date, group.Partner)
    recent_partners = pd.rolling_sum(partners, last_n_dates, min_periods=1)

    group["Sales"] = recent_sales
    del group["Partner"]
    group = group.set_index("Date")
    new_group = pd.concat([group, recent_partners], axis=1)
    processed.append(new_group)

df_final = pd.concat(processed).fillna(0)

gives me

>>> print(df_final)
           ItemType  P1  P2  P3  Sales
Date                                  
2014-01-01        A   0   1   0    100
2014-01-04        A   0   2   0    170
2014-01-08        A   0   1   1    110
2014-01-01        B   1   0   0     60
2014-01-06        B   1   1   0    140
2014-01-09        B   0   2   0    100

[6 rows x 5 columns]

Note that I deliberately set last_n_dates to 2 and not to 10, because there aren't enough values here for 10 to be interesting. However, 110 = 70+40, so that looks okay.
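
One caveat for readers on newer pandas versions: pd.rolling_sum has since been removed, and the equivalent is the .rolling() method on a Series or DataFrame. A small standalone sketch (using item A's sales from the sample, not the full data layout above):

```python
import pandas as pd

# In modern pandas, window sums are written with .rolling() rather
# than the removed pd.rolling_sum function.
s = pd.Series([100.0, 70.0, 40.0])   # item A's sales from the sample
recent = s.rolling(window=2, min_periods=1).sum()
print(recent.tolist())  # [100.0, 170.0, 110.0]
```

These match the Sales column for item A in the output above.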

DSM
  • Thank you very much for your help. Unfortunately, in the output I get dates that are not in the source file. How is that possible? Maybe the grouping should not be by ItemType only? – user3519466 Apr 19 '14 at 11:22
  • I had mixed up the Date format, so it ordered by Day of the month first and then by month. – user3519466 Apr 19 '14 at 11:39
  • It works! Thank you so much. Is there a way to take the rolling sum/count of the previous 10 items, excluding the Sales, Partner of the current row? – user3519466 Apr 19 '14 at 11:40