4

I've got a Pandas dataframe (attendance_records). It contains the attendance records from a local school.

Each row has 4 columns: 1) the name of the student who was absent (name), 2) the address of the student (address), 3) the first day of absence (start), and 4) the last day of absence (end).

For example, one row could be:

Bobby   101 1st Street  9/1/2014     9/3/2014

This means that Bobby was absent each day between 9/1 and 9/3 (inclusive).

I want to unfold the table as follows:

Bobby 101 1st Street 9/1/2014
Bobby 101 1st Street 9/2/2014
Bobby 101 1st Street 9/3/2014

So far I have some code (below) that will do the job, but it's realllly slow on large tables, because it's basically going row by row through the table. Any ideas about how to make things faster?

import pandas as pd
def full_data(dataframe):
    allframe = pd.DataFrame()
    for i in dataframe.index:
        newframe = pd.DataFrame()
        newframe['dates'] = pd.date_range(dataframe.iloc[i].start, dataframe.iloc[i].end, freq = 'D')
        newframe['name'] = dataframe.iloc[i]['name']
        newframe['address'] = dataframe.iloc[i]['address']            
        allframe = allframe.append(newframe)
        if i%1000 == 0:
            print i
    return allframe

attendance_records = full_data(attendance_records)
monkeybiz7
  • 4,898
  • 5
  • 21
  • 35

2 Answers2

2

The pandas function append can be slow for larger dataframes. Instead, I would recommend storing the newframes in a python list and than using the concat function which appends all the frames only once.

import pandas as pd
def full_data(dataframe):
    allframes = []
    for i in dataframe.index:
        newframe = pd.DataFrame()
        newframe['dates'] = pd.date_range(dataframe.iloc[i].start, dataframe.iloc[i].end, freq = 'D')
        newframe['name'] = dataframe.iloc[i]['name']
        newframe['address'] = dataframe.iloc[i]['address']            
        allframes.append(newframe)
    return concat(allframes)

Note, this is as yet untested.

rhaskett
  • 1,864
  • 3
  • 29
  • 48
  • Thanks, that was helpful. I guess there isn't a built in function in Pandas to achieve this result? – monkeybiz7 Sep 17 '14 at 20:00
  • There maybe actually something a little cleaner. The solution above is just a general solution for quickly merging a bunch of dataframes. Your particular problem might be more cleanly (and quickly?) solved with a cross-product solution (http://stackoverflow.com/questions/13269890/cartesian-product-in-pandas), but I'd have to think more about how something like this would apply. – rhaskett Sep 17 '14 at 21:00
0

Perform date arithmetic on the dates, you don't need to fill in the missing days. Then groupby the name and sum the number of days.

from datetime import timedelta

data = pd.read_csv(StringIO('''Bobby,   101 1st Street,  9/1/2014,     9/3/2014'''), 
                   names=['Name', 'Address', 'Start', 'End'], parse_dates=[2, 3])
#add a day to get the number of days inclusive
data["Days"] = (data.End - data.Start) + timedelta(days=1)
data.groupby('name').sum()["Days"]
b10n
  • 1,166
  • 9
  • 8
  • Thanks for your input-- I learned a couple of pandas tricks from you. However, I'm interested in figuring out the number of students absent on each day, not the total number of absences per student. Is there a smart way to do this? – monkeybiz7 Sep 18 '14 at 02:05