
I am trying to create a dummy file to make some ML predictions afterwards. The input is about 2,000 'routes', and I want to create a dummy that contains year-month-day-hour combinations for 7 days, i.e. 168 rows per route and about 350k rows in total. The problem I am facing is that pandas becomes terribly slow at appending rows once the DataFrame reaches a certain size.

I am using the following code:

import datetime

import pandas as pd

DAYS = [0, 1, 2, 3, 4, 5, 6]
HODS = list(range(24))  # hours of the day, 0-23

ISODOW = {
    1: "monday",
    2: "tuesday",
    3: "wednesday",
    4: "thursday",
    5: "friday",
    6: "saturday",
    7: "sunday"
}

def createMyPredictionDummy(start=datetime.datetime.now(), sourceFile=(utils.mountBasePath + 'routeProperties.csv'), destFile=(utils.outputBasePath + 'ToBePredictedTTimes.csv')):
    '''Generate a dummy file that can be used for predictions'''
    # 'utils' and the 'convert' converters dict are defined elsewhere in my project
    data = ['route', 'someProperties']
    dataFile = data + ['yr', 'month', 'day', 'dow', 'hod']

    # New DataFrame with all required columns
    file = pd.DataFrame(columns=dataFile)

    # Old DataFrame that has only the target columns
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    df = df[data]

    # Counter - To avoid constant lookup for length of the DF
    ix = 0

    routes = df['route'].drop_duplicates().tolist()
    # Iterate through all routes and create a row for every route-yr-month-day-hour combination for 7 days --> about 350k rows
    for no, route in enumerate(routes):
        print('Current route is %s which is no. %g out of %g' % (str(route), no+1, len(routes)))
        routeDF = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            day = tmpDate.day
            month = tmpDate.month
            year = tmpDate.year
            dow = ISODOW[tmpDate.isoweekday()]
            for hod in HODS:
                file.loc[ix] = routeDF + [year, month, day, dow, hod] # This is becoming terribly slow
                ix += 1
    file.to_csv(destFile, index=False)
    print('Wrote file')

I think the main problem lies in appending the row with .loc[] - Is there any way to append a row more efficiently? If you have any other suggestions, I am happy to hear them all!

Thanks and best,

cabeer
  • This may be helpful for you: https://stackoverflow.com/a/48287388/1867876 – John Karasinski Jun 21 '18 at 09:06
  • Do you have some test data? – hootnot Jun 21 '18 at 09:41
  • 2
  • Thanks for the link, @JohnKarasinski! It works incredibly well - could someone provide me some insights into what this does? Since I cannot comment on other questions yet, I'll add this information here: in case you're using Python 3, use io.StringIO() instead of io.BytesIO(). – cabeer Jun 21 '18 at 09:45
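
For readers wondering what the linked answer does: instead of writing into the DataFrame row by row, it collects all rows in an in-memory CSV buffer and parses that buffer once at the end. A minimal sketch of the pattern, with made-up sample rows for illustration:

import csv
import io

import pandas as pd

# write rows into an in-memory text buffer instead of a growing DataFrame
buf = io.StringIO()  # io.BytesIO() on Python 2, as noted in the comment above
writer = csv.writer(buf)
writer.writerow(['route', 'yr', 'month', 'day', 'dow', 'hod'])
writer.writerow(['A1', 2018, 6, 21, 'thursday', 0])
writer.writerow(['A1', 2018, 6, 21, 'thursday', 1])

# rewind and parse everything in one go
buf.seek(0)
df = pd.read_csv(buf)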

2 Answers


(This is more of a long comment than an answer; sorry, but without example data I can't run much...)

Since it seems that you're adding rows one at a time sequentially (i.e. the DataFrame is indexed by integers accessed sequentially) and you always know the order of the columns, you're probably much better off creating a list of lists and then transforming it into a DataFrame. That is, define something like file_list = [] and then replace the line file.loc[ix] = ... with:

file_list.append(routeDF + [year, month, day, dow, hod])

In the end, you can then define

file = pd.DataFrame(file_list, columns=dataFile)
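
Putting it together, a minimal sketch of the rewritten loop, assuming df, routes, start, dataFile, HODS and ISODOW as defined in your question:

file_list = []
for route in routes:
    routeDF = df.loc[df['route'] == route].iloc[0].tolist()
    for i in range(7):
        tmpDate = start + datetime.timedelta(days=i)
        dow = ISODOW[tmpDate.isoweekday()]
        for hod in HODS:
            # appending to a plain Python list is cheap (amortized O(1)),
            # unlike growing a DataFrame one row at a time
            file_list.append(routeDF + [tmpDate.year, tmpDate.month, tmpDate.day, dow, hod])

file = pd.DataFrame(file_list, columns=dataFile)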

If furthermore all your data is of a fixed type (e.g. int, depending on what your routeDF contains and on not converting dow until after creating the DataFrame) you might be even better off pre-allocating a numpy array and writing into it, but I'm quite sure that adding elements to a list will not be the bottleneck of your code, so this is probably excessive optimization.
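
For completeness, a sketch of that preallocation idea; the sizes follow your 2000 routes x 168 rows, the non-numeric route columns are left out for simplicity, and ISODOW is the dict from your question:

import numpy as np

n_rows = 2000 * 7 * 24  # routes x days x hours
arr = np.empty((n_rows, 5), dtype=np.int64)  # allocated once, filled in place

ix = 0
# inside the loops from your question:
#     arr[ix] = [tmpDate.year, tmpDate.month, tmpDate.day, tmpDate.isoweekday(), hod]
#     ix += 1

file = pd.DataFrame(arr, columns=['yr', 'month', 'day', 'dow', 'hod'])
file['dow'] = file['dow'].map(ISODOW)  # map 1..7 to weekday names only at the end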

Another alternative that minimizes changes in your code: simply preallocate enough space by creating a DataFrame full of NaN instead of a DataFrame with no rows, i.e. change the definition of file to (after moving the line with drop_duplicates up):

file = pd.DataFrame(columns=dataFile, index=range(len(routes)*168))

I'm quite sure this is faster than your code, but it might still be slower than the list-of-lists approach above, since pandas won't know which data types to expect until you fill in the data (it might e.g. convert your ints to float, which is not ideal). But again, once you get rid of the continuous reallocations caused by expanding a DataFrame at each step, this will probably no longer be your bottleneck (the double loop likely will be).
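
Concretely, a minimal sketch of that variant; only the two marked lines differ from your original code, and the dtype fix at the end is an assumption about which of your columns are numeric:

routes = df['route'].drop_duplicates().tolist()  # moved up
file = pd.DataFrame(columns=dataFile, index=range(len(routes) * 168))  # preallocated

ix = 0
# the loops from your question stay the same; file.loc[ix] = ... now
# overwrites an existing NaN row instead of growing the DataFrame

# after the loops have filled every row, fix the dtypes, e.g.:
# file = file.astype({'yr': int, 'month': int, 'day': int, 'hod': int})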

Marco Spinaci
  • Thanks for the insight! I actually solved the problem using the link provided by John Karasinski, but this gave me some more insight into what the actual problem was, and I think I now understand what is wrong with my code snippet. Even so, I am still a bit terrified by how badly the initial approach scales. – cabeer Jun 21 '18 at 11:20

You create an empty DataFrame named file and then fill it by appending rows; this seems to be the problem. Instead, you could do something like:

def createMyPredictionDummy(...):
    ...
    # make the function a generator: yield one dict of attributes
    # per row from the innermost loop
    for hod in HODS:
        yield {'route': route, 'yr': year, 'month': month,
               'day': day, 'dow': dow, 'hod': hod}

# then use this to create the *file* DataFrame outside that function
newDF = pd.DataFrame([r for r in createMyPredictionDummy()])
newDF.to_csv(destFile, index=False)
print('Wrote file')
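
A fuller sketch of that generator, reusing the names from your question (the extra route property columns are omitted for brevity):

def createMyPredictionDummy(start, sourceFile):
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    for route in df['route'].drop_duplicates():
        for i in range(7):
            tmpDate = start + datetime.timedelta(days=i)
            for hod in HODS:
                yield {'route': route,
                       'yr': tmpDate.year, 'month': tmpDate.month, 'day': tmpDate.day,
                       'dow': ISODOW[tmpDate.isoweekday()],
                       'hod': hod}

newDF = pd.DataFrame([r for r in createMyPredictionDummy(datetime.datetime.now(), utils.mountBasePath + 'routeProperties.csv')])
newDF.to_csv(utils.outputBasePath + 'ToBePredictedTTimes.csv', index=False)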
hootnot