
I'm looking for a faster approach to the following problem: a DataFrame has two columns with a few NaN values in them, and the challenge is to replace these NaNs with values from a secondary DataFrame.

Below I'll share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame, and some of its columns have rows with NaN values:

[screenshot: merged_df, with NaN values in several columns, including day_of_week and holiday_flg]

As you can see from the image above, columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values of these columns by looking into a second DataFrame called date_info_df, which looks like this:

[screenshot: date_info_df, with columns calendar_date, day_of_week and holiday_flg]

By using the values from column visit_date in merged_df, it is possible to search the second DataFrame on calendar_date and find equivalent matches. This makes it possible to retrieve the values for day_of_week and holiday_flg from the second DataFrame.

The end result for this exercise is a DataFrame that looks like this:

[screenshot: merged_df after the NaN values in day_of_week and holiday_flg have been filled]

You'll notice the approach I'm using relies on apply() to execute a custom function on every row of merged_df:

  • For every row, search for NaN values in day_of_week and holiday_flg;
  • When a NaN is found in either or both of these columns, use the date from that row's visit_date to find an equivalent match in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
  • After a successful match, the value from date_info_df['day_of_week'] must be copied into merged_df['day_of_week'], and the value from date_info_df['holiday_flg'] must be copied into merged_df['holiday_flg'].

Here is the working source code:

import math
import pandas as pd
import numpy as np
from IPython.display import display

### Data for df
data = { 'air_store_id':     [              'air_a1',     'air_a2',     'air_a3',     'air_a4' ], 
         'area_name':        [               'Tokyo',       np.nan,       np.nan,       np.nan ], 
         'genre_name':       [            'Japanese',       np.nan,       np.nan,       np.nan ], 
         'hpg_store_id':     [              'hpg_h1',       np.nan,       np.nan,       np.nan ],          
         'latitude':         [                  1234,       np.nan,       np.nan,       np.nan ], 
         'longitude':        [                  5678,       np.nan,       np.nan,       np.nan ],         
         'reserve_datetime': [ '2017-04-22 11:00:00',       np.nan,       np.nan,       np.nan ], 
         'reserve_visitors': [                    25,           35,           45,       np.nan ], 
         'visit_datetime':   [ '2017-05-23 12:00:00',       np.nan,       np.nan,       np.nan ], 
         'visit_date':       [ '2017-05-23'         , '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week':      [             'Tuesday',  'Wednesday',       np.nan,       np.nan ],
         'holiday_flg':      [                     0,       np.nan,       np.nan,       np.nan ]
       }

merged_df = pd.DataFrame(data)
display(merged_df)

### Data for date_info_df
data = { 'calendar_date':     [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ], 
         'day_of_week':       [    'Tuesday',  'Wednesday',   'Thursday',     'Friday',   'Saturday',     'Sunday' ], 
         'holiday_flg':       [            0,            0,            0,            0,            1,            1 ]         
       }

date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 
display(date_info_df)

# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']   
    holiday = row['holiday_flg']

    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']                               
        #print('  --> weekday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        weekday = date_info_df.at[idx,'day_of_week']
        #print('  --> weekday search_date=', search_date, 'is', weekday)        
        row['day_of_week'] = weekday        

    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']                               
        #print('  --> holiday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        holiday = date_info_df.at[idx,'holiday_flg']
        #print('  --> holiday search_date=', search_date, 'is', holiday)        
        row['holiday_flg'] = int(holiday)

    return row


# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1) 

# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)

display(merged_df)

I did a few measurements so you can understand the struggle:

  • On a DataFrame with 6 rows, apply() takes 3.01 ms;
  • On a DataFrame with ~250000 rows, apply() takes 2min 51s;
  • On a DataFrame with ~1215000 rows, apply() takes 4min 2s.

How do I improve the performance of this task?

karlphillip
  • Here is the large [merged_df dataset](https://down.uploadfiles.io/get/o460w) and the large [date_info_df dataset](https://ufile.io/tp1ak) if you would like to test the robustness of your solution. – karlphillip Mar 02 '18 at 05:08

3 Answers


You can use an index to speed up the lookup and combine_first() to fill the NaN values:

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))

print(merged_df[cols])

the result:

  day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
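
For readers wondering how combine_first works (jpp asks in the comments below): it fills NaN values in the calling DataFrame with values from the argument, aligning rows on index and columns on name, which is why the lookup result above is re-indexed to merged_df.index before combining. A minimal sketch of the behavior, added here for illustration and not part of the original answer:

import pandas as pd
import numpy as np

a = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})
b = pd.DataFrame({"x": [9.0, 2.0], "y": [3.0, 9.0]})

# NaNs in `a` are filled from `b`; existing values in `a` win
print(a.combine_first(b))
#      x    y
# 0  1.0  3.0
# 1  2.0  4.0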
HYRY
  • This is better than my solution +1. Would you mind adding an explanation on how `combine_first` works / can be used? – jpp Mar 02 '18 at 03:32
  • I will run this on a larger dataset and come back later with the runtime stats. – karlphillip Mar 02 '18 at 03:43
  • I was able to run your code with the large dataset and it executed @ 456 ms. This is the fastest approach so far, congratulations! And thank you! – karlphillip Mar 02 '18 at 16:11

This is one solution. It should be efficient as there is no explicit merge or apply.

merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date']) 
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 

s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']

merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))

Result

  air_store_id area_name day_of_week genre_name  holiday_flg hpg_store_id  \
0       air_a1     Tokyo     Tuesday   Japanese          0.0       hpg_h1   
1       air_a2       NaN   Wednesday        NaN          0.0          NaN   
2       air_a3       NaN    Thursday        NaN          0.0          NaN   
3       air_a4       NaN    Saturday        NaN          1.0          NaN   

   latitude  longitude     reserve_datetime  reserve_visitors visit_date  \
0    1234.0     5678.0  2017-04-22 11:00:00              25.0 2017-05-23   
1       NaN        NaN                  NaN              35.0 2017-05-24   
2       NaN        NaN                  NaN              45.0 2017-05-25   
3       NaN        NaN                  NaN               NaN 2017-05-27   

        visit_datetime  
0  2017-05-23 12:00:00  
1                  NaN  
2                  NaN  
3                  NaN  

Explanation

  • s is a pd.Series mapping calendar_date to day_of_week from date_info_df (illustrated below); t similarly maps day_of_week to holiday_flg.
  • Use pd.Series.map, which takes a pd.Series as input, to update missing values where possible. Note that day_of_week must be filled first, because the second fillna looks holiday_flg up via the now-complete day_of_week column.
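
To make the mapping concrete (an illustration added here, not part of the original answer), s for the sample data is a Series whose index holds the dates to look up:

print(s)
# calendar_date
# 2017-05-23      Tuesday
# 2017-05-24    Wednesday
# 2017-05-25     Thursday
# 2017-05-26       Friday
# 2017-05-27     Saturday
# 2017-05-28       Sunday
# Name: day_of_week, dtype: object

merged_df['visit_date'].map(s) then looks each visit date up in this index, returning NaN for any date that is missing from it.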
jpp
  • @karlphillip, I've added some explanation. – jpp Mar 02 '18 at 03:15
  • Now it seems legit! Good Job! I will run this on a larger dataset and come back later with the runtime stats. – karlphillip Mar 02 '18 at 03:29
  • I tried running your code on the [larger dataset](https://uploadfiles.io/) but it threw an error. You don't need to take a look into this, but your solution would be more robust if you did. – karlphillip Mar 02 '18 at 04:09

Edit: one can also use merge to solve the problem; it is about 10 times faster than the old approach. (Make sure "visit_date" and "calendar_date" are of the same format.)

# no need to `set_index` on date_info_df; just select the columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], 
                left_on="visit_date", 
                right_on="calendar_date", 
                how="left") # outer should also work

The desired result will now be in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and in the map approach, we don't use the old "day_of_week" and "holiday_flg" at all; we just map the results from date_info_df onto merged_df.

merge can also do the job because date_info_df's entries are unique, so no duplicate rows will be created.
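
If you want the filled values back under the original column names, one possible follow-up (my sketch, not part of the original answer; it assumes both date columns are already datetime) is to fill the gaps from the _y columns and drop the helpers:

out = merged_df.merge(
    date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
    left_on="visit_date", right_on="calendar_date",
    how="left", suffixes=("", "_y"))   # keep the left column names unsuffixed

# keep existing values; only fill the NaN gaps from the lookup columns
out["day_of_week"] = out["day_of_week"].fillna(out["day_of_week_y"])
out["holiday_flg"] = out["holiday_flg"].fillna(out["holiday_flg_y"])
out = out.drop(columns=["calendar_date", "day_of_week_y", "holiday_flg_y"])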


You can also try using pandas.Series.map. What it does is

Map values of Series using input correspondence (which can be a dict, Series, or function)

# set "calendar_date" as the index such that 
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")

# this line is optional (depending on the layout of the data)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])

Note that merged_df.visit_date was originally of string type. Thus, we use

merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

to make it datetime.

Timings, using the date_info_df and merged_df datasets provided by karlphillip:

date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")   
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)

# merge method I propose at the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))    
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can see that HYRY's method runs 3 times faster when the result is assigned back to merged_df. This is why I thought HYRY's method was faster than mine at first glance. I suspect that is because of the nature of combine_first: its speed likely depends on how many NaN values remain in merged_df. Once the results have been assigned back, the columns are fully populated, so rerunning the same line is faster.

The performance of the merge and combine_first methods is nearly equivalent. There may be circumstances in which one is faster than the other; each user should run some tests on their own datasets.

Another thing to note about the two methods: merge assumes that every date in merged_df is contained in date_info_df. If some dates appear in merged_df but not in date_info_df, the merge returns NaN for them, and that NaN can overwrite parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria.
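
A toy illustration of that caveat (my sketch, not from the linked discussion): suppose one visit date has no row in the lookup table, but merged_df already knows its day_of_week:

import pandas as pd
import numpy as np

left = pd.DataFrame({"visit_date":  ["2017-05-23", "2017-05-29"],
                     "day_of_week": [np.nan, "Monday"]})        # "Monday" already known
lookup = pd.DataFrame({"calendar_date": ["2017-05-23"],
                       "day_of_week":   ["Tuesday"]})           # no row for 2017-05-29

m = left.merge(lookup, left_on="visit_date", right_on="calendar_date",
               how="left", suffixes=("", "_y"))

# taking the merged column blindly loses the existing "Monday":
print(m["day_of_week_y"].tolist())                     # ['Tuesday', nan]

# combine_first keeps existing values and only fills the gaps:
print(m["day_of_week"].combine_first(m["day_of_week_y"]).tolist())
# ['Tuesday', 'Monday']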

Tai
  • I'll run this on a larger dataset and come back later with the runtime stats. – karlphillip Mar 02 '18 at 03:52
  • @karlphillip did you add this line `merged_df.visit_date = pd.to_datetime(merged_df.visit_date)`? See my explanation above. To match the type/value of `merged_df.visit_date` and `mapping.index`, we need this line. – Tai Mar 02 '18 at 04:18
  • @karlphillip can you check the type of `date_info_df["calendar_date"]`? In your example code, you make it a datetime object. You have to make it consistent between these two DataFrames for things to work: either both strings or both datetime objects. – Tai Mar 02 '18 at 04:24
  • The answer I get from `date_info_df.dtypes` on `calendar_date` is `object` – karlphillip Mar 02 '18 at 04:27
  • @karlphillip then remove my second line and it should work. Your `calendar_date` is a string. – Tai Mar 02 '18 at 04:27
  • @karlphillip are you sure that `date_info_df` contains all the dates that are in `merged_df`? Also, I don't have access to the dataset and cannot see what's going on. – Tai Mar 02 '18 at 04:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166071/discussion-between-karlphillip-and-tai). – karlphillip Mar 02 '18 at 04:59
  • After all your help, I was able to run your code with the large dataset and it executed @ **5.86 s**, which is pretty incredible. Thank you! – karlphillip Mar 02 '18 at 06:33
  • @karlphillip final thoughts. I think this problem can also be solved using `merge`. The information is self-contained and unique in `date_info_df`. Your large data set takes 0.7 s to run the new code on my computer. – Tai Mar 02 '18 at 07:38
  • After testing all solutions, only yours and @HYRY's were robust enough to work with the large datasets. Thank you! As his approach ended up rendering faster results, I decided to *accept* his answer to be consistent with the question asked. I am very grateful for all your help! – karlphillip Mar 02 '18 at 16:16
  • @karlphillip I am fine with whichever answer is accepted :P Your result, however, is different from my previous testing. At first, I thought HYRY's method was faster, but then I realized that I did something wrong while testing. My test code and hypothesis have been added to the answer. You can take a look at it if interested. – Tai Mar 02 '18 at 20:29
  • I appreciate your investigation on the performance of both answers. I ran it again several times and the results I got are the following: your code stays near `671 ms ± 8.5 ms per loop` and HYRY's timings are very close to each other: the first stays near `192 ms ± 1.42 ms per loop` and the second at `220 ms ± 3.94 ms per loop`. Your answer deserves a lot more upvotes than it has right now. – karlphillip Mar 03 '18 at 13:36
  • @karlphillip Got you. I don't know what makes the difference between our timings but I appreciate you looking at it again!! – Tai Mar 03 '18 at 17:21