
I have a dataframe c with lots of different columns. Also, arr is a dataframe that corresponds to a subset of c: arr = c[c['A_D'] == 'A'].

The main idea of my code is to iterate over all rows in the c-dataframe and search for all the possible cases (in the arr dataframe) where some specific conditions should happen:

  • It is only necessary to iterate over rows where c['A_D'] == 'D' and c['Already_linked'] == 0
  • The hour in the arr dataframe must be less than the hour_aux in the c dataframe
  • The column Already_linked of the arr dataframe must be zero: arr.Already_linked == 0
  • The Terminal and the Operator need to be the same in the c and arr dataframes

Right now, the conditions are stored using both Boolean indexing and groupby get_group:

  • Groupby the arr dataframe in order to choose the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal))
  • Choose only the arrivals where the hour is smaller than the hour in the c dataframe and where Already_linked==0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]

For each row in the c dataframe that satisfies all conditions, a vb dataframe is created. Naturally, this dataframe has a different length in each iteration. After creating the vb dataframe, my goal is to choose the index of the vb dataframe that minimises the time between vb.START and the x column of the current c row (row.x). The FlightID that corresponds to this index is then stored in the c dataframe in column a. Additionally, since the arrival was linked to a departure, the column Already_linked in the arr dataframe is changed from 0 to 1.

It is important to notice that the column Already_linked of the arr dataframe may change in every iteration (and arr.Already_linked == 0 is one of the conditions to create the vb dataframe). Therefore, it is not possible to parallelize this code.

I have already used c.itertuples() for efficiency; however, since c has millions of rows, this code is still too time consuming.

Another option would be to use DataFrame.apply on every row. Nonetheless, this is not straightforward, since in each iteration values change in both c and arr (and I believe that even with apply it would still be extremely slow).

Is there any way to convert this for loop into a vectorized solution, or at least to decrease the running time by 10x (ideally even more)?

Initial dataframe:

START     END       A_D     Operator     FlightID    Terminal   TROUND_ID   tot
0   2017-03-26 16:55:00 2017-10-28 16:55:00 A   QR  QR001   4   QR002       70
1   2017-03-26 09:30:00 2017-06-11 09:30:00 D   DL  DL001   3   "        "  84
2   2017-03-27 09:30:00 2017-10-28 09:30:00 D   DL  DL001   3   "        "  78
3   2017-10-08 15:15:00 2017-10-22 15:15:00 D   VS  VS001   3   "        "  45
4   2017-03-26 06:50:00 2017-06-11 06:50:00 A   DL  DL401   3   "        "  9
5   2017-03-27 06:50:00 2017-10-28 06:50:00 A   DL  DL401   3   "        "  19
6   2017-03-29 06:50:00 2017-04-19 06:50:00 A   DL  DL401   3   "        "  3
7   2017-05-03 06:50:00 2017-10-25 06:50:00 A   DL  DL401   3   "        "  32
8   2017-06-25 06:50:00 2017-10-22 06:50:00 A   DL  DL401   3   "        "  95
9   2017-03-26 07:45:00 2017-10-28 07:45:00 A   DL  DL402   3   "        "  58

Desired output (some columns are excluded in the dataframe below; only the a and Already_linked columns are relevant):

    START                    END             A_D  Operator  a   Already_linked
0   2017-03-26 16:55:00 2017-10-28 16:55:00 A   QR  0               1
1   2017-03-26 09:30:00 2017-06-11 09:30:00 D   DL  DL402           1
2   2017-03-27 09:30:00 2017-10-28 09:30:00 D   DL  DL401           1
3   2017-10-08 15:15:00 2017-10-22 15:15:00 D   VS  No_link_found   0
4   2017-03-26 06:50:00 2017-06-11 06:50:00 A   DL  0               0
5   2017-03-27 06:50:00 2017-10-28 06:50:00 A   DL  0               1
6   2017-03-29 06:50:00 2017-04-19 06:50:00 A   DL  0               0
7   2017-05-03 06:50:00 2017-10-25 06:50:00 A   DL  0               0
8   2017-06-25 06:50:00 2017-10-22 06:50:00 A   DL  0               0
9   2017-03-26 07:45:00 2017-10-28 07:45:00 A   DL  0               1

Code:

groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
    except (KeyError, ValueError):
        # no group for this (Operator, Terminal), or vb is empty; leave the row unlinked
        continue

c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.loc[arr.index, 'Already_linked'] = arr['Already_linked']  # copy the arrivals' flags back without chained assignment
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'), 'No_link_found', c['a'])

Code for the initial c dataframe:

import numpy as np
import pandas as pd
import io

s = '''
 A_D     Operator     FlightID    Terminal   TROUND_ID   tot
 A   QR  QR001   4   QR002       70
 D   DL  DL001   3   "        "  84
 D   DL  DL001   3   "        "  78
 D   VS  VS001   3   "        "  45
 A   DL  DL401   3   "        "  9
 A   DL  DL401   3   "        "  19
 A   DL  DL401   3   "        "  3
 A   DL  DL401   3   "        "  32
 A   DL  DL401   3   "        "  95
 A   DL  DL402   3   "        "  58
'''

data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)

d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
           '2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
           '2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
           '2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00', 
           '2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}    

aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)

c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'), format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START -  pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != "        ", 1 ,0)

arr = c[c['A_D'] == 'A']
Miguel Lambelho
  • Possible duplicate of [What is the fastest/most efficient way to loop through a large collection of files and save a plot of the data?](https://stackoverflow.com/questions/23572714/what-is-the-fastest-most-efficient-way-to-loop-through-a-large-collection-of-fil) – Adam Ostrožlík Sep 12 '18 at 10:55

4 Answers


Your question asked whether there is a way to vectorize the for loop, but I think that question hides what you really want, which is an easy way to speed your code up. For performance questions, a good starting point is always profiling. However, I have a strong suspicion that the dominant operation in your code is .query(row.query_string). Running that for every row is expensive if arr is large.

For arbitrary queries, that runtime can't really be improved at all without removing dependencies between iterations and parallelizing the expensive step. You might be a bit luckier though. Your query string always checks two different columns to see if they're equal to something you care about. However, for each row that requires going through your entire slice of arr. Since the slice changes each time, that could cause problems, but here are some ideas:

  • Since you're slicing arr each time anyway, maintain a view of just the arr.Already_linked == 0 rows so you're iterating over a smaller object.
  • Better yet, before you do any looping you should first group arr by Terminal and Operator. Then, instead of running through all of arr, first select the group you want and then do your slicing and filtering. This would require rethinking the exact implementation of query_string a little bit, but the advantage is that if you have a lot of terminals and operators you'll typically be working over a much smaller object than arr. Moreover, you wouldn't even have to query that object since that was implicitly done by the groupby.
  • Depending on how arr.hour typically relates to row.hour_aux, you might see improvements by sorting arr at the beginning with respect to hour. Just using the inequality operator you probably wouldn't see any gains, but you could pair that with a logarithmic search for the cutoff point and then just slice up to that cutoff point (see the sketch after this list).
  • And so on. Again, I suspect any way of restructuring the query you're doing on all of arr for every row will offer substantially more gains than just switching frameworks or vectorizing bits and pieces.
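
To make the sorting idea concrete, here is a minimal, self-contained sketch (toy data and column names chosen to mirror the question; an illustration of the technique, not the full solution). It keeps hours as integer minutes since midnight, an assumption made for the sketch so that np.searchsorted works on a plain integer array:

import numpy as np
import pandas as pd

# Toy group of arrivals; 'hour_min' is minutes since midnight (an assumption
# for this sketch; the real code uses datetime.time values).
g = pd.DataFrame({
    'hour_min': [6 * 60 + 50, 7 * 60 + 45, 9 * 60 + 30],
    'FlightID': ['DL401', 'DL402', 'DL001'],
}).sort_values('hour_min').reset_index(drop=True)  # sort once, before the loop

hour_aux_min = 9 * 60 + 15  # the departure's cutoff (09:15) for one iteration

# Logarithmic search for the cutoff instead of scanning g.hour < row.hour_aux:
cut = np.searchsorted(g['hour_min'].to_numpy(), hour_aux_min, side='left')
vb = g.iloc[:cut]  # all arrivals strictly earlier than the cutoff
print(vb)          # DL401 and DL402 qualify; DL001 (09:30) does not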

Expanding on some of those points a little bit and adapting @DJK's code a bit, look at what happens when we have the following changes.

groups = arr.groupby(['Operator', 'Terminal'])

for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        # write the flag back to arr itself; assigning to g would only modify a copy
        arr.loc[aux, 'Already_linked'] = 1
    except (KeyError, ValueError):
        # no matching group, or vb is empty for this departure
        continue

Part of the reason your query is so slow is because it's searching over all of arr each time. In contrast, the .groupby() executes in roughly the same time as one query, but then for every subsequent iteration you can just use .get_group() to efficiently find the tiny subset of the data you care about.
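
To see that pattern in isolation, here is a small self-contained sketch (toy data, with column names mirroring the question) of paying the groupby cost once and then doing cheap keyed lookups per row:

import pandas as pd

# Toy arrivals table, just to illustrate the pattern.
arr = pd.DataFrame({
    'Operator': ['DL', 'DL', 'QR'],
    'Terminal': ['3', '3', '4'],
    'Already_linked': [0, 1, 0],
})

# One-time cost, roughly comparable to a single full pass over arr.
groups = arr.groupby(['Operator', 'Terminal'])

# Per-row cost: a keyed lookup that returns only the matching rows,
# instead of re-filtering all of arr on Operator/Terminal every iteration.
g = groups.get_group(('DL', '3'))
vb = g[g.Already_linked == 0]  # then filter the small slice
print(vb)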

A helpful (extremely crude) rule of thumb when benchmarking is that a billion simple operations take about a second. If you're seeing times much longer than that for something measured in millions of things, like your millions of rows, it means that for each of those rows you're doing enough work to add up to billions of operations. That leaves a ton of potential for better algorithms to reduce the number of operations, whereas vectorization only yields constant-factor improvements (and for many string/query operations not even great improvements at that).

Hans Musgrave
  • Yes. the `.query(row.query_string)` is by far what consumes the most time. The problem is that `arr.Already_Linked` changes in every iteration... therefore I do not know how can I not use a for loop (the sorting of the hour may help). Thank you. – Miguel Lambelho Sep 12 '18 at 17:09
  • It changes, but if you do something like `groups = arr.groupby('Terminal')`, then selecting `group = groups.get_group('my_terminal')` is extremely fast as opposed to running a query. You then filter that group by `arr.Already_Linked==0` rather than filtering all of `arr`. It gets even better if you group by Terminal and Operator since the groups will be smaller, and you only have to do that groupby operation once before you ever loop over all the rows. – Hans Musgrave Sep 12 '18 at 17:15
  • I am not sure if I understood it completely. Do you think you can adapt the code that I sent? I would really appreciate it. I also edited the code according to DJK's advice, but it is still too time consuming. – Miguel Lambelho Sep 12 '18 at 18:10
  • @MiguelLambelho Sorry it took so long. I had to go to work. Benchmark the change I proposed and see if it's faster for you. Unless you have something weird like every terminal and operator being the same, it should offer huge improvements. – Hans Musgrave Sep 13 '18 at 02:37
  • Thank you for the help @Hans Musgrave. The groupby definitely helped. Unfortunately, it still takes way more than the time constraints that I have. Do you know any more possibilities to decrease exponentially the running time? – Miguel Lambelho Sep 13 '18 at 09:43
  • I updated the question already with your advice. Thank you again. – Miguel Lambelho Sep 13 '18 at 10:20
  • @MiguelLambelho Maybe. How long is it taking, how long should it take, and have you done any profiling yet? How much RAM is it using? What is the CPU utilization? Which lines of code are the hotspots? The problem could be something as simple as excessive paging, and a solution would be to switch from pandas to a more memory-efficient implementation. We need some more details. – Hans Musgrave Sep 13 '18 at 15:09
  • Right now, it is taking around 25/30 minutes. Ideally, I would like to run it in 10 minutes (or even below, although I acknowledge that it would be really difficult to do it). One of the problems that is so slow is because a new `vb` dataframe is created for every single row. But I believe that the major problem is definitely what you mention: the fact that is written with `pandas`. Could you help me in changing it to a more memory-efficient implementation? Thank you very much in advance, I really appreciate your help. – Miguel Lambelho Sep 13 '18 at 16:44
  • Let me know if you need any additional information. – Miguel Lambelho Sep 14 '18 at 08:28

While this is not a vectorized solution, it should speed things up considerably if your sample data set mimics your true data set. Currently, you are wasting time looping over every row even though you only care about rows where ['A_D'] == 'D' and ['Already_linked'] == 0. Instead, remove the ifs and loop over the truncated dataframe, which is only 30% of the initial dataframe:

for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():

    vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].query(row.query_string)
    try:
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        arr.loc[aux, 'Already_linked'] = 1
    except ValueError:
        # vb is empty: no arrival satisfies the conditions for this departure
        continue
DJK
  • Thank you for your comment @DJK. It definitely helps. However, the dataframe is still way too big. But thank you!! – Miguel Lambelho Sep 12 '18 at 17:06
  • already edited the answer based on your comment. Thank you. Do you know about any other option to decrease exponentially the running time? – Miguel Lambelho Sep 12 '18 at 17:44
  • 1
    Will try and update the answer later, but try not to use`query()` it’s slower than Boolean indexing and since the filter is static and dependent on the current row just add it to the filter already done in the loop – DJK Sep 12 '18 at 22:30
  • Ok. Will use the `vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux) & (arr.Operator == row.Operator) & (arr.Terminal == row.Terminal)]` from now on. Thank you so very much! Will wait for your update. I am really grateful. – Miguel Lambelho Sep 12 '18 at 22:35
  • the groupby proposed by Hans Musgrave really helped. I updated the question with both your advices. If perhaps do you know more additional ways to make it faster, I would really appreciate it. – Miguel Lambelho Sep 13 '18 at 10:22

This solution uses pd.DataFrame.isin which uses numpy.in1d

Apparently 'isin' isn't necessarily faster for small datasets (like this sample), but is significantly faster for large datasets. You'll have to run it against your data to determine performance.
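
As a quick illustration of the isin-based masking used below (a toy frame for demonstration, not the question's data):

import pandas as pd

df = pd.DataFrame({'Operator': ['DL', 'QR', 'DL'],
                   'Terminal': ['3', '4', '4']})

# DataFrame.isin with a dict checks each listed column against its allowed
# values; .all(axis=1) combines the per-column results into a single row mask.
keep = {'Operator': ['DL'], 'Terminal': ['3']}
mask = df[list(keep)].isin(keep).all(axis=1)
print(df[mask])  # only rows where Operator == 'DL' and Terminal == '3'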

flight_record_linkage.ipynb

Expanded the dataset using c = pd.concat([c] * 10000, ignore_index=True)

  • Increase the dataset length by 3 orders of magnitude (10000 rows total).
    • Original method: Wall time: 8.98s
    • New method: Wall time: 16.4s
  • Increase the dataset length by 4 orders of magnitude (100000 rows total).
    • Original method: Wall time: 8min 17s
    • New method: Wall time: 1min 14s
  • Increase the dataset length by 5 orders of magnitude (1000000 rows total).
    • New method: Wall time: 11min 33s

New Method: Using isin and apply

def apply_do_g(it_row):
    """
    This is your function, but using isin and apply
    """
    keep = {'Operator': [it_row.Operator], 'Terminal': [it_row.Terminal]}  # dict for the combined isin mask

    holder1 = arr[list(keep)].isin(keep).all(axis=1)  # rows matching this Operator and Terminal
    holder2 = arr.Already_linked.isin([0])            # rows not yet linked
    holder3 = arr.hour < it_row.hour_aux              # arrivals earlier than the departure's cutoff

    holder = holder1 & holder2 & holder3              # combine the masks
    holder = arr.loc[holder]

    if not holder.empty:
        aux = np.absolute(holder.START - it_row.x).idxmin()
        c.loc[it_row.name, 'a'] = holder.loc[aux].FlightID  # with apply, the row label is 'it_row.name'
        arr.loc[aux, 'Already_linked'] = 1


def new_way_2():
    keep = {'A_D': ['D'], 'Already_linked': [0]}
    df_test = c[c[list(keep)].isin(keep).all(axis=1)].copy()  # departures that are not yet linked
    df_test.apply(lambda row: apply_do_g(row), axis=1)  # apply the row-wise linking function


# call the function
new_way_2()
Trenton McKinney

Your problem looks like one of the most common problems in database operations. I do not fully understand what you want to get, because you have not fully formulated the task. Now to a possible solution: avoid loops altogether.

You have a very long table with the columns time, FlightID, Operator, Terminal, A_D. The other columns and the dates do not matter, if I understand you correctly. Also, the start and end times of day are the same in every row. By the way, you can get the time column with table.loc[:, 'time'] = table.loc[:, 'START'].dt.time.

  1. table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal']). Your table will become significantly shorter.

  2. Split table into table_arr and table_dep according to A_D value: table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']], table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]

  3. It seems that everything you tried to get with loops you can get with a single line: table_result = table_arr.merge(table_dep, how='right', on=['Operator', 'Terminal'], suffixes=('_arr', '_dep')). It is basically the same operation as a JOIN in SQL.

According to my understanding of your problem, and with the tiny piece of data you have provided, you get just the desired output (the correspondence between FlightID_dep and FlightID_arr for all FlightID_dep values) without any loop, and much faster. table_result is:

  FlightID_arr Operator  Terminal  time_arr FlightID_dep  time_dep
0        DL401       DL         3  06:50:00        DL001  09:30:00
1        DL402       DL         3  07:45:00        DL001  09:30:00
2          NaN       VS         3       NaN        VS001  15:15:00

Of course, in the general case (with actual data) you will need one more step: filter table_result on the condition time_arr < time_dep, or any other condition you have (a sketch of that step follows the complete code below). Unfortunately, the data you have provided is not enough to fully solve your problem.

Complete code is:

import io
import pandas as pd

data = '''
START,END,A_D,Operator,FlightID,Terminal,TROUND_ID,tot
2017-03-26 16:55:00,2017-10-28 16:55:00,A,QR,QR001,4,QR002,70
2017-03-26 09:30:00,2017-06-11 09:30:00,D,DL,DL001,3,,84
2017-03-27 09:30:00,2017-10-28 09:30:00,D,DL,DL001,3,,78
2017-10-08 15:15:00,2017-10-22 15:15:00,D,VS,VS001,3,,45
2017-03-26 06:50:00,2017-06-11 06:50:00,A,DL,DL401,3,,9
2017-03-27 06:50:00,2017-10-28 06:50:00,A,DL,DL401,3,,19
2017-03-29 06:50:00,2017-04-19 06:50:00,A,DL,DL401,3,,3
2017-05-03 06:50:00,2017-10-25 06:50:00,A,DL,DL401,3,,32
2017-06-25 06:50:00,2017-10-22 06:50:00,A,DL,DL401,3,,95
2017-03-26 07:45:00,2017-10-28 07:45:00,A,DL,DL402,3,,58
'''

table = pd.read_csv(io.StringIO(data), parse_dates=[0, 1])
table.loc[:, 'time'] = table.loc[:, 'START'].dt.time
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal'])
table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']]
table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]

table_result = table_arr.merge(
    table_dep,
    how='right',
    on=['Operator', 'Terminal'],
    suffixes=('_arr', '_dep'))
print(table_result)
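
As a follow-up, here is a hedged sketch of that extra filtering step (the exact rule is an assumption: keep only arrivals earlier than the departure, then, for each departure, keep the arrival closest in time), continuing from table_result above:

# Sketch of the additional filtering discussed above (an assumption, not part
# of the answer's verified output): keep only pairs where the arrival precedes
# the departure, then keep the closest arrival for each departure.
linked = table_result.dropna(subset=['time_arr'])
linked = linked[linked['time_arr'] < linked['time_dep']]

# time difference as a sort key (times converted to Timedelta for arithmetic)
diff = (pd.to_timedelta(linked['time_dep'].astype(str))
        - pd.to_timedelta(linked['time_arr'].astype(str)))
linked = (linked.assign(diff=diff)
          .sort_values('diff')
          .drop_duplicates(subset=['FlightID_dep'], keep='first'))
print(linked)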
  • Thank you very much for your answer Poolka. I am still not sure if I understood it completely. Do you think you can post here the entire code that led you to the `table_result`? And could you explain me why you do not filter time_arr < time_dep in this example? – Miguel Lambelho Sep 15 '18 at 07:52
  • @MiguelLambelho All the code is already posted, all the loops are replaced by a single line. I did not filter the result because sample data is too small - filtering is not needed for provided 10 rows of data. –  Sep 15 '18 at 08:26
  • I tested your solution and I found a small problem. By using the merge method, you will have several departure flights linked to the same arrival. That is why I used the column `Already_linked` and in each iteration that column would be updated. Is there any way to solve this problem? – Miguel Lambelho Sep 17 '18 at 11:34
  • @MiguelLambelho Your question does not contain the full description of your task, so I can't know it, obviously. According to how you have stated the problem, you can choose a random departure for every arrival. If so then use `table_result.drop_duplicates(subset=['FlightID_dep', 'FlightID_arr'])` to perform additional filtering. Your sample data does not allow trying to handle such cases. –  Sep 17 '18 at 12:11
  • Yes. I understand that. But, the code that I presented includes the possibility to choose the arrivals that is the closest to a specific time. So, is there any way I can use this merge method so that the `time` is the closest possible to a specific time in the `x` column? – Miguel Lambelho Sep 17 '18 at 12:26
  • @MiguelLambelho Then find the difference between `time_arr` and `time_dep`, sort `table_result` by computed difference in ascending order and `drop_duplicates` with argument `keep='first'`. This is pretty straightforward way to get what you want. Or think of more elegant/optimized filtering. –  Sep 17 '18 at 12:34