2

I have the following dataframe (df_inventory) of unique cars that has around 3000 rows:

    stock_id  car_brand   car_model     age_group      mileage_group
    AS12354   Mazda       CX-7          6-10           50-100k
    DX22234   Toyota      Verso         11>            150k>
    KL32423   Volkswagen  Touran        11>            150k>
    SU12121   Renault     Twingo        6-10           50-100k
    ...       ...         ...           ...            ...

and then I have the following dataframe (df_main) of (unique) cars that has around 100000 rows:

    clientID  car_brand   car_model     age_group      mileage_group
    23132     Volkswagen  Golf          6-10           50-100k
    24234     Renault     Twingo        11>            150k>
    19055     Volkswagen  Polo          11>            150k>
    23245     Renault     Clio          1-2            0-50k
    ...       ...         ...           ...            ...

What I would like to get at the end is a dataframe with two columns. One will be the clientID and the second one will be the Stock_id IF there is a hard match on all other columns (car_brand, car_model, age_group, mileage_group).

This can be achieved using two for loops, iterating every combination between stock_id and clientID. But that leads to 3000*100000 iterations and it takes so much time:

clientID=[]
stock_id=[]

for i in range(df_main.shape[0]):
        for j in range(df_inventory.shape[0]):

            if (df_main.iloc[i,1]==df_inventory.iloc[j,1] 
            and df_main.iloc[i,2]==df_inventory.iloc[j,2]
            and df_main.iloc[i,3]==df_inventory.iloc[j,3]
            and df_main.iloc[i,4]==df_inventory.iloc[j,4] ):

                        clientID.append(df_main.iloc[i,0])
                        stock_id.append(df_inventory.iloc[j,0])

Is there any pythonic way (list comprehension) for example that could make this process faster?

glibdud
  • 7,550
  • 4
  • 27
  • 37
Marios
  • 26,333
  • 8
  • 32
  • 52
  • Use [`pd.merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) (or `join` if joining on indexes). It's a one liner. You'll need to specify `how="inner"` (well, that's the default so you could leave it out) and then something like `on=df.columns[1:]`. When it comes to pandas, if you find yourself writing a loop, including a comprehension, then you are probably doing it wrong. – Dan Oct 08 '19 at 14:09
  • 2
    it sounds like you want an inner join on `(car_brand, car_model, age_group, mileage_group)` – Buckeye14Guy Oct 08 '19 at 14:10
  • Recommend you take a look at https://stackoverflow.com/a/55557758/1011724 – Dan Oct 08 '19 at 14:14

1 Answers1

0

Thanks a lot, eventually it was very simple:

merged = df_main.merge(df_inventory, how='inner', on=['car_brand', 
'car_main_type','car_age_years_group','mileage_group'])
Marios
  • 26,333
  • 8
  • 32
  • 52