
Building on this question, starting with this pandas DataFrame:

import pandas as pd
data = {'id':[1, 419, 425, 432],
        'city_0':['Prague', 'Prague', 'Copenhagen', 'Santiago'],
        'city_1':['Copenhagen', 'Barcelona', 'Barcelona', 'Berlin'],
        'Fare 0->1':[1000, 1200, 1500, 2050],
        'Fare 1->0':[1100, 1150, 1600, 2000]
}
df = pd.DataFrame(data)

Input df:

    id      city_0      city_1  Fare 0->1  Fare 1->0
0    1      Prague  Copenhagen       1000       1100
1  419      Prague   Barcelona       1200       1150
2  425  Copenhagen   Barcelona       1500       1600
3  432    Santiago      Berlin       2050       2000 

I'm trying to generate this kind of adjacency matrix, where df.X.Y is the fare for going from X to Y.

Expected output:

           Prague Copenhagen Santiago Barcelona Berlin
Prague        NaN       1100      NaN      1150    NaN
Copenhagen   1000        NaN      NaN      1600    NaN
Santiago      NaN        NaN      NaN       NaN   2000
Barcelona    1200       1500      NaN       NaN    NaN
Berlin        NaN        NaN     2050       NaN    NaN
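
For example, with the matrix above, attribute access picks the column and then the row:

df_fares_adjacency.Prague.Copenhagen   # -> 1000, the Prague -> Copenhagen fare
df_fares_adjacency.Copenhagen.Prague   # -> 1100, the Copenhagen -> Prague fare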

What I've tried:

df_city_0 = df[['city_0']].copy()
df_city_1 = df[['city_1']].copy()

df_city_0.columns = ['city']  # rename both columns to a single name
df_city_1.columns = ['city']

df_cities = pd.concat([df_city_0, df_city_1])  # stack them into one column
df_cities = df_cities['city'].unique()

# array(['Prague', 'Copenhagen', 'Santiago', 'Barcelona', 'Berlin'], dtype=object)

df_fares_adjacency = pd.DataFrame(columns=df_cities, index=df_cities)

#            Prague Copenhagen Santiago Barcelona Berlin
# Prague        NaN        NaN      NaN       NaN    NaN
# Copenhagen    NaN        NaN      NaN       NaN    NaN
# Santiago      NaN        NaN      NaN       NaN    NaN
# Barcelona     NaN        NaN      NaN       NaN    NaN
# Berlin        NaN        NaN      NaN       NaN    NaN

for index, row in df.iterrows():
    # fare from city_0 to city_1 goes in column city_0, row city_1 (and vice versa)
    df_fares_adjacency.loc[row['city_1'], row['city_0']] = row['Fare 0->1']
    df_fares_adjacency.loc[row['city_0'], row['city_1']] = row['Fare 1->0']

#            Prague Copenhagen Santiago Barcelona Berlin
# Prague        NaN       1100      NaN      1150    NaN
# Copenhagen   1000        NaN      NaN      1600    NaN
# Santiago      NaN        NaN      NaN       NaN   2000
# Barcelona    1200       1500      NaN       NaN    NaN
# Berlin        NaN        NaN     2050       NaN    NaN

This way I'm able to get the desired matrix, but looping over a DataFrame feels wrong.

Is there a more efficient and 'pandasic' way than using df.iterrows() over what could potentially be a very large DataFrame?

Pranab

2 Answers


Very interesting question. I decided to time the loop against the crosstab solution by ALollz, and found that the crosstab method is actually slower. I made a dummy DataFrame of 5000 rows, matching 5000 cities with 5000 different cities (using random strings as city names), then repeated the test with 7500 and 10000 cities.

  • The crosstab method took 7.2, 16.3 and 28.8 seconds respectively.
  • The loop method took only 5.8, 11.7 and 20.1 seconds respectively.

It seems you might be better off using your old code, even though it looks less pandasic.

I used the following code to test:

import string
import random
import time
import pandas as pd
import numpy as np

def ran(_):
    """Return a random string of 10 uppercase letters and digits."""
    S = 10  # number of characters in the string
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=S))

# Generate random fares and random city names
N2 = 5000
to = np.random.rand(N2)
fro = np.random.rand(N2)
cities1 = [ran(i) for i in range(N2)]
cities2 = [ran(i) for i in range(N2)]

# Create the test DataFrame
df = pd.DataFrame({'city_0': cities1, 'city_1': cities2,
                   'Fare 0->1': to, 'Fare 1->0': fro})


# Loop method:
timet = time.perf_counter()

df_city_0 = df[['city_0']].copy()
df_city_1 = df[['city_1']].copy()

df_city_0.columns = ['city']  # rename both columns to a single name
df_city_1.columns = ['city']

df_cities = pd.concat([df_city_0, df_city_1])  # stack them into one column
df_cities = df_cities['city'].unique()

df_fares_adjacency = pd.DataFrame(columns=df_cities, index=df_cities)

for index, row in df.iterrows():
    df_fares_adjacency.loc[row['city_1'], row['city_0']] = row['Fare 0->1']
    df_fares_adjacency.loc[row['city_0'], row['city_1']] = row['Fare 1->0']
print(time.perf_counter() - timet)

# Crosstab method
timet = time.perf_counter()
t1 = pd.crosstab(df['city_1'], df['city_0'], values=df['Fare 0->1'], aggfunc='sum')
t2 = pd.crosstab(df['city_0'], df['city_1'], values=df['Fare 1->0'], aggfunc='sum')

df1 = t1.combine_first(t2)
print(time.perf_counter() - timet)
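
For completeness, a pivot-based variant might also be worth timing (a minimal sketch, not included in the numbers above; it assumes each ['city_0', 'city_1'] pair appears only once, since pivot raises on duplicate pairs):

# pivot skips crosstab's group-by aggregation, at the cost of requiring unique pairs
p1 = df.pivot(index='city_1', columns='city_0', values='Fare 0->1')
p2 = df.pivot(index='city_0', columns='city_1', values='Fare 1->0')
df2 = p1.combine_first(p2)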




If your DataFrame is not duplicated on ['city_1', 'city_0'], you can do this with crosstab, using the fares as the values, and then combine_first to join the results. If you do have duplicated rows, you can change the aggfunc to something like 'mean' to get the average fare (see the sketch after the output below).

t1 = pd.crosstab(df['city_1'], df['city_0'], values=df['Fare 0->1'], aggfunc='sum')
t2 = pd.crosstab(df['city_0'], df['city_1'], values=df['Fare 1->0'], aggfunc='sum')

df1 = t1.combine_first(t2)

            Barcelona  Berlin  Copenhagen  Prague  Santiago
Barcelona         NaN     NaN      1500.0  1200.0       NaN
Berlin            NaN     NaN         NaN     NaN    2050.0
Copenhagen     1600.0     NaN         NaN  1000.0       NaN
Prague         1150.0     NaN      1100.0     NaN       NaN
Santiago          NaN  2000.0         NaN     NaN       NaN
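
As a minimal sketch of the duplicated-rows case mentioned above, swapping the aggfunc to 'mean' averages the fares over repeated city pairs:

# average rather than sum the fares for duplicated (city_0, city_1) pairs
t1 = pd.crosstab(df['city_1'], df['city_0'], values=df['Fare 0->1'], aggfunc='mean')
t2 = pd.crosstab(df['city_0'], df['city_1'], values=df['Fare 1->0'], aggfunc='mean')
df1 = t1.combine_first(t2)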
ALollz
  • Is there any potential performance benefit to using `crosstab` and `combine_first`, over the `df.iterrows()` based solution? – Pranab Jul 22 '21 at 18:08
  • @Pranab No it doesn't seem to be faster, though a bit less writing. Guess `crosstab` is pretty slow under the hood. – ALollz Jul 22 '21 at 18:42