I'll use the setup code below to illustrate my answer and to measure timings. Note that the dataframe has 3 million records, and as an example operation I build a sentence from each record and append it to a list.
import pandas as pd
import time
namelist = ['Peter', 'John', 'Susan'] *1000000
agelist = [16, 17, 18] *1000000
activitylist = ['play tennis', 'play chess', 'swim'] *1000000
df = pd.DataFrame({'name': namelist, 'age': agelist, 'activity': activitylist})
Your original method is already very efficient, especially if the data is already available as separate lists. The operation takes about 1.8s on my machine:
start = time.time()
result = []
for i, name in enumerate(namelist):
    result.append('Hi, my name is ' + name + '. I am ' + str(agelist[i]) + ' years old and I like to ' + activitylist[i])
end = time.time()
print(end - start)
Output:
1.7815442085266113
Here are some alternative methods for when the data comes in a dataframe like this:
df = pd.DataFrame({'name': namelist, 'age': agelist, 'activity': activitylist})
Method (1) using df.iterrows()
This method iterates through the dataframe row by row, and it's very slow. The documentation on iteration has a warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed...
Anyway, this method takes about 112.3s on my machine:
start = time.time()
result = []
for i, row in df.iterrows():
    result.append('Hi, my name is ' + row['name'] + '. I am ' + str(row['age']) + ' years old and I like to ' + row['activity'])
end = time.time()
print(end - start)
Output:
112.2983672618866
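If you really do need to loop over dataframe rows, a commonly used lighter-weight alternative to df.iterrows() is df.itertuples(), which yields namedtuples instead of building a Series for every row. I have not included it in the timings in this answer, so treat the following as a sketch on a small made-up frame (small_df is a hypothetical name) rather than a benchmarked result:

```python
import pandas as pd

# Small hypothetical sample; the timings in this answer use 3 million rows.
small_df = pd.DataFrame({
    'name': ['Peter', 'John', 'Susan'],
    'age': [16, 17, 18],
    'activity': ['play tennis', 'play chess', 'swim'],
})

result = []
# index=False leaves the index out of each tuple; columns become attributes.
for row in small_df.itertuples(index=False):
    result.append(f'Hi, my name is {row.name}. I am {row.age} years old and I like to {row.activity}')

print(result[0])
```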
Method (2) using df.to_numpy()
This method converts the dataframe to a NumPy array, then iterates over its rows and accesses each field by position. This is the closest to the list manipulation in your original code. It takes about 2.7s on my machine:
start = time.time()
result = []
for row in df.to_numpy():
    result.append('Hi, my name is ' + row[0] + '. I am ' + str(row[1]) + ' years old and I like to ' + row[2])
end = time.time()
print(end - start)
Output:
2.7370002269744873
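One detail worth knowing here: because the columns mix strings and integers, df.to_numpy() has to fall back to a single object array, so each element is still a Python-level object and the str() call on the age is still needed. A quick check on a small made-up frame (small_df is a hypothetical name):

```python
import pandas as pd

# Small hypothetical sample with the same column layout as df.
small_df = pd.DataFrame({
    'name': ['Peter', 'John', 'Susan'],
    'age': [16, 17, 18],
    'activity': ['play tennis', 'play chess', 'swim'],
})

arr = small_df.to_numpy()
print(arr.dtype)  # mixed string/int columns collapse to a common object dtype
print(arr.shape)  # one array row per dataframe row

# Each row is a 1-D array indexed by column position.
row = arr[0]
sentence = 'Hi, my name is ' + row[0] + '. I am ' + str(row[1]) + ' years old and I like to ' + row[2]
print(sentence)
```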
Method (3) Vectorization
Non-vectorized methods (like df.iterrows() or df.apply()) run Python-level code for every row, and that code does additional work each time. In contrast, this vectorized operation is much faster than those because it avoids using Python code in inner loops. It takes about 1.9s on my machine:
start = time.time()
df.age = df.age.astype('str')
df['result'] = 'Hi, my name is ' + df.name + '. I am ' + df.age + ' years old and I like to ' + df.activity
result = df.result.tolist()
end = time.time()
print(end - start)
Output:
1.8785054683685303
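Note that the snippet above overwrites df.age with strings and adds a result column to df. If you would rather leave the original frame untouched, a variant that converts the age on the fly might look like this. It is sketched on a small made-up frame (small_df is a hypothetical name), and I have not timed it separately:

```python
import pandas as pd

# Small hypothetical sample with the same column layout as df.
small_df = pd.DataFrame({
    'name': ['Peter', 'John', 'Susan'],
    'age': [16, 17, 18],
    'activity': ['play tennis', 'play chess', 'swim'],
})

# Build the sentences column-wise; astype(str) returns a new Series,
# so the 'age' column in small_df keeps its integer dtype.
result = (
    'Hi, my name is ' + small_df['name']
    + '. I am ' + small_df['age'].astype(str)
    + ' years old and I like to ' + small_df['activity']
).tolist()

print(result[0])
```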
Method (4) List Comprehension with Zip
This method, as suggested by @Stuart, seems the fastest overall! It took only about 0.7s on my machine:
start = time.time()
result = [f'Hi, my name is {name}. I am {age} years old and I like to {activity}'
          for name, age, activity in zip(namelist, agelist, activitylist)]
end = time.time()
print(end - start)
Output:
0.7034788322448731
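If you want to re-run the comparison yourself, it can help to wrap each method in a function and confirm they produce identical output before looking at the timings. Below is a rough sketch with a smaller dataset; the helper names (build_loop, build_zip, timed) are made up for illustration, and it uses time.perf_counter(), which is intended for measuring short durations, instead of time.time():

```python
import time

# Smaller hypothetical dataset so the check runs quickly.
namelist = ['Peter', 'John', 'Susan'] * 1000
agelist = [16, 17, 18] * 1000
activitylist = ['play tennis', 'play chess', 'swim'] * 1000

def build_loop():
    # Index-based loop, as in the original method.
    result = []
    for i, name in enumerate(namelist):
        result.append('Hi, my name is ' + name + '. I am ' + str(agelist[i])
                      + ' years old and I like to ' + activitylist[i])
    return result

def build_zip():
    # List comprehension with zip, as in Method (4).
    return [f'Hi, my name is {name}. I am {age} years old and I like to {activity}'
            for name, age, activity in zip(namelist, agelist, activitylist)]

def timed(func):
    # Return the function's output together with the elapsed seconds.
    start = time.perf_counter()
    out = func()
    return out, time.perf_counter() - start

loop_result, loop_seconds = timed(build_loop)
zip_result, zip_seconds = timed(build_zip)

# Both methods should build exactly the same list of sentences.
assert loop_result == zip_result
print(f'loop: {loop_seconds:.4f}s, zip: {zip_seconds:.4f}s')
```

The absolute numbers will differ from machine to machine, so only the relative ordering is meaningful.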