
I have a question on Python and database design. Say I have a large tabular database (like SQL) with many columns (features) and millions of rows (records).

I would like to extract each record with its many features so they can be processed together.

The database is very large: millions of records and tens of thousands of features. What I currently do is extract each column into a separate list, so that a record can be referenced by the same index across the different lists.
Is there a better way to do this? Would a dataframe be better?

As an example, I am processing 3 records (in reality there are millions) and displaying them with a print statement:

namelist = ['Peter', 'John', 'Susan']
agelist = [16, 17, 18]
activitylist = ['play tennis', 'play chess', 'swim']

for i, name in enumerate(namelist):
    print('Hi, my name is ' + name + '. I am ' + str(agelist[i]) + ' years old and I like to ' + activitylist[i])

Output:

Hi, my name is Peter. I am 16 years old and I like to play tennis  
Hi, my name is John. I am 17 years old and I like to play chess  
Hi, my name is Susan. I am 18 years old and I like to swim
  • Also, in your example, most of the time will be taken up `print`ing millions of lines to the console, so the way data is stored is less relevant. If you could write the output to a file or simply store it in another variable, this will be much quicker. – Stuart Apr 12 '23 at 10:17
  • may I know how to do vectorize the dataframe for the string concatenation example? – Sy.Yah Apr 13 '23 at 05:45
  • 1
    @perpetualstudent's answer shows how to vectorize – Stuart Apr 13 '23 at 10:48
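As a side note on Stuart's comment about console output dominating the runtime, here is a minimal sketch of writing the lines to a file instead of printing them (the filename `output.txt` is arbitrary):

```python
namelist = ['Peter', 'John', 'Susan']
agelist = [16, 17, 18]
activitylist = ['play tennis', 'play chess', 'swim']

# Writing to a file avoids the per-line overhead of printing to the console
with open('output.txt', 'w') as f:
    for name, age, activity in zip(namelist, agelist, activitylist):
        f.write(f'Hi, my name is {name}. I am {age} years old and I like to {activity}\n')
```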

2 Answers


I'll use the setup code below to elaborate my answer and for time measurement. Note that the dataframe has 3 million records, and I've picked appending formatted strings to a list as the example operation.

import pandas as pd
import time

namelist = ['Peter', 'John', 'Susan'] * 1000000
agelist = [16, 17, 18] * 1000000
activitylist = ['play tennis', 'play chess', 'swim'] * 1000000
df = pd.DataFrame({'name': namelist, 'age': agelist, 'activity': activitylist})

Your original method is already very efficient, especially if the data already comes in separate lists; the operation takes about 1.8s on my machine:

start = time.time()
result = []
for i, name in enumerate(namelist):
    result.append('Hi, my name is ' + name + '. I am ' + str(agelist[i]) + ' years old and I like to ' + activitylist[i])
end = time.time()
print(end - start)

Output:

1.7815442085266113

Here are some alternative methods if the data comes in a dataframe like this:

df = pd.DataFrame({'name': namelist, 'age': agelist, 'activity': activitylist})

Method (1) using df.iterrows()
This method iterates row by row and is very slow. The documentation on iteration includes a warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed...

This method takes about 112.3s on my machine:

start = time.time()
result = []
for i, row in df.iterrows():
    result.append('Hi, my name is ' + row['name'] + '. I am ' + str(row['age']) + ' years old and I like to ' + row['activity'])
end = time.time()
print(end - start)

Output:

112.2983672618866

Method (2) using df.to_numpy()
This method converts the dataframe to a row-wise numpy array, then iterates through each row by index. This is the closest to your original list manipulation. It takes about 2.7s on my machine:

start = time.time()
result = []
for row in df.to_numpy():
    result.append('Hi, my name is ' + row[0] + '. I am ' + str(row[1]) + ' years old and I like to ' + row[2])
end = time.time()
print(end - start)

Output:

2.7370002269744873

Method (3) Vectorization
Non-vectorized methods (like df.iterrows() or df.apply()) call a Python function for every row, and each call carries extra overhead. In contrast, a vectorized operation is much faster because it avoids running Python code in the inner loop. It takes about 1.9s on my machine:

start = time.time()
df['age'] = df['age'].astype(str)
df['result'] = 'Hi, my name is ' + df['name'] + '. I am ' + df['age'] + ' years old and I like to ' + df['activity']
result = df['result'].tolist()
end = time.time()
print(end - start)

Output:

1.8785054683685303
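A small variant of the vectorized approach, in case you want to keep the age column numeric: converting with `.astype(str)` inline avoids mutating the dataframe (a sketch, shown with 3 records for brevity):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'John', 'Susan'],
                   'age': [16, 17, 18],
                   'activity': ['play tennis', 'play chess', 'swim']})

# Convert age to string inline so the original integer column is preserved
result = ('Hi, my name is ' + df['name'] + '. I am ' + df['age'].astype(str)
          + ' years old and I like to ' + df['activity']).tolist()
```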

Method (4) List Comprehension with Zip
This method, as suggested by @Stuart, seems the fastest overall! It took only about 0.7s on my machine:

start = time.time()
result = [f'Hi, my name is {name}. I am {age} years old and I like to {activity}'
           for name, age, activity in zip(namelist, agelist, activitylist)]
end = time.time()
print(end - start)

Output:

0.7034788322448731
blackraven
  • Thanks I would be using Method (3) Vectorization as the original data comes in tabular format! – Sy.Yah May 30 '23 at 08:11

The fastest way to do this seems to be to zip the lists together and use a list comprehension with an f-string:

result = [f'Hi, my name is {name}. I am {age} years old and I like to {activity}'
           for name, age, activity in zip(namelist, agelist, activitylist)]

The vectorized pandas method (shown in @perpetualstudent's answer) seems to be slightly slower than this, even if age is stored as strings so that no conversion is needed. This is apparently because string operations in pandas cannot be sped up by vectorization to the same extent as mathematical operations.
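If the data starts in a dataframe rather than separate lists, the same list comprehension can still be applied by zipping the columns directly (a sketch, shown with 3 records for brevity):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'John', 'Susan'],
                   'age': [16, 17, 18],
                   'activity': ['play tennis', 'play chess', 'swim']})

# zip iterates the three columns in lockstep, which is far cheaper
# than row-by-row access with df.iterrows()
result = [f'Hi, my name is {name}. I am {age} years old and I like to {activity}'
          for name, age, activity in zip(df['name'], df['age'], df['activity'])]
```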

Stuart