How to properly vectorize instead of iterate?

Question

There is a very popular answer on Stack Overflow saying that you should not iterate over pandas DataFrames:

https://stackoverflow.com/a/55557758/11826257

In my case, I want to take values from two columns of a dataframe and create a list of SQL INSERT INTO... statements with them. Like this:

import pandas as pd

df = pd.DataFrame({'velocity':[12,10,15], 'color':['blue','green','yellow']})

mylist = list()
for index, row in df.iterrows():
    mylist.append('INSERT INTO mytable(velocity, color) VALUES (' + \
                  str(row['velocity']) + \
                  ', "' + \
                  str(row['color']) + \
                  '");' )

for x in mylist:
    print(x)
# INSERT INTO mytable(velocity, color) VALUES (12, "blue");
# INSERT INTO mytable(velocity, color) VALUES (10, "green");
# INSERT INTO mytable(velocity, color) VALUES (15, "yellow");

I understand that, if I were only interested in one column, I could write something like this: `[mylist.append('INSERT INTO mytable(velocity) VALUES ('+ str(x) + ');') for x in df["velocity"]]`. But is that what is meant by "vectorization"? And how would it apply to a case where you need two items from each row of a pandas DataFrame?

Vishnudev Krishnadas · Accepted Answer · 2020-11-03T18:19:37.167

3

A vectorized version would look something like this:

queries = (
    'INSERT INTO mytable(velocity, color) VALUES (' +
    df['velocity'].astype(str) +
    ', "' +
    df['color'].astype(str) +
    '");'
)
print(queries.to_list())

Output

['INSERT INTO mytable(velocity, color) VALUES (12, "blue");',
 'INSERT INTO mytable(velocity, color) VALUES (10, "green");',
 'INSERT INTO mytable(velocity, color) VALUES (15, "yellow");']

Efficient insert into a database table

# `engine` is assumed to be an existing SQLAlchemy engine for the target database
df[['velocity', 'color']].to_sql(
    name='table_name',
    con=engine,
    schema='online',
    index=False,
    if_exists='append'
)
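To try `to_sql` without a database server, the sketch below swaps in an in-memory SQLite connection from Python's standard library in place of the `engine` used above. The table name `mytable` is borrowed from the question, and the `schema` argument is left out because it does not apply to this SQLite stand-in. Treat it as an illustration under those assumptions, not as the answer's exact setup.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'velocity': [12, 10, 15], 'color': ['blue', 'green', 'yellow']})

# Assumption: an in-memory SQLite database stands in for the real target.
conn = sqlite3.connect(':memory:')

# pandas creates the table if it is missing and appends all rows in one call.
df[['velocity', 'color']].to_sql(
    name='mytable',
    con=conn,
    index=False,
    if_exists='append'
)

# Read the rows back to confirm what was written.
rows = conn.execute('SELECT velocity, color FROM mytable ORDER BY rowid').fetchall()
print(rows)
conn.close()
```

No INSERT strings are built by hand here; pandas generates the statements and passes the values to the driver.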

edited Nov 03 '20 at 18:19

answered Nov 03 '20 at 15:48


Very nice. I added `queries.apply(cursor.execute)` and my data get imported into the database exactly how I wanted it. – Snoeren01 Nov 03 '20 at 17:47

Glad it worked. But it is inefficient to do it that way. – Vishnudev Krishnadas Nov 03 '20 at 18:13
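One possible way to avoid executing one statement per row, sketched here under assumptions: hand all rows to the driver's `executemany` together with a single parameterized statement. The example uses Python's built-in `sqlite3` module and an in-memory database purely as a stand-in for the real database; the `?` placeholder style is specific to that driver and differs for others.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'velocity': [12, 10, 15], 'color': ['blue', 'green', 'yellow']})

# Assumption: in-memory SQLite database with a hypothetical table definition.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (velocity INTEGER, color TEXT)')

# One plain tuple per row, e.g. (12, 'blue').
params = list(df[['velocity', 'color']].itertuples(index=False, name=None))

# A single parameterized statement; the driver substitutes the values.
cursor.executemany('INSERT INTO mytable(velocity, color) VALUES (?, ?)', params)
conn.commit()

inserted = cursor.execute('SELECT velocity, color FROM mytable ORDER BY rowid').fetchall()
print(inserted)
conn.close()
```

Because the values travel as parameters rather than as part of the SQL text, a color containing a quote character would not break the statement.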

score 0 · Answer 2 · answered Nov 03 '20 at 15:40

By default, pandas and NumPy offer little true vectorization for string operations. One thing you can do is avoid repeated `append` calls, which can be costly for a long DataFrame, by building the list with a comprehension:

mylist = ['INSERT INTO mytable(velocity, color) VALUES (' +
          str(row['velocity']) +
          ', "' +
          str(row['color']) +
          '");'
          for index, row in df.iterrows()]
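A further variation, offered as a sketch and not as part of the answer above: looping over the two columns with `zip` avoids `iterrows()`, which builds a Series object for every row. This is still a Python-level loop, not vectorization.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [12, 10, 15], 'color': ['blue', 'green', 'yellow']})

# zip walks both columns in parallel, yielding one (velocity, color) pair per row.
mylist = [
    f'INSERT INTO mytable(velocity, color) VALUES ({velocity}, "{color}");'
    for velocity, color in zip(df['velocity'], df['color'])
]

for statement in mylist:
    print(statement)
```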
