2

There is a very popular answer on stackoverflow that you should not iterate over Pandas' dataframes:

https://stackoverflow.com/a/55557758/11826257

In my case, I want to take values from two columns of a dataframe and create a list of SQL INSERT INTO... statements with them. Like this:

import pandas as pd

df = pd.DataFrame({'velocity':[12,10,15], 'color':['blue','green','yellow']})

mylist = list()
for index, row in df.iterrows():
    mylist.append('INSERT INTO mytable(velocity, color) VALUES (' + \
                  str(row['velocity']) + \
                  ', "' + \
                  str(row['color']) + \
                  '");' )

[print(x) for x in mylist]
# INSERT INTO mytable(velocity, color) VALUES (12, "blue");
# INSERT INTO mytable(velocity, color) VALUES (10, "green");
# INSERT INTO mytable(velocity, color) VALUES (15, "yellow");

I understand that I could write something like this: [mylist.append('INSERT INTO mytable(velocity) VALUES ('+ str(x) + ');') for x in df["velocity"]] if I were only interested in one column. But is that what is meant by "vectorization"? And how would it apply to a case where you need two items from each row of a pandas' dataframe?

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
Snoeren01
  • 343
  • 3
  • 12

2 Answers2

3

Vectorized version would be something like this,

queries = (
    'INSERT INTO mytable(velocity, color) VALUES (' +
    df['velocity'].astype(str) +
    ', "' +
    df['color'].astype(str) +
    '");'
)
print(queries.to_list())

Output

['INSERT INTO mytable(velocity, color) VALUES (12, "blue");',
 'INSERT INTO mytable(velocity, color) VALUES (10, "green");',
 'INSERT INTO mytable(velocity, color) VALUES (15, "yellow");']

Efficient insert into Database table

df[['velocity', 'color']].to_sql(
    name='table_name',
    con=engine,
    schema='online',
    index=False,
    if_exists='append'
)
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
0

By default, Pandas/Numpy has minimal vectorization on string operations. One thing you can do is to avoid append as it can be costly when you have a long dataframe:

mylist = ['INSERT INTO mytable(velocity, color) VALUES (' + \
                  str(row['velocity']) + \
                  ', "' + \
                  str(row['color']) + \
                  '");' 
          for index, row in df.iterrows()
        ]
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74