Apologies if this is a duplicate; I looked and couldn't find this question answered (well) online.
Let's say I have a pandas dataframe with a name column. I'd like to query a database and get the id associated with each name, and store that value in the dataframe.
In this case, I'm reading CSV files (with name), and querying a MySQL database to get id. I do some data cleaning and then upload the dataframe to the MySQL database (a different table than the one I query to get id).
What's the best way to go about doing this? Here's what I came up with:
df["id"] = pd.Series(
    [pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'",
        connection)["id"][0] for name in df["name"]]
)
As I understand it, this avoids modifying df while iterating over it:
- We iterate over the values in df["name"]
- The SQL query returns a DataFrame
- The ["id"][0] pulls the value of interest (id) out of the DataFrame and stores it in a list
- This list of values is converted to a pd.Series
- This pd.Series is added to the original df
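For reference, the steps above can be sketched in a self-contained form. This is only a sketch under several assumptions: an in-memory SQLite database stands in for the MySQL connection, the table name people and the helper lookup_id are hypothetical, and the query uses = rather than LIKE on the assumption that an exact match is intended. The values are passed as query parameters instead of being formatted into the SQL string, so a name containing an apostrophe does not break the query. Note that the placeholder syntax depends on the driver (sqlite3 uses ?, while MySQL drivers commonly use %s).

```python
import sqlite3

import pandas as pd

# Stand-in for the MySQL connection: an in-memory SQLite database
# with a hypothetical lookup table called "people".
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "o'brien")],
)


def lookup_id(name):
    """Return the id for one name, or None when no row matches."""
    rows = pd.read_sql_query(
        "SELECT id FROM people WHERE name = ?",
        connection,
        params=(name,),
    )
    return None if rows.empty else int(rows["id"].iloc[0])


# A non-default index, as can happen after filtering rows during cleaning.
df = pd.DataFrame({"name": ["o'brien", "alice", "zed"]}, index=[10, 20, 30])

# map() returns a Series that shares df's index, so each id lands on
# the row of the name it was looked up for.
df["id"] = df["name"].map(lookup_id)
```

Using map here instead of wrapping a list in pd.Series matters when df does not have the default 0..n-1 index: a freshly built pd.Series gets its own default index, and column assignment aligns on index labels rather than on position.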
This works, but it doesn't really force an association between name and id, which (I think) would be more robust. (And, obviously, there's no error handling in the code snippet above.)
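One way such an explicit association between name and id might look is to fetch all the (name, id) pairs in a single query and join them to the dataframe on name. The sketch below is illustrative only: it assumes an in-memory SQLite database in place of MySQL, a hypothetical table called people, and exact name matches. The f-string only builds the list of ? placeholders; the names themselves are passed as parameters.

```python
import sqlite3

import pandas as pd

# Stand-in for the MySQL connection: an in-memory SQLite database
# with a hypothetical lookup table called "people".
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)

# Stand-in for the dataframe read from the CSV files.
df = pd.DataFrame({"name": ["bob", "alice", "dave"]})

# Fetch every needed (name, id) pair in one parameterized query.
# This assumes at least one name; an empty IN () list is not valid SQL.
names = df["name"].dropna().unique().tolist()
placeholders = ", ".join("?" for _ in names)
lookup = pd.read_sql_query(
    f"SELECT name, id FROM people WHERE name IN ({placeholders})",
    connection,
    params=names,
)

# A left merge ties each id to its name by value, not by position.
# Names with no match (here "dave") get a missing value instead of raising.
result = df.merge(lookup, on="name", how="left")
```

This replaces one query per row with one query in total, and unmatched names show up as missing ids that can be checked with result["id"].isna(). If the lookup table could contain the same name more than once, the merge would duplicate rows, so that case would need handling separately.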
I tried combinations of apply, map, iterrows, and iteritems, and that didn't work so well. For example:
for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    temp_df[temp_df["name"] == name]["id"] = id_df["id"][0]
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
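As far as I can tell, the warning is triggered by the chained indexing: temp_df[mask] may produce a temporary copy, and the following ["id"] = ... then writes to that copy rather than to temp_df. A minimal sketch of the .loc form that the warning suggests is shown below. The dictionary ids_by_name is a hypothetical stand-in for the results of the per-name SQL queries, and the sample data is made up.

```python
import pandas as pd

temp_df = pd.DataFrame({"name": ["alice", "bob", "alice"]})
temp_df["id"] = pd.NA  # placeholder column to be filled in

# Hypothetical lookup results standing in for the per-name SQL queries.
ids_by_name = {"alice": 1, "bob": 2}

for name, found_id in ids_by_name.items():
    # A single .loc call with a row mask and a column label writes
    # to temp_df itself rather than to an intermediate copy.
    temp_df.loc[temp_df["name"] == name, "id"] = found_id
```

With the lookup results already in a dictionary, the loop could also be replaced by temp_df["id"] = temp_df["name"].map(ids_by_name), which avoids the row-by-row assignment altogether.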