Apologies if this is a duplicate; I looked and couldn't find this question answered (well) online.

Let's say I have a pandas dataframe with a name column. I'd like to query a database and get the id associated with each name, and store that value in the dataframe.

In this case, I'm reading CSV files (which contain name) and querying a MySQL database to get id. I do some data cleaning and then upload the dataframe to the MySQL database (a different table than the one I query to get id).

What's the best way to go about doing this? Here's what I came up with:

df["id"] = pd.Series(
    [pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'",
            connection)["id"][0] for name in df["name"]]
    )

As I understand it, this avoids modifying df while iterating over it.

  • We iterate over the values in df["name"]
  • The SQL query returns a DataFrame
  • The ["id"][0] pulls the value of interest (id) out of the DataFrame and stores it in a list
  • This list of values is converted to a pd.Series
  • This pd.Series is added to the original df

This works, but it doesn't really force an association between name and id, which (I think) would be more robust. (And, obviously, there's no error handling in the code snippet above.)
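For example, something like the following would keep each id explicitly keyed by its name and handle names with no match (a rough, untested sketch; fetch_id is just a name I made up, and table/connection are the same placeholders as above):

import numpy as np
import pandas as pd

def fetch_id(name):
    # One query per name, as in the snippet above
    result = pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    # Fall back to NaN if the name isn't in the table
    return result["id"][0] if len(result) else np.nan

# Build an explicit name -> id mapping, then apply it, so each id
# is tied to its name rather than relying on row order
name2id = {name: fetch_id(name) for name in df["name"].unique()}
df["id"] = df["name"].map(name2id)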

I tried combinations of apply, map, iterrows, and iteritems, none of which worked well. For example:

for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM tableWHERE name LIKE '{name}'", connection)
    temp_df[temp_df["name"] == name]["id"] = id_df["id"][0] 

Warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
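
Following the warning's suggestion, I assume the fix is a single .loc assignment in place of the chained indexing, something like this (untested sketch):

for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    # One .loc call sets the value directly on df,
    # avoiding the copy-of-a-slice problem
    df.loc[df["name"] == name, "id"] = id_df["id"][0]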

1 Answer

Inspired by this SO thread, I'd suggest concatenating all of your names into a huge string, then inserting this string into a single SQL query (instead of a separate query per row) to get a DataFrame that maps each name to its ID.

Try something like this (can't test now, sorry!):

# Build a string of comma-separated, quoted names:
# https://stackoverflow.com/q/12007686
names = '"{0}"'.format('", "'.join(df['name']))

# Get a DataFrame of names and IDs
temp = pd.read_sql_query(f"SELECT name, id FROM table WHERE name IN ({names})",
                         connection)

# Build a pandas Series that maps each name to its ID
# Could also use a python dict. Also, ".values" may not be needed here
name2id = pd.Series(index=temp['name'].values, 
                    data=temp['id'].values)

# Create the ID column in your DataFrame. There are other ways (join, etc.)
df['id'] = df['name'].map(name2id)
  • Thank you, I will test this and post results. 6x fewer SQL queries is a good start! – Evan May 01 '18 at 23:13
  • 1
    This worked with some minor modifications. Since some of my results are Empty Sets (SQL) or np.nan (Python float), using a pandas join on `name` worked best. I'd rather join on a numerical field but that's not always possible. Thanks much! – Evan May 02 '18 at 02:40
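
For reference, the join mentioned in the last comment might look something like this (a sketch under the same assumptions as the answer's code, not the commenter's exact solution):

# Merge instead of map: a left join keeps every row of df and
# leaves id as NaN for names with no match in the lookup result
df = df.merge(temp, on="name", how="left")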