Apologies if this is a duplicate; I looked and couldn't find this question answered (well) online.

Let's say I have a pandas dataframe with a name column. I'd like to query a database and get the id associated with each name, and store that value in the dataframe.

In this case, I'm reading CSV files (which contain name) and querying a MySQL database to get id. I do some data cleaning and then upload the dataframe to the MySQL database (a different table than the one I query to get id).

What's the best way to go about doing this? Here's what I came up with:

df["id"] = pd.Series(
    [pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'",
            connection)["id"][0] for name in df["name"]]
    )

As I understand it, this avoids modifying df while iterating over it.

  • We iterate over the values in df["name"]
  • The SQL query returns a DataFrame
  • The ["id"][0] pulls the value of interest (id) out of the DataFrame and stores it in a list
  • This list of values is converted to a pd.Series
  • This pd.Series is added to the original df

This works, but it doesn't really force an association between name and id, which (I think) would be more robust. (And, obviously, there's no error handling in the code snippet above.)
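For example, something like the following would keep each id explicitly keyed by its name and handle names with no match (a rough, untested sketch; fetch_id is just a name I made up, and table/connection are the same placeholders as above):

import numpy as np
import pandas as pd

def fetch_id(name):
    # One query per name, as in the snippet above
    result = pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    # Fall back to NaN if the name isn't in the table
    return result["id"][0] if len(result) else np.nan

# Build an explicit name -> id mapping, then apply it, so each id
# is tied to its name rather than relying on row order
name2id = {name: fetch_id(name) for name in df["name"].unique()}
df["id"] = df["name"].map(name2id)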

I tried combinations of apply, map, iterrows, and iteritems, none of which worked well. For example:

for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM tableWHERE name LIKE '{name}'", connection)
    temp_df[temp_df["name"] == name]["id"] = id_df["id"][0] 

Warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
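
Following the warning's suggestion, I assume the fix is a single .loc assignment in place of the chained indexing, something like this (untested sketch):

for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    # One .loc call sets the value directly on df,
    # avoiding the copy-of-a-slice problem
    df.loc[df["name"] == name, "id"] = id_df["id"][0]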

1 Answer

Inspired by this SO thread, I'd suggest concatenating all of your names into a huge string, then inserting this string into a single SQL query (instead of a separate query per row) to get a DataFrame that maps each name to its ID.

Try something like this (can't test now, sorry!):

# Build a string of comma-separated, quoted names:
# https://stackoverflow.com/q/12007686
names = '"{0}"'.format('", "'.join(df['name']))

# Get a DataFrame of names and IDs
temp = pd.read_sql_query(f"SELECT name, id FROM table WHERE name IN ({names})",
                         connection)

# Build a pandas Series that maps each name to its ID
# Could also use a python dict. Also, ".values" may not be needed here
name2id = pd.Series(index=temp['name'].values, 
                    data=temp['id'].values)

# Create the ID column in your DataFrame. There are other ways (join, etc.)
df['id'] = df['name'].map(name2id)
  • Thank you, I will test this and post results. 6x fewer SQL queries is a good start! – Evan May 01 '18 at 23:13
  • 1
    This worked with some minor modifications. Since some of my results are Empty Sets (SQL) or np.nan (Python float), using a pandas join on `name` worked best. I'd rather join on a numerical field but that's not always possible. Thanks much! – Evan May 02 '18 at 02:40
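
For reference, the join mentioned in the last comment might look something like this (a sketch under the same assumptions as the answer's code, not the commenter's exact solution):

# Merge instead of map: a left join keeps every row of df and
# leaves id as NaN for names with no match in the lookup result
df = df.merge(temp, on="name", how="left")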