2

I have two dataframes, one main one that I work with and an auxiliary one that I want to bring info in from.

df1 (main) contains a Reporter column with various name strings.

df2 (additional information) contains the reporter name and their location.

I want the location column added as a new column in df1.

I can do the following as a one off with:

df1 = pd.merge(df1, df2, on='Reporter', how='left')

and it works.

My problem is I run a frequently updating script (checking for new rows and checking for updates on old rows) and running this line of code repeatedly adds multiple columns for each execution.

  • The trouble with just checking if the column exists is that a new row (that contains a new reporter name) may have been added to the df that I DO want to know/update the location of.

  • Am I going about this the right way? Or should I do some sort of dict lookup and conditionally map the location each time? How can I do that in pandas?

jpp
  • 159,742
  • 34
  • 281
  • 339
swifty
  • 1,182
  • 1
  • 15
  • 36

3 Answers3

2

The problem with pd.merge is it always adds suffixes to the result.

Don't use it.

Use the equivalent DataFrame.join or DataFrame.merge on the one you're doing the LEFT JOIN on.

df1 = df1.join(df2, on='Reporter', how='left')

Or,

df1 = df1.merge(df2, on='Reporter', how='left')

The common column is only retained once in the result. Other, identical columns between the two (if you have them) are duplicated (with suffixes), however.

cs95
  • 379,657
  • 97
  • 704
  • 746
2

For a single series, I often prefer to map a series rather than merge or join (both also valid options, if used correctly).

Note I assume you have already applied str conversion to all columns as appropriate.

s = df2.set_index('Reporter')['Reporter Location']

df['Reporter Location'] = df['Reporter'].map(s)

In general, there is no need for converting your series to dict or for using zip on pd.Series objects. Using zip with numpy arrays can be inefficient.

jpp
  • 159,742
  • 34
  • 281
  • 339
0

Thanks for the info coldspeed but I went with the second option, mapping the dict.

reporter_dict = dict(zip(df2['Reporter'].astype(str),df2['Reporter Location']))
df['Reporter Location'] = df['Reporter'].astype(str).map(reporter_dict)
swifty
  • 1,182
  • 1
  • 15
  • 36