
I have a single dataframe containing an ID column id, and I know that the ID will exist in either exactly one row ('mismatched') or exactly two rows ('matched') of the dataframe.

  • In order to select the mismatched rows and the pairs of matched rows, I can use a groupby on the ID column.
  • Now, for each group, I want to take some columns from the second (pair) row, rename them, and copy them into the first row. I can then discard all the second rows and return a single dataframe containing all the modified first rows (one per group).
  • Where there is no second row (mismatched), it's fine to put NaN in its place.

To illustrate this, see the table below: id=1 and id=3 are matched pairs, but id=2 is mismatched:

entity id partner value
A      1  B       200
B      1  A       300
A      2  B       600
B      3  C       350
C      3  B       200

The resulting transformation should leave me with the following:

entity id partner entity_value partner_value
A      1  B       200          300
A      2  B       600          NaN
B      3  C       350          200

What's baffling me is how to come up with a generic way of getting the matching partner_value from the second row copied into the first row after the groupby, in a way that also works when there is no second row.
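For reproducibility, the input table above can be constructed like this (values transcribed from the table):

import pandas as pd

df = pd.DataFrame({
    'entity':  ['A', 'B', 'A', 'B', 'C'],
    'id':      [1, 1, 2, 3, 3],
    'partner': ['B', 'A', 'B', 'C', 'B'],
    'value':   [200, 300, 600, 350, 200],
})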

Phil

  • Please post reproducible data and code rather than just a text dump of the dataframe. – smci Jun 21 '19 at 21:44

3 Answers


Solution (this was tricky):

import pandas as pd
import numpy as np

dfg = df.groupby('id', sort=False)

# Create 'entity','id','partner','entity_value' from the first row...
df2 = dfg[['entity', 'id', 'partner', 'value']].first().rename(columns={'value': 'entity_value'})

# Now insert 'partner_value' from those groups that have a second row...
df2['partner_value'] = np.nan
df2['partner_value'] = dfg['value'].nth(n=1)

   entity  id partner  entity_value  partner_value
id                                                
1       A   1       B           200          300.0
2       A   2       B           600            NaN
3       B   3       C           350          200.0
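
One optional cleanup (my addition, not in the original answer): the result above carries id both as the index and as a column, so to match the layout the question asked for you could drop the index:

df2 = df2.reset_index(drop=True)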

This was tricky to get working. The short answer is that although pd.groupby(...).agg(...) in principle allows you to specify a list of (column, aggregate_function) tuples, which you could then chain into a rename, that won't work here: we're trying to do two separate aggregate operations, both on the value column, and rename both their results (you get pandas.core.base.SpecificationError: Function names must be unique, found multiple named value).

Other complications:

  • We can't directly use groupby.nth(n), which sounds useful at first glance, because it silently drops groups which don't have an n'th element, which is not what we want here. (But it does keep the group keys as the index, so we can still use it by first initializing the column as all-NaNs, then assigning onto that column, as above.)
  • In any case, the pd.groupby.agg() syntax won't even let you call nth() by just passing 'nth' as the agg_func name, since nth() would be missing its n argument; you'd have to declare a lambda.
  • I tried defining the following function second_else_nan to use inside an agg() as above, but after much struggling I couldn't get it to work, for multiple reasons, only one of which is that you can't do two aggs on the same column:

Code:

import numpy as np

def second_else_nan(v):
    if v.size == 2:
        return v.iloc[1]  # positional indexing: groups keep the original row labels
    else:
        return np.nan

(i.e. the Series equivalent of the dict.get(key, default) builtin)
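
As a footnote, a sketch of my own (an assumption, not something the original answer verified): second_else_nan does work per-group when called via apply on the 'value' Series, just not inside agg():

# apply() passes each group's 'value' Series to second_else_nan;
# the result is indexed by 'id', so it aligns with df2
df2['partner_value'] = df.groupby('id', sort=False)['value'].apply(second_else_nan)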

smci

I would do it this way. First, get the first row of each group:

df_grouped = df.reset_index().groupby('id').agg("first")

Then retrieve the values that are duplicated and insert them:

df_grouped["partner_value"] = df.groupby("id")["value"].agg("last")

The only caveat is that when an id is not duplicated you get its own value repeated (instead of a NaN), since "first" and "last" then refer to the same row; see the sketch below for one way to blank those out.
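
A possible follow-up (my sketch, not part of the original answer): use the group sizes to overwrite partner_value with NaN for the unmatched ids.

import numpy as np

# groups of size 1 are the unmatched ids; both Series align on the 'id' index
sizes = df.groupby("id").size()
df_grouped.loc[sizes == 1, "partner_value"] = np.nan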

Nakor

What about something like this?

import numpy as np

grouped = df.groupby("id")
first_values = grouped.agg("first")
sums = grouped.agg("sum")
first_values["partner_value"] = sums["value"] - first_values["value"]
first_values["partner_value"].replace(0, np.nan, inplace=True)

transformed_df = first_values.copy()

Group the data by id, take the first row of each group, take the sum of the 'value' column for each group, and subtract the first row's 'value' from that sum. Then replace 0s in the resulting column with np.nan (making the assumption here that data in the 'value' column is never 0). A worked trace on the question's example data follows.
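
Tracing the arithmetic on the example data from the question:

id=1: sum = 200 + 300 = 500; first = 200; partner_value = 500 - 200 = 300
id=2: sum = 600;             first = 600; partner_value = 600 - 600 = 0 -> NaN
id=3: sum = 350 + 200 = 550; first = 350; partner_value = 550 - 350 = 200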

AndrewH