
I have a table (pandas dataframe) as follows.

id    y  val1  val2  val3  ...
 1  100     3     1     2
 1  150     1     2     4
 1  250     4     2     6
 2  200     3     1     4
 2  250     2     2     2
 2  350     4     2     4
 3  200     3     3     4
 3  300     3     2     4
 3  400     6     3     3

I want to aggregate by the id, concatenate all the values in order, and then keep only the last y of each id. The resulting table would look like this:

id    y  val1  val2  val3  ...  val1  val2  val3  ...  val1  val2  val3  ...
 1  250     3     1     2  ...     1     2     4  ...     4     2     6  ...
 2  350     3     1     4  ...     2     2     2  ...     4     2     4  ...
 3  400     3     3     4  ...     3     2     4  ...     6     3     3  ...

Notice that the y value 250 comes from the last row of id=1 (in the real table there is another column I can use to specify which y I want to keep), 350 from the last row of id=2, and 400 from the last row of id=3. The values are simply concatenated into one row, in order.

I looked into pivot_table and know I can use `new_ds = pd.pivot_table(dataset, index='id')` to aggregate by the id, but I want to be selective (not including y for all rows, and in reality there is other garbage data I do not want). It is important that the values stay in order: the values from the second row come after the first, the third after the second, and so on. There are more than 100 value columns.

I've looked into pivot and groupby but can't figure out exactly how to apply them here.
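
For reference, a minimal sketch that reproduces the sample table above as a DataFrame (only the three val columns from the example; the real data has 100+ of them):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'y':    [100, 150, 250, 200, 250, 350, 200, 300, 400],
    'val1': [3, 1, 4, 3, 2, 4, 3, 3, 6],
    'val2': [1, 2, 2, 1, 2, 2, 3, 2, 3],
    'val3': [2, 4, 6, 4, 2, 4, 4, 4, 3],
})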

  • That isn't a pivot table, just a funky groupby. And you're doing horizontal `pd.concat(..., axis=1)`. You could probably get away with `df.groupby(['id','y'])`, your example doesn't show enough to see. – smci Apr 06 '22 at 11:12
  • ...so you want to `groupby('id')`, pick the last 'y' value, but horizonally concatenate the other columns `val1, val2, val3, ...`. You should attempt this yourself in code, post your code here when you get stuck. It shouldn't be too hard. – smci Apr 06 '22 at 11:15
  • Ok you want to pick the last 'y' value. – smci Apr 06 '22 at 11:18
  • I don't see how you could have the same column name multiple times in a dataframe, unless you pre- or postfix it. – DocZerø Apr 06 '22 at 11:20
  • @DocZerø Actually, ideally I want the number to continue, so val101 after val100 instead of resetting to 1, but the col names do not matter too much. – Andy Apr 06 '22 at 11:22
  • Yes @DocZero is right, please tell us clearly how the duplicate column names should look in the output, your example is illegal syntax: Do you want a letter or number suffix `col1a, col2a..., col1b, col2b...`? – smci Apr 06 '22 at 11:51
  • @smci I really do not care about the name of these columns, as long as they do not get mixed up with y. If I could choose though, I would have the values of row 2 pick up where it was left. If row 1 has {val1 ... val100}, it should immediately follow with {val101-val200} from row2. In the end, the header would be like 'id, y, val1... val10000' – Andy Apr 06 '22 at 11:59
  • Hey Andy, SO isn't a code-writing service, you really are supposed to **post your own code attempt** ([Welcome to StackOverflow](https://stackoverflow.com/help)). I've already posted 90% of the solution below, I solved the hard parts. The duplicate column name prefixing should be easy, so please attempt that yourself, post your code and output and show us where you got stuck. Search the existing Q&A, that's surely got an existing duplicate solution; if you find it, cite it here by URL. – smci Apr 06 '22 at 12:02
  • ...and if the names really are just consecutive `'val1' ... 'val100', 'val101', ...` you can define all those column-names automatically, and ignore the column-names in the input. Anyway please solve that one yourself. – smci Apr 06 '22 at 12:08

1 Answer


This isn't a pivot table, but a funky groupby, or maybe two separate groupby's.

(I wouldn't try to use `.agg()` for this either, because you want to concatenate the other columns in order, all together, but `.agg()` is really pedantic about forcing you to define an individual aggregate function for each column, which here would be a pain.)

Taking the last 'y' value in a group is easy:

df.groupby('id').agg({'y': lambda s: s.iloc[-1]})

# we avoid .tail() here because of a current pandas bug where using it on a Series throws "ValueError: Must produce aggregated value"
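
With the sample data above this keeps y = 250, 350 and 400 for ids 1, 2 and 3, as in the question. As an aside, `GroupBy.last()` on the single column should give the same result here (an equivalent spelling for data without NaNs):

df.groupby('id')['y'].last()
# id
# 1    250
# 2    350
# 3    400
# Name: y, dtype: int64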

Now, to concatenate the rows within each group consecutively, for all the other columns:

  • we actually don't even need `pd.concat([...], axis=1)` like I thought we would

  • we can do it all inside a single groupby-apply: `df.groupby('id').apply(lambda g: g.drop(columns=['id','y']).values.flatten())`

  • or better, explicitly specify which columns you do want included (and drop the grouping column inside the lambda, since it is still part of each group):

    df[['id','val1', 'val2', 'val3']].groupby('id').apply(lambda g: g.drop(columns='id').values.flatten())

    id
    1    [3, 1, 2, 1, 2, 4, 4, 2, 6]
    2    [3, 1, 4, 2, 2, 2, 4, 2, 4]
    3    [3, 3, 4, 3, 2, 4, 6, 3, 3]

or if you prefer, you can move the `.drop(columns='y')` to the front:

df.drop(columns='y').groupby('id').apply(lambda g: g.drop(columns='id').values.flatten())

As @DocZerø pointed out, we shouldn't produce duplicate column names in the output the way your example shows them (pandas tolerates duplicate names, but they are a recipe for trouble). You need to figure out how you want to add a prefix/suffix or other name-mangling to the column names.
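
If consecutive names like Andy described in the comments (`val1 ... valN`, numbering continuing across the original rows) are acceptable, a hedged sketch of that post-processing could look like this (the column names here are my own choice, not anything pandas produces for you):

value_cols = ['val1', 'val2', 'val3']   # in reality: the 100+ value columns, in their original order

# one Series of flattened values per id; returning a Series makes the combined result a DataFrame
flat = df.groupby('id').apply(lambda g: pd.Series(g[value_cols].values.flatten()))
flat.columns = [f'val{i+1}' for i in range(flat.shape[1])]   # val1, val2, ..., valN

# attach the last y per id and turn id back into a column
out = df.groupby('id').agg({'y': lambda s: s.iloc[-1]}).join(flat).reset_index()

#    id    y  val1  val2  val3  val4  val5  val6  val7  val8  val9
# 0   1  250     3     1     2     1     2     4     4     2     6
# 1   2  350     3     1     4     2     2     2     4     2     4
# 2   3  400     3     3     4     3     2     4     6     3     3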

Minor note: the pandas `.values` accessor is discouraged and will be deprecated in the future; we're supposed to start using `.to_numpy()` or `.array` instead.
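
So the flatten step above could just as well be written with the newer accessor, for example:

df.groupby('id').apply(lambda g: g.drop(columns=['id', 'y']).to_numpy().flatten())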

  • `values` is **not** deprecated, just not encouraged to use if you want to ensure having a copy or view ;) – mozway Apr 06 '22 at 11:44
  • @mozway: ok, "strongly discouraged, not yet deprecated, planned to be deprecated in future" See [0.24 Whatsnew: "Accessing the values in a Series or Index"](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#whatsnew-0240-values-api). *We haven’t removed or deprecated `Series.values` or `DataFrame.values`, but we highly recommend and using `.array` or `.to_numpy()` instead* – smci Apr 06 '22 at 11:48
  • Yes, that's the changelog I had in mind ;) +1 btw – mozway Apr 06 '22 at 11:52
  • Andy: better to use `df[['id','val1', 'val2', 'val3']].groupby('id').apply(lambda g: g.values.flatten())` then. Explicitly specify which columns you do want included. Updated the answer. – smci Apr 06 '22 at 12:46
  • Thanks, I know you said we can't legally concatenate into separate columns, but can we somehow postprocess it? For instance I got this very long cell `[0. 0.27472527 0. 1.61538462 0. 0.6043956\n 0. 0.04395604 0. 2.0989011 0.38461538 1.68131868\n 0. 0. 0. 0. 0. 0.\n ...........` with your answer. I notice there's `\n` so it might be difficult to even separate with a delimiter. – Andy Apr 06 '22 at 13:18
  • Nevermind, those `\n`'s probably do not matter. I'll try to separate it with a space or something. Thanks. – Andy Apr 06 '22 at 13:23
  • @smci regarding finding the y with `df.groupby('id').agg({'y': lambda s: s.iloc[-1]})`, can we specify with another column instead of with the position `[-1]`? Say I have another column called `key`, and I want to get the y for a specific key value. In the original question I said the last to make it simple, and it is usually the last one I want, but sometimes it isn't. – Andy Apr 06 '22 at 13:36
  • Andy: the `.iloc[-1]` is a synonym for `.tail()`, to workaround the pandas bug with series I mentioned. You mean "specify another *row* instead of [-1]". Possibly, but again, give us some more specific examples than *"usually the last one I want, but sometimes it isn't"*. I suggest you post a new question, and link it here. – smci Apr 06 '22 at 22:01
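
For that follow-up, a minimal sketch, assuming a column `key` (described in the comment above; it is not in the sample data) marks the row whose y should be kept for each id:

# hypothetical example: 'key' is the extra column from the comment, wanted_key a placeholder value
wanted_key = 'some_value'
y_per_id = df.loc[df['key'] == wanted_key].groupby('id')['y'].first()   # one y per id, from the matching row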