Create an array in a column from rows with duplicate data

Question

I have a very large data set (about 600,000 rows). I want to reduce the number of rows of data by creating an array in the last column when the first 4 columns are the same.

      make  year      model          engine            part
alfa romeo  1960  giulietta         1.3l l4             A
alfa romeo  1958  giulietta         1.3l l4             B
alfa romeo  1958  giulietta         1.3l l4             A
alfa romeo  1957  giulietta         1.3l l4             B
alfa romeo  1957  giulietta         1.3l l4             A
alfa romeo  1956  giulietta         1.3l l4             B
alfa romeo  1956  giulietta         1.3l l4             A
alfa romeo  1954  giulietta         1.3l l4             B
alfa romeo  1954  giulietta         1.3l l4             A
alfa romeo  1955  giulietta         1.3l l4             B
alfa romeo  1955  giulietta         1.3l l4             A

Desired output:

      make  year      model          engine            part
alfa romeo  1960  giulietta         1.3l l4            [A]
alfa romeo  1958  giulietta         1.3l l4            [A,B]
alfa romeo  1957  giulietta         1.3l l4            [A,B]
alfa romeo  1956  giulietta         1.3l l4            [A,B]
alfa romeo  1955  giulietta         1.3l l4            [A,B]
alfa romeo  1954  giulietta         1.3l l4            [A,B]

I thought I could use dataframe.groupby to get the desired output, but none of my attempts worked. I kept getting some form of the following output: <pandas.core.groupby.generic.DataFrameGroupBy object at xxx>.

Any help would be greatly appreciated!

r-beginners · Accepted Answer · 2020-08-10T05:41:09.683

2

Group them together and make a list of their contents.

df.groupby(['make', 'year', 'model', 'engine']).agg(list).reset_index()


         make  year      model   engine    part
0  alfa romeo  1954  giulietta  1.3l l4  [B, A]
1  alfa romeo  1955  giulietta  1.3l l4  [B, A]
2  alfa romeo  1956  giulietta  1.3l l4  [B, A]
3  alfa romeo  1957  giulietta  1.3l l4  [B, A]
4  alfa romeo  1958  giulietta  1.3l l4  [B, A]
5  alfa romeo  1960  giulietta  1.3l l4     [A]
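
The lists above keep the parts in row order (`[B, A]`), while the desired output in the question shows `[A,B]`. Here is a minimal sketch of one way to get sorted, de-duplicated lists, assuming the column names from the question. The sample frame and the names `keys` and `result` are made up for illustration, and the descending year sort is only there to mimic the row order shown in the question:

```python
import pandas as pd

# Small illustrative sample with the same columns as the question.
df = pd.DataFrame({
    "make": ["alfa romeo"] * 5,
    "year": [1960, 1958, 1958, 1957, 1957],
    "model": ["giulietta"] * 5,
    "engine": ["1.3l l4"] * 5,
    "part": ["A", "B", "A", "B", "A"],
})

keys = ["make", "year", "model", "engine"]

# groupby() on its own only returns a DataFrameGroupBy object;
# an aggregation such as agg() is what produces an actual result.
result = (
    df.groupby(keys)["part"]
      .agg(lambda s: sorted(set(s)))      # unique parts, alphabetical order
      .reset_index()                      # turn the group keys back into columns
      .sort_values("year", ascending=False)
      .reset_index(drop=True)             # fresh 0..n-1 index after sorting
)

print(result)
```

Using `set` drops repeated parts within a group; replace `sorted(set(s))` with `sorted(s)` if repeats should be kept.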

edited Aug 10 '20 at 05:41

answered Aug 10 '20 at 04:46


Thanks! This actually does the job too, but I do need to keep an index here. Is there a way to do that? – Stephen Aug 10 '20 at 05:09
Include `.reset_index()` at the end – thorntonc Aug 10 '20 at 05:10
Ahh perfect, should've tried that before asking. Thanks! – Stephen Aug 10 '20 at 05:13

thorntonc · Answer 2 · 2020-08-10T05:08:47.060

1

You can group the rows and then join the parts of each group into a single comma-separated string.

df = df.groupby(['make', 'year', 'model', 'engine'])['part'].apply(','.join).reset_index()

Sample output:

         make  year      model   engine part
0  alfa romeo  1957  giulietta  1.3l l4  B,A
1  alfa romeo  1958  giulietta  1.3l l4  B,A
2  alfa romeo  1960  giulietta  1.3l l4    A
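
If a string column is preferred over lists, the parts can also be sorted and de-duplicated before joining, and split back into lists later if needed. This is a small sketch under the same assumptions as the question's column names; the sample data and the names `keys`, `joined`, and `part_list` are illustrative and not part of the original answer:

```python
import pandas as pd

# Small illustrative sample with the same columns as the question.
df = pd.DataFrame({
    "make": ["alfa romeo"] * 5,
    "year": [1960, 1958, 1958, 1957, 1957],
    "model": ["giulietta"] * 5,
    "engine": ["1.3l l4"] * 5,
    "part": ["A", "B", "A", "B", "A"],
})

keys = ["make", "year", "model", "engine"]

joined = (
    df.groupby(keys)["part"]
      # unique parts per group, sorted, joined into one string
      .agg(lambda s: ",".join(sorted(s.unique())))
      .reset_index()
)

# Optional: recover real Python lists from the joined strings.
joined["part_list"] = joined["part"].str.split(",")

print(joined)
```

Note that `','.join` only works when every value in `part` is a string; a column with numbers or missing values would need converting first (for example with `.astype(str)`).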

edited Aug 10 '20 at 05:08

answered Aug 10 '20 at 04:42


I like this idea, but I'm getting the following error `AttributeError: 'function' object has no attribute 'transform'` – Stephen Aug 10 '20 at 04:48
@Stephen Try my edit – thorntonc Aug 10 '20 at 04:50
Yes, this works great and also removes the duplicate rows. Only change would be adding `'engine'` to the groupby. Thank you! – Stephen Aug 10 '20 at 05:06
