0

I have a very large data set (about 600,000 rows). I want to reduce the number of rows of data by creating an array in the last column when the first 4 columns are the same.

      make  year      model          engine            part
alfa romeo  1960  giulietta         1.3l l4             A
alfa romeo  1958  giulietta         1.3l l4             B
alfa romeo  1958  giulietta         1.3l l4             A
alfa romeo  1957  giulietta         1.3l l4             B
alfa romeo  1957  giulietta         1.3l l4             A
alfa romeo  1956  giulietta         1.3l l4             B
alfa romeo  1956  giulietta         1.3l l4             A
alfa romeo  1954  giulietta         1.3l l4             B
alfa romeo  1954  giulietta         1.3l l4             A
alfa romeo  1955  giulietta         1.3l l4             B
alfa romeo  1955  giulietta         1.3l l4             A

Desired output:

      make  year      model          engine            part
alfa romeo  1960  giulietta         1.3l l4            [A]
alfa romeo  1958  giulietta         1.3l l4            [A,B]
alfa romeo  1957  giulietta         1.3l l4            [A,B]
alfa romeo  1956  giulietta         1.3l l4            [A,B]
alfa romeo  1955  giulietta         1.3l l4            [A,B]
alfa romeo  1954  giulietta         1.3l l4            [A,B]

I was thinking I would be able to use dataframe.groupby to obtain my desired output, but I was unable to through multiple attempts. I kept receiving a form of the following output <pandas.core.groupby.generic.DataFrameGroupBy object at xxx>.

Any help would be greatly appreciated!

Stephen
  • 101
  • 1
  • 8

2 Answers2

2

Group them together and make a list of their contents.

df.groupby(['make', 'year', 'model', 'engine']).agg(list).reset_index()


make    year    model   engine  part
0   alfa romeo  1954    giulietta   1.3l l4 [B, A]
1   alfa romeo  1955    giulietta   1.3l l4 [B, A]
2   alfa romeo  1956    giulietta   1.3l l4 [B, A]
3   alfa romeo  1957    giulietta   1.3l l4 [B, A]
4   alfa romeo  1958    giulietta   1.3l l4 [B, A]
5   alfa romeo  1960    giulietta   1.3l l4 [A]
r-beginners
  • 31,170
  • 3
  • 14
  • 32
1

You can group then make a list of parts.

df = df.groupby(['make', 'year', 'model', 'engine'])['part'].apply(','.join).reset_index()

Sample output:

         make  year      model   engine part
0  alfa romeo  1957  giulietta  1.3l l4  B,A
1  alfa romeo  1958  giulietta  1.3l l4  B,A
2  alfa romeo  1960  giulietta  1.3l l4    A
thorntonc
  • 2,046
  • 1
  • 8
  • 20