I have a very large data set (about 600,000 rows). I want to reduce the number of rows of data by creating an array in the last column when the first 4 columns are the same.
make year model engine part
alfa romeo 1960 giulietta 1.3l l4 A
alfa romeo 1958 giulietta 1.3l l4 B
alfa romeo 1958 giulietta 1.3l l4 A
alfa romeo 1957 giulietta 1.3l l4 B
alfa romeo 1957 giulietta 1.3l l4 A
alfa romeo 1956 giulietta 1.3l l4 B
alfa romeo 1956 giulietta 1.3l l4 A
alfa romeo 1954 giulietta 1.3l l4 B
alfa romeo 1954 giulietta 1.3l l4 A
alfa romeo 1955 giulietta 1.3l l4 B
alfa romeo 1955 giulietta 1.3l l4 A
Desired output:
make year model engine part
alfa romeo 1960 giulietta 1.3l l4 [A]
alfa romeo 1958 giulietta 1.3l l4 [A,B]
alfa romeo 1957 giulietta 1.3l l4 [A,B]
alfa romeo 1956 giulietta 1.3l l4 [A,B]
alfa romeo 1955 giulietta 1.3l l4 [A,B]
alfa romeo 1954 giulietta 1.3l l4 [A,B]
I was thinking I would be able to use dataframe.groupby
to obtain my desired output, but I was unable to through multiple attempts. I kept receiving a form of the following output <pandas.core.groupby.generic.DataFrameGroupBy object at xxx>
.
Any help would be greatly appreciated!