I have a large dataset with over 1.5 million rows and several columns, one of which contains an ID for every observation. I would like to gather all rows that share the same ID and restructure them into a single row per ID in a new dataframe (or reshape the original dataframe in place). I've tried merging but couldn't quite make it work, and checking for duplicates takes forever.
Does anyone know of a time-efficient way to reshape these rows?
The original dataframe looks similar to this:
id | attribute | value |
---|---|---|
1 | attribute 1 | value 1 |
1 | attribute 2 | value 2 |
1 | attribute 3 | value 3 |
2 | attribute 1 | value 1 |
2 | attribute 2 | value 2 |
The result should ideally look like this:
id | attribute 1 | attribute 2 | attribute 3 |
---|---|---|---|
1 | value 1 | value 2 | value 3 |
2 | value 1 | value 2 | NA |
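For reference, here is a minimal reproducible sketch of the input and the wide shape I'm after. I'm assuming pandas here (the question is about a generic dataframe), and I'm using `DataFrame.pivot` only to illustrate the desired output, not claiming it's the fastest option at this scale:

```python
import pandas as pd

# Small stand-in for the real 1.5M-row dataframe (same structure as the sample above)
df = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "attribute": ["attribute 1", "attribute 2", "attribute 3",
                  "attribute 1", "attribute 2"],
    "value":     ["value 1", "value 2", "value 3", "value 1", "value 2"],
})

# Long-to-wide reshape: one row per id, one column per attribute;
# missing (id, attribute) combinations become NaN, matching the NA in the example
wide = df.pivot(index="id", columns="attribute", values="value")
print(wide)
```

Note that `pivot` raises if an (id, attribute) pair occurs more than once; in that case `pivot_table` with an aggregation function would be needed instead.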