I have a file which has multiple values for each phone number. for example :
phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 attribute3
123445 delhi airtel 1.0 info1 info2 info3 1.1 info4 info5 info6
987654 bhopal idea 1.1 info1 info2 info3 1.4 info4 info5 info6
123445 delhi airtel 1.3 info1 info2 info3 1.0 info4 info5 info6
What my expected output is : for each phone number select minimum P1 and it's corresponding attribute values.
As my above example states that for phone number 123445 has P1 in line 1 which is lesser than P1 in line 3 (1.0 < 1.3) so I want to select attributes 1,2 and 3 from line 1 and as P2 in line 3 has lesser value ( 1.0 < 1.1) so I want to select attribute values from line 3.
Here is what I want in a tabular format:
phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 attribute3
123445 delhi airtel 1.0 info1 info2 info3 1.0 info4 info5 info6
987654 bhopal idea 1.1 info1 info2 info3 1.4 info4 info5 info6
I have 25 different priority values and each priority value has 4 different attributes so my total columns are around 125.
What I have tried so far :
- Create a Dataframe which has a phone number as a key and minimum of each priority value.
- Create another Dataframe which has values of min(Priority1) along with it's corresponding attributes for each phone number.
- Create another Dataframe which has values of min(Priority2) along with it's corresponding attributes for each phone number.
- Join these two Dataframes on Phone number to get the complete information and save this Dataframe to disk.
The problem with my approach is that it is not a good approach considering the amount of columns that I have. Please suggest me some good approach for this problem.
EDIT 1 : Here is the pastebin link of what I have done : https://pastebin.com/ps4f1KSh