I have a below data frame
df = pd.DataFrame([['NY','R',1],
['NJ','Y',12],
['FL','B',20],
['CA','B',40],
['AZ','Y',51],
['NY','R',2],
['NJ','Y',18],
['FL','B',30],
['CA','B',20],
['AZ','Y',45],
['NY','Y',3],
['NJ','R',15],
['FL','R',10],
['CA','R',70],
['AZ','B',25],
['NY','B',4],
['NJ','B',17],
['FL','Y',30],
['CA','R',30],
['AZ','B',75],
['FL','R',5],
['FL','Y',25],
['NJ','R',14],
['NJ','B',11],
['NY','B',5],
['NY','Y',7]],
columns = ['State', 'ID','data'])
State ID data
0 NY R 1
1 NJ Y 12
2 FL B 20
3 CA B 40
4 AZ Y 51
5 NY R 2
6 NJ Y 18
7 FL B 30
8 CA B 20
9 AZ Y 45
10 NY Y 3
11 NJ R 15
12 FL R 10
13 CA R 70
14 AZ B 25
15 NY B 4
16 NJ B 17
17 FL Y 30
18 CA R 30
19 AZ B 75
20 FL R 5
21 FL Y 25
22 NJ R 14
23 NJ B 11
24 NY B 5
25 NY Y 7
What I want to do: to re-create a new data frame such that it only contains the smallest number from eachID of the state. For example: for State: NY and ID: R, there are 2 data: 1 and 2. The new dataframe will only take value: 1 for category State: NY and ID: R. The new data frame shall preferably look like this:
State dataR dataB dataY
0 NY 1.0 4 3.0
1 NJ 14.0 11 12.0
2 FL 5.0 20 25.0
3 CA 30.0 20 NaN
4 AZ NaN 25 45.0
Note that: State AZ and CA does not have any value (NaN) for column dataR and dataY, respectively, in the result because they originally have no such value in the original data frame. Please also note that the column in the result becomes dataR, dataB and dataY. I aim to create these columns in the result such that one can easily read the result later on in the actual data.
AND: I also want to be flexible such that I can seek for minimum value of each State in data per each ID R&Y together and B, so the new data frame will look like:
State dataRY dataB
0 NY 1 4
1 NJ 12 11
2 FL 5 20
3 CA 30 20
4 AZ 45 25
I tried using for loops as below:
colours = [['R'],['B'],['Y']]
def rearranging(df):
df_result = []
for c in colours:
df_colours = df[df['ID'].isin(c)]
df_colours_result = []
for state in np.unique(df['State'].values):
df1 = df_colours[df_colours['State'] == state]
df2 = df1.nsmallest(1,'data')
df_colours_result.append(df2)
first_loop_result = pd.concat(df_colours_result,ignore_index = True, sort = False)
df_result.append(first_loop_result)
final_result = pd.concat(df_result, axis = 1)
return final_result
The variable colours should be there because I want to be flexible and such that I can change their values if the data source changes later in the time.
And the result of the above for loop is:
State ID data State ID data State ID data
0 CA R 30.0 AZ B 25 AZ Y 45.0
1 FL R 5.0 CA B 20 FL Y 25.0
2 NJ R 14.0 FL B 20 NJ Y 12.0
3 NY R 1.0 NJ B 11 NY Y 3.0
4 NaN NaN NaN NY B 4 NaN NaN NaN
I don't like my result because: it is hard to read and I need to re-arrange and rename the columns again. Is there anyway to get the result that I actually aim at the above by using for loops? Vectorization is also welcome.
Please also be informed (once again) that I also want to be flexible on the column ID. This is the reason for me to include for example I want to say I need to see the smallest value of data for each State for ID R&Y combined and ID B. In my attempt I simply alter the code as below with the loop stays the same:
colours = [['R','Y'],['B']]
And the result is:
State ID data State ID data
0 AZ Y 45 AZ B 25
1 CA R 30 CA B 20
2 FL R 5 FL B 20
3 NJ Y 12 NJ B 11
4 NY R 1 NY B 4
Note: in comparison and if there is NaN, then NaN is simply ignored (and not treated as Zero).
Once again the result is not the same as what I aim for and this table is not informative enough.