3

I have a below data frame

df = pd.DataFrame([['NY','R',1],
                  ['NJ','Y',12],
                  ['FL','B',20],
                  ['CA','B',40],
                  ['AZ','Y',51],
                  ['NY','R',2],
                  ['NJ','Y',18],
                  ['FL','B',30],
                  ['CA','B',20],
                  ['AZ','Y',45],
                  ['NY','Y',3],
                  ['NJ','R',15],
                  ['FL','R',10],
                  ['CA','R',70],
                  ['AZ','B',25],
                  ['NY','B',4],
                  ['NJ','B',17],
                  ['FL','Y',30],
                  ['CA','R',30],
                  ['AZ','B',75],
                  ['FL','R',5],
                  ['FL','Y',25],
                  ['NJ','R',14],
                  ['NJ','B',11],
                  ['NY','B',5],
                  ['NY','Y',7]],
                    columns = ['State', 'ID','data'])


State ID  data
0     NY  R     1
1     NJ  Y    12
2     FL  B    20
3     CA  B    40
4     AZ  Y    51
5     NY  R     2
6     NJ  Y    18
7     FL  B    30
8     CA  B    20
9     AZ  Y    45
10    NY  Y     3
11    NJ  R    15
12    FL  R    10
13    CA  R    70
14    AZ  B    25
15    NY  B     4
16    NJ  B    17
17    FL  Y    30
18    CA  R    30
19    AZ  B    75
20    FL  R     5
21    FL  Y    25
22    NJ  R    14
23    NJ  B    11
24    NY  B     5
25    NY  Y     7

What I want to do: to re-create a new data frame such that it only contains the smallest number from eachID of the state. For example: for State: NY and ID: R, there are 2 data: 1 and 2. The new dataframe will only take value: 1 for category State: NY and ID: R. The new data frame shall preferably look like this:

  State  dataR  dataB  dataY
0    NY    1.0      4    3.0
1    NJ   14.0     11   12.0
2    FL    5.0     20   25.0
3    CA   30.0     20    NaN
4    AZ    NaN     25   45.0

Note that: State AZ and CA does not have any value (NaN) for column dataR and dataY, respectively, in the result because they originally have no such value in the original data frame. Please also note that the column in the result becomes dataR, dataB and dataY. I aim to create these columns in the result such that one can easily read the result later on in the actual data.

AND: I also want to be flexible such that I can seek for minimum value of each State in data per each ID R&Y together and B, so the new data frame will look like:

  State  dataRY  dataB
0    NY       1      4
1    NJ      12     11
2    FL       5     20
3    CA      30     20
4    AZ      45     25

I tried using for loops as below:

colours = [['R'],['B'],['Y']]

def rearranging(df):
    df_result = []
    for c in colours:
        df_colours          = df[df['ID'].isin(c)]
        df_colours_result   = []
        for state in np.unique(df['State'].values):
            df1     = df_colours[df_colours['State'] == state]
            df2     = df1.nsmallest(1,'data')
            df_colours_result.append(df2)
        first_loop_result = pd.concat(df_colours_result,ignore_index = True, sort = False)
        df_result.append(first_loop_result)
    final_result = pd.concat(df_result, axis = 1)
    return final_result

The variable colours should be there because I want to be flexible and such that I can change their values if the data source changes later in the time.

And the result of the above for loop is:

  State   ID  data State ID  data State   ID  data
0    CA    R  30.0    AZ  B    25    AZ    Y  45.0
1    FL    R   5.0    CA  B    20    FL    Y  25.0
2    NJ    R  14.0    FL  B    20    NJ    Y  12.0
3    NY    R   1.0    NJ  B    11    NY    Y   3.0
4   NaN  NaN   NaN    NY  B     4   NaN  NaN   NaN

I don't like my result because: it is hard to read and I need to re-arrange and rename the columns again. Is there anyway to get the result that I actually aim at the above by using for loops? Vectorization is also welcome.

Please also be informed (once again) that I also want to be flexible on the column ID. This is the reason for me to include for example I want to say I need to see the smallest value of data for each State for ID R&Y combined and ID B. In my attempt I simply alter the code as below with the loop stays the same:

colours = [['R','Y'],['B']]


And the result is:

  State ID  data State ID  data
0    AZ  Y    45    AZ  B    25
1    CA  R    30    CA  B    20
2    FL  R     5    FL  B    20
3    NJ  Y    12    NJ  B    11
4    NY  R     1    NY  B     4

Note: in comparison and if there is NaN, then NaN is simply ignored (and not treated as Zero).

Once again the result is not the same as what I aim for and this table is not informative enough.

teteh May
  • 455
  • 2
  • 11
  • 1
    This is a `pivot`. `df.pivot_table(values='data', index='State', columns='ID', aggfunc='min').add_prefix('data_')` – piRSquared Nov 26 '19 at 17:17
  • right, i didnt think hard enough, this is a pivot problem you can follow what @piRSquared said:) – anky Nov 26 '19 at 17:20
  • 1
    @anky_91 I'm fine with leaving your answer. I was just about to comment on it. OP: anky_91 had a `groupby` solution in which they used `sort=False`. That provided the same calculation except it also preserved the order of the rows. Their solution matched your expected results exactly. – piRSquared Nov 26 '19 at 17:23
  • thanks for the input. But is there any possibility for me to be flexible on the 'ID'? For example: I still want to use State as index, but for the column ID I will include Y and R together as one group, so now I look for the minimum value in column data based for each State and for each Y and R together and each B? So now the result will only consists of dataYR and dataB? Thanks. – teteh May Nov 26 '19 at 17:25
  • 1
    That seems to be a different question. There are several ways to do this. I'll show you one way using `replace` ahead of the same `pivot_table`. `df.replace({'ID': {'Y|R': 'Y_R'}}, regex=True).pivot_table(values='data', index='State', columns='ID', aggfunc='min').add_prefix('data_')` – piRSquared Nov 26 '19 at 17:29
  • 1
    you mean `df.assign(ID=df['ID'].replace(['Y','R'],'YR')).groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_')` ? – anky Nov 26 '19 at 17:29
  • @tetehMay If you [edit] that into your question, I'll reopen it. Let me know if you do. – piRSquared Nov 26 '19 at 17:32
  • hi piRSquared, thanks for referring me to link for pivot table, I need it as well. However, as advised, I also alter my questions. Thanks. – teteh May Nov 27 '19 at 08:50

1 Answers1

3

IIUC, use groupby() on State and ID and get min of data column, and unstack(add_prefix) if required.:

df.groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_')

ID     data_R  data_Y  data_B
State                        
NY        1.0     3.0     4.0
NJ       14.0    12.0    11.0
FL        5.0    25.0    20.0
CA       30.0     NaN    20.0
AZ        NaN    45.0    25.0

EDIT: As requested by OP , if you want to merge Y and R together, just replace and do similar:

(df.assign(ID=df['ID'].replace(['Y','R'],'YR'))
    .groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_'))

ID     data_YR  data_B
State                 
NY           1       4
NJ          12      11
FL           5      20
CA          30      20
AZ          45      25
anky
  • 74,114
  • 11
  • 41
  • 70
  • 1
    Note the use of `sort=False`. By default, Pandas sorts the resulting, recombined dataframe by the keys in which it used to group. If, instead, you'd like to keep the groups in the order they first appear, use `sort=False` – piRSquared Nov 26 '19 at 17:25