0

I want to create a column based on a group and threshold for cutoff from another column for each group of the grouped column.

The dataframe is below:

     df_in -> 

       unique_id  myvalue identif
   0      CTA15   19.0     TOP
   1      CTA15   22.0     TOP
   2      CTA15   28.0     TOP
   3      CTA15   18.0     TOP
   4      CTA15   22.4     TOP
   5      AC007    2.0     TOP
   6      AC007    2.3    SDME
   7      AC007    2.0    SDME
   8      AC007    5.0    SDME
   9      AC007    3.0    SDME
   10     AC007   31.4    SDME
   11     AC007    4.4    SDME
   12      CGT6    9.7    BTME
   13      CGT6   44.5    BTME
   14      TVF5    6.7    BTME
   15      TVF5    9.1    BTME
   16      TVF5   10.0    BTME
   17      BGD1    1.0    BTME
   18      BGD1    1.6     NON
   19       GHB   51.0     NON
   20       GHB   54.0     NON
   21       GHB    4.7     NON

So I have created a dictionary based on each group of the 'identif' column as :

    md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}

So my goal is to create a new column, say 'chk', based on the following condition:

If the "identif" column matches the key in the dictionary "md" and the value for that key is >= than the corresponding value in the "myvalue" column then I will have 1, otherwise 0.

However, I am trying to find a good way using map/groupby/apply to create the new output data frame. I am now doing a very inefficient way ( which is taking considerable time on real data of million rows) using a function as follows:

 def myfilter(df, idCol, valCol, mydict):

   for index,row in df.iterrows():
      for key, value in mydict.items(): 
         if row[idCol] == key and row[valCol] >= value:
             df['chk'] = 1
         elif row[idCol] == key and row[valCol] < value:
             df['chk'] = 0

  return df

Getting the output via the following call:

     df_out = myfilter(df_in, 'identif', 'myvalue', md)

So my output will be like:

     df_out ->

         unique_id  myvalue identif  chk
    0      CTA15     19.0     TOP    0
    1      CTA15     22.0     TOP    1
    2      CTA15     28.0     TOP    1
    3      CTA15     18.0     TOP    0
    4      CTA15     22.4     TOP    1
    5      AC007      2.0     TOP    0
    6      AC007      2.3    SDME    0
    7      AC007      2.0    SDME    0
    8      AC007      5.0    SDME    0
    9      AC007      3.0    SDME    0
    10     AC007     31.4    SDME    1
    11     AC007      4.4    SDME    0
    12      CGT6      9.7    BTME    0
    13      CGT6     44.5    BTME    1
    14      TVF5      6.7    BTME    0
    15      TVF5      9.1    BTME    0
    16      TVF5     10.0    BTME    0
    17      BGD1      1.0    BTME    0
    18      BGD1      1.6     NON    0
    19       GHB     51.0     NON    1
    20       GHB     54.0     NON    1
    21       GHB      4.7     NON    0

This works but extremely inefficient and would like a much better way to do it.

Seanny123
  • 8,776
  • 13
  • 68
  • 124
Stan
  • 786
  • 1
  • 9
  • 25
  • What you're looking for is to apply a function to all your df. Have you check for https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html – pablora Jan 29 '19 at 17:14

2 Answers2

1

First, you're traversing your dataset four times total, for each row in the data frame you're traversing every element in your dictionary. You can change your function to traverse it once instead. This will speed it up your original function. Try something like:

def myfilter(df, idCol, valCol, mydict):

    for index,row in df.iterrows():
        value = mydict.get(row[idCol])
        if row[valCol] >= value:
            df['chk'] = 1
        else:
            df['chk'] = 0

    return df
Rachel
  • 336
  • 1
  • 5
  • This is not working as it does not make sure that the idCol matches with the dictionary key and also the corresponding. I need both the conditions to be satisfied. Your code makes sure that only myvalue criterion is satisfied. – Stan Jan 29 '19 at 17:38
  • This code picks the value from your dictionary based on the `idCol`. You don't have to traverse the entire dictionary to do that. The dictionary hashes the key and returns the value associated with that key if one exists. It returns a value of `None` if no key exists. So, checking that the `idCol` matches is redundant. This is just to improve performance without using apply. Although, apply may be what you want. – Rachel Jan 29 '19 at 17:47
1

This should be faster:

def func(identif, value):
    if identif in md:
        if value >= md[identif]:
            return 1.0
        else:
            return 0.0
    else:
        return np.NaN

df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)

The timing on this little example:

CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms

Your version timing:

CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms

Although on such a small example it's not conclusive.

godfryd
  • 582
  • 3
  • 16
  • Thanks. I think this is much much better. How can I modify your function and apply the dict as an argument, like: def func(identif, value, md): ?? How will the apply function change in that case. – Stan Jan 29 '19 at 17:42
  • well just pass it as an argument, like so `df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue'], md), axis=1)` – godfryd Jan 29 '19 at 17:44
  • Perfect!! Works like a charm! Thanks – Stan Jan 29 '19 at 17:55