2

So here is an analogous situation to what I am trying to do:

import pandas as pd

data = pd.read_csv(data)
df = pd.DataFrame(data)
print(df)

The data frame looks as follows

    ...  'd1'  'd2'   'd3'  ...  'd13'
 0  ...   0     0      0    ...    0
 1  ...   0     0.95   0    ...    0
 2  ...   0     0.95   0.95 ...    0

And so on. Essentially, I would like to select the last 13 columns of my data frame, count how many values per row are greater than a certain value, and then append that count to my data frame as a new column.

I figure there must be a simple way. I have been trying to use df.iloc[:, 21:], since my first column of interest begins there, but from that point on I feel stuck. I have tried many different approaches, such as boolean criteria and for loops; my latest attempt is below. I know this is a trivial thing, but I have spent hours on it. Any help would be much appreciated.

for x in df:
    a = df.iloc[:,21:].values()
    if a.any[:, 12] > 0.9:
        a[x] = 1
    else:
        a[x] = 0
sumdi = sum(a)
df.append(sumdi)

2 Answers

5

I believe you need to compare the last 13 columns, selected by iloc, with gt (>), count the True values with sum, and cast the result to integers:

df['new'] = df.iloc[:,-13:].gt(0.9).sum(axis=1).astype(int)

Sample:

import numpy as np
import pandas as pd

np.random.seed(12)
df = pd.DataFrame(np.random.rand(10, 6))

#compare last 3 columns for > 0.5
df['new'] = df.iloc[:,-3:].gt(.5).sum(axis=1).astype(int)
print (df)
          0         1         2         3         4         5  new
0  0.154163  0.740050  0.263315  0.533739  0.014575  0.918747    2
1  0.900715  0.033421  0.956949  0.137209  0.283828  0.606083    1
2  0.944225  0.852736  0.002259  0.521226  0.552038  0.485377    2
3  0.768134  0.160717  0.764560  0.020810  0.135210  0.116273    0
4  0.309898  0.671453  0.471230  0.816168  0.289587  0.733126    2
5  0.702622  0.327569  0.334648  0.978058  0.624582  0.950314    3
6  0.767476  0.825009  0.406640  0.451308  0.400632  0.995138    1
7  0.177564  0.962597  0.419250  0.424052  0.463149  0.373723    0
8  0.465508  0.035168  0.084273  0.732521  0.636200  0.027908    2
9  0.300170  0.220853  0.055020  0.523246  0.416370  0.048219    1
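
As an aside (not part of the original answer), the same count can be written with the plain > operator instead of gt(); a minimal sketch reusing the sample df above, with 'new2' as a purely illustrative column name:

# a scalar comparison on the DataFrame returns a boolean frame,
# and sum(axis=1) counts the True values per row
df['new2'] = (df.iloc[:, -3:] > 0.5).sum(axis=1)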

Using apply is slow, because there are loops under the hood:

np.random.seed(12)
df = pd.DataFrame(np.random.rand(10000, 20))

In [172]: %timeit df['new'] = df.iloc[:,-13:].gt(0.9).sum(axis=1).astype(int)
3.46 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [173]: %timeit df['new'] = df[df.columns[-13:]].apply(lambda x: x > .9, axis=1).sum(axis=1)
1.57 s ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • I may be missing something here, but how do I bring this down my index so that it iterates through every row? It seems to be functioning for the first row, but just repeats that value for all rows. – Victor Nogueira Nov 14 '18 at 10:28
  • 1
  • @VictorNogueira - don't use it, because it is really slow... in pandas it is best to avoid all loops and use only vectorized solutions where they exist. – jezrael Nov 14 '18 at 10:28
  • @VictorNogueira - check [this](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316) – jezrael Nov 14 '18 at 10:31
  • 1
  • @jezrael I am so thankful for your help. IT WORKED! I spent so long trying to do this. Should have just posted here. Thank you, thank you so much. – Victor Nogueira Nov 14 '18 at 10:41
  • @jezrael Sure thing! I accepted it. If you don't mind, could you explain what was wrong with my code? Just so I gain a fuller understanding of it. I am somewhat of a newcomer to Python, but the power and efficiency I have seen after using it for one week of data manipulation is insane. Thanks in advance. – Victor Nogueira Nov 15 '18 at 19:08
1

Yes, you'll want to apply row-wise functions.

# Select the last 13 columns
cols = df1.iloc[:, -13:].columns
# Create a new column counting, per row, the values greater than 1
df1['new'] = df1[cols].apply(lambda x: x > 1, axis=1).sum(axis=1)

This achieves the same result as @jezrael's answer, just with slightly different syntax: gt() is replaced with an applied lambda. This offers slightly more flexibility for other conditions/cases where your logic is more complex.
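
For instance (a hypothetical illustration, not from the original answer), a condition that depends on the whole row is awkward to write as a single scalar comparison but fits apply naturally:

# Count, per row, how many of the selected values exceed that row's own mean.
# 'above_row_mean' is just an illustrative column name.
df1['above_row_mean'] = df1[cols].apply(lambda row: (row > row.mean()).sum(), axis=1)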

Note: axis=1 is important to ensure your function is applied per row. You can change it to axis=0 to apply it on a column-by-column basis.
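
As a small sketch of that axis=0 variant (reusing df1 and cols from the snippet above; per_column_counts is just an illustrative name):

# With axis=0 the lambda receives each column as a Series,
# so this yields one count per column rather than one per row.
per_column_counts = df1[cols].apply(lambda col: (col > 1).sum(), axis=0)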

cvonsteg