
I am trying to count the number of rows it takes for a row value 'Neg' to go from its default 0 to 1, and to capture this count wherever Neg = 1 in a new column called 'dsf'. I tried the following code snippet, but I'm not sure why it puts 0 for all 'dsf' values.

Why is this wrong?


full_data['dsf'] = 0
counter = 0
for i,r in full_data.iterrows():
    if r['neg'] == 0:
        counter+=1
        r['dsf'] = 0
    else:
        r['dsf'] = counter
        counter = 0
full_data

Current output:

    datehour            pft     rev         mgn        neg  dsf
0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0   0
1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0   0
2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1   0
3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0   0
4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1   0

Ideal output:

    datehour            pft     rev         mgn         neg dsf
0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0   0
1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0   0
2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1   3
3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0   0
4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1   2
Deepak
  • Do you need to count just the zero values? I see from your desired output that you count the 1 as well: one count for 0 (first row), another for 0 (second row), and a third count for 1 (third row, where neg = 1)? – cnicollet Apr 19 '18 at 10:26
  • So, is your question resolved now? Several users put a lot of effort into their answers. Please provide feedback – Quickbeam2k1 Apr 20 '18 at 12:06
  • Hi, yes, the question's been resolved. I'll comment below, thank you. – Deepak Apr 20 '18 at 12:17

3 Answers


You should initialise counter to 1 outside the for loop, and reset it to 1 (not 0) after each hit, so that the row where neg = 1 is itself counted. Here is an example:

import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1]})

df['dsf'] = 0
counter = 1  # start at 1 so the row where neg == 1 is itself counted

for i, j in df.iterrows():
    if j['neg'] == 0:
        j['dsf'] = 0
        counter += 1
    else:
        j['dsf'] = counter
        counter = 1  # reset to 1 so the next run's closing 1 is counted too
# note: assigning to j happens to stick here because this single-dtype
# frame makes iterrows yield views (see the comments below)

df

output:

   neg dsf
0   0   0
1   0   0
2   1   3
3   0   0
4   1   2

Notice that the result is exactly the same as your desired output. But if you want to count just the zero values, initialize counter to 0 both outside the for loop and when resetting it in the else branch. The result would then look like this (a sketch of that variant follows the table):

   neg dsf
0   0   0
1   0   0
2   1   2
3   0   0
4   1   1
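
For completeness, a minimal sketch of that zero-only variant (writing through df.loc so the updates stick even when iterrows yields copies):

import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1]})
df['dsf'] = 0
counter = 0  # counts only the zeros seen since the last 1

for i, r in df.iterrows():
    if r['neg'] == 0:
        counter += 1
    else:
        df.loc[i, 'dsf'] = counter  # write via .loc, not the row copy
        counter = 0

print(df)
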
cnicollet
  • Actually, `iterrows` works differently for a dataframe with rows like `[0]` and a dataframe with rows like `['2018-04-01 00:00:00', 53.1783, 110.8514, 0.479726, 0]`. In the first case, `iterrows` returns a view; in the latter, it returns a copy – Grigoriy Mikhalkin Apr 19 '18 at 10:41
  • But you are using *counter* to fill the new column. If you start with counter = 0, then the first iteration gives you counter = 1; the second iteration: neg = 0 ==> counter = 2; the third iteration: neg = 1 ==> the if statement is not true (neg == 0), so you execute the *else*, which gives you df['dsf'] = 2 (the last value of counter). So comparing your code with the desired output, you can see that you will not get the same result when you initialize counter to 0. – cnicollet Apr 19 '18 at 10:50
  • Yes, the counter needs to be initialized to 1. However, as far as the logic is concerned, isn't the code listed here the same as my original post? – Deepak Apr 20 '18 at 12:20

From the iterrows docs:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

So, in your case, inside the for loop you're not modifying the original DataFrame, because iterrows returns a copy. For more details on views versus copies, read http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
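
To see this concretely, here is a small sketch with hypothetical data mirroring your mixed-dtype frame: assigning to the row that iterrows yields leaves the frame untouched.

import pandas as pd

# mixed dtypes (object + int), as in your frame, so iterrows yields copies
df = pd.DataFrame({'datehour': ['2018-04-01 00:00:00'], 'neg': [0]})

for i, r in df.iterrows():
    r['neg'] = 99  # modifies the copied row only

print(df['neg'].iloc[0])  # still 0 -- the original frame is unchanged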

Here is a fixed version of your code:

import pandas as pd

df = pd.DataFrame([
    ['2018-04-01 00:00:00', 53.1783, 110.8514, 0.479726, 0], 
    ['2018-04-01 00:30:00', 51.1496, 105.9060, 0.482972, 0], 
    ['2018-04-01 01:00:00', 42.9360, 120.7555, 0.355561, 1], 
    ['2018-04-01 01:30:00', 37.8455, 114.5514, 0.330380, 0], 
    ['2018-04-01 02:00:00', 43.9254, 99.1340,  0.443091, 1]], 
    columns=['datehour', 'pft', 'rev', 'mgn', 'neg'])

df['dsf'] = 0
counter = 0

for i, r in df.iterrows():
    counter += 1
    if r['neg'] != 0:
        # write through the frame itself; assigning to r would only change the copy
        df.loc[i, 'dsf'] = counter
        counter = 0

print(df)
#                datehour     pft      rev         mgn   neg      dsf
# 0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0         0
# 1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0         0
# 2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1         3
# 3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0         0
# 4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1         2
Grigoriy Mikhalkin
  • This is the solution I ended up using because it was the simplest and deviated the least from my logic. And it worked, so yes, thanks a lot mate! – Deepak Apr 20 '18 at 12:19

This is just a different solution to your problem that should be much faster than using iterrows. You should always try to use as much vectorization as possible with pandas.

import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1, 0, 0, 1]})
indexes = df[df['neg'] == 1].index
# gap to the previous 1 (or to the start of the frame, hence the fillna(-1))
values = indexes - indexes.to_series().shift().fillna(-1)
df.assign(dsf=pd.Series(values, index=indexes)).fillna(0)

    neg dsf
0   0   0.0
1   0   0.0
2   1   3.0
3   0   0.0
4   1   2.0
5   0   0.0
6   0   0.0
7   1   3.0

If you want, you can convert the dsf column to int yourself.
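
For reference, a fully vectorized sketch of the same idea (assuming a pandas version with shift(fill_value=...), i.e. 0.24+): cumsum numbers the stretches between 1s, cumcount counts positions within each stretch, and where keeps the count only on the rows where neg == 1.

import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1, 0, 0, 1]})

# group id: increases right after each 1, so every stretch ending in a 1 is one group
grp = df['neg'].cumsum().shift(fill_value=0)
# 1-based position within the group = rows since the last 1, inclusive
counts = df.groupby(grp).cumcount() + 1
# keep the count only where neg == 1, else 0
df['dsf'] = counts.where(df['neg'] == 1, 0)
print(df)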

Quickbeam2k1
  • This seems like the most pythonic way to do things, but I actually got it solved with the other answer down below. Let me come back to this and test it out and post comments here. Thanks! – Deepak Apr 20 '18 at 12:22
  • I'm not entirely sure how the assign statement works, so I resorted to: `df['dsf']=pd.Series(values, index=indexes)` and yes, you're right, this worked and did it much faster. However, the NaNs remained as NaNs in spite of fillna. But I think that might be a separate issue, so this is fine for now. Thank you – Deepak Apr 21 '18 at 04:51
  • Maybe just use `df = df.assign(...)`; this will overwrite `df` with the correct values. assign just appends a column without modifying the underlying data frame. Concerning the NaNs, note that I apply fillna after the assign. In your solution, the pd.Series part does not produce NaNs (check it, the series is NaN-free); it is the assignment to 'dsf' that introduces NaNs, since no value is passed for some index values of df. – Quickbeam2k1 Apr 21 '18 at 19:17
  • Yes, the NaNs arise at the assignment step, but they didn't change despite fillna(0) after the new column was added to the dataframe. However, that's a separate issue; I'll have to look into why the NaNs aren't disappearing with fillna. But as of now, I think this was a smart way to go about the problem, since until now I had simply ignored the pandas index for any and all logic and just used iterrows and loc wherever required. Now I think I'm gonna start using them. Thanks buddy! – Deepak Apr 22 '18 at 09:30
  • `assign` is by default not an in-place function. In general, you should try to avoid iterating over rows, since it interferes with pandas' vectorization capabilities. – Quickbeam2k1 Apr 22 '18 at 10:34