0

I'm trying to return the rows which qualify a certain condition in a for loop, into a new dataframe. So in short, I'd like to see row 4 (result for loop) in a new df. In addition, no duplicate rows. So, if 3 through 5 would be True, then every row should be returned once.

np.random.seed(0)
N = 5

df = pd.DataFrame(
    {'X':np.random.uniform(-5,5,N),
     'Y':np.random.uniform(-5,5,N),
     'Z':np.random.uniform(-5,5,N),
    })
df

enter image description here

for i in range(len(df)):
    for k in range( i+1, len(df)+1 ):
        df_sum = df[i:k].sum()
        print( "rows", i, "to", k, (df_sum>-5).all() & (df_sum.sum()<1 ))

enter image description here

I'd like to return this;

enter image description here

ashishsingal
  • 2,810
  • 3
  • 18
  • 26
Zanshin
  • 1,262
  • 1
  • 14
  • 30

2 Answers2

1

Try this ..

df_result = df[0:0]                                 # new 1
for i in range(len(df)):
    for k in range( i+1, len(df)+1 ):
        df_sum = df[i:k].sum()
        print( "rows", i, "to", k, (df_sum>-5).all() & (df_sum.sum()<1 ))
        if ((df_sum>-5).all() & (df_sum.sum()<1 )): # new 2
            df_result = df_result.append(df[i:k])   # new 3
df_result.drop_duplicates()                         # new 4

The first new line creates an empty DataFrame but preserving the column structure of the original. The second and third new lines append the rows to the result DataFrame if they meet your condition. The final new line removes the duplicates.

Also, note that this solution is not the most performant solution because appending rows to a DataFrame is inefficient. In case performance becomes an issue, you may want to convert df_result to a dictionary and convert it to a DataFrame at the very end.

ashishsingal
  • 2,810
  • 3
  • 18
  • 26
1

Another solution is create helper df and then use isin:

df = pd.DataFrame(
    {'X':np.random.uniform(-5,5,N),
     'Y':np.random.uniform(-5,5,N),
     'Z':np.random.uniform(-5,5,N),
    })
print (df)

arr = []
for i in range(len(df)):
    for k in range( i+1, len(df)+1 ):
        df_sum = df[i:k].sum()
        val = (df_sum>-5).all() & (df_sum.sum()<1 )
        print( "rows", i, "to", k, val)
        arr.append([i,val])
#print (arr)

df1 = pd.DataFrame(arr, columns=['ROWS','MASK']).set_index('ROWS')
idx = df1[df1.MASK].index

print (idx)
Int64Index([4], dtype='int64', name='ROWS')

print (df[df.index.isin(idx)])
          X         Y         Z
4 -0.763452 -1.165585 -4.289639

All together with multiple DataFrames:

np.random.seed(0)
N = 5
df1 = pd.DataFrame(
    {'X':np.random.uniform(-5,5,N),
     'Y':np.random.uniform(-5,5,N),
     'Z':np.random.uniform(-5,5,N),
    })
N = 6
df2 = pd.DataFrame(
    {'X':np.random.uniform(-5,5,N),
     'Y':np.random.uniform(-5,5,N),
     'Z':np.random.uniform(-5,5,N),
    })
N = 7
df3 = pd.DataFrame(
    {'X':np.random.uniform(-5,5,N),
     'Y':np.random.uniform(-5,5,N),
     'Z':np.random.uniform(-5,5,N),
    })

L = [df1, df2, df3]
print (L)

dfs = {}
for j, df in enumerate(L):
    arr = []
    for i in range(len(df)):
        for k in range( i+1, len(df)+1 ):
            df_sum = df[i:k].sum()
            val = (df_sum>-5).all() & (df_sum.sum()<1 )
            #print( "rows", i, "to", k, val)
            arr.append([i,val])
    df1 = pd.DataFrame(arr, columns=['ROWS','MASK']).set_index('ROWS')
    idx = df1[df1.MASK].index
    #print (df[df.index.isin(idx)])
    dfs['df' + str(j + 1)] = df[df.index.isin(idx)]

print (dfs)
{'df1':           X         Y         Z
4 -0.763452 -1.165585 -4.289639, 'df2':           X         Y         Z
0 -4.128707  2.991586  4.446689
1 -4.797816 -0.385206  0.218483
3  2.781568 -3.817256 -2.354444
5  4.786183 -3.566467 -0.438497, 'df3':           X         Y         Z
0  0.684339 -1.404921 -3.710737
1 -4.812102 -0.629680 -1.845716
2  1.176355  1.976312 -1.362892
3  1.120957 -4.397745  0.701968
6  1.818203 -2.896174 -3.979552}

print (dfs['df1'])
          X         Y         Z
4 -0.763452 -1.165585 -4.289639
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks, both jezrael and ashishsingal. I have 63 df's I need to run this code. How would I go about that, and do their names change? – Zanshin Jan 12 '17 at 16:10
  • I think you can create dictionary for storing DataFrames and then loop as I see in question before. – jezrael Jan 12 '17 at 16:22
  • Unfortunately now I am out of PC, so it is a bit complicated. But first create empty dictionary `dfs = {}`, and in loop of list of dataframes append data like `for x in [df1, df2,df3]: #all code...dfs[x]=df[df.index.isin(idx)]` I hope it a bit help you. – jezrael Jan 12 '17 at 16:58
  • nope, cannot get it to work. what is in the codeblock and what isn't. thanks though – Zanshin Jan 13 '17 at 10:21
  • Can you explain more what need? – jezrael Jan 13 '17 at 10:22
  • run into problems creating the dictionary + for loop. cannot get it to work properly. which code lines are in the codeblock and which aren't, the ones you suggested I mean. – Zanshin Jan 13 '17 at 10:27
  • Ok, now am going to dinner, but later I try help. So I understand correctlyy, You have `L = ['df1', 'df2', df3']` and desired output is `dfs = {'df1': rows from df1 by conditions by code, 'df2':rows from df2 by conditions by code, 'df3': rows from df3 by conditions by code}` ? – jezrael Jan 13 '17 at 10:31
  • Please check my solution. – jezrael Jan 13 '17 at 13:08
  • works great. run into a problem with the logic though. when I test df2 for val = (df_sum<2).all() & (df_sum.sum()<-6 ), it returns row 7 and 8 as well where it shouldn't. When it evaluates multiple rows it should multiply -6 as well by number of rows. row 7, 8 and 9 should be lower than -18 i.e. – Zanshin Jan 13 '17 at 13:30
  • I am test it and I am not sure if understand. Why `-6` ? Solution test with `-5` only. – jezrael Jan 13 '17 at 13:59
  • My solution create with `True`, `False` values fro loop `DataFrame` - `df1` and give only rows with `True` values, other are remove by `df1[df1.MASK]`. then get index values of filtered dataframe and by `isin` get all rows of original `df`. My logic is wrong? – jezrael Jan 13 '17 at 14:02
  • to test if the logic works in any case. if you consider rows as bins the total sum value of one row in this example is then <-6, if it evaluates 3 rows than the total sum should be <-18. your logic isn't wrong, the conditions are – Zanshin Jan 13 '17 at 14:03
  • I am not sure if understand, maybe author help more - http://stackoverflow.com/a/41595294/2901002 – jezrael Jan 13 '17 at 14:11
  • solved it, add in length of df[i:k]: val = (df_sum<2).all() & (df_sum.sum()<(len(df[i:k])*-6) ). thanks! – Zanshin Jan 13 '17 at 14:20
  • Super, congrat. Btw, I see you prefer structures with many dataframes as list or dict. Is possible for your solutions concat all input `DataFrame` to one `pd.concat([df1, df2, '... dfn])` and then works with one big dataframe? Or not? – jezrael Jan 13 '17 at 14:23
  • yeah, I'm creating df's from different angles and they should result in one eventually. not sure yet how I'lll put them together. will be based on their names though – Zanshin Jan 13 '17 at 14:36
  • I was only curious, ;) Nice day and thank you for accepting. – jezrael Jan 13 '17 at 14:37