
I am running a fairly complex filter on a dataframe in pandas (I am filtering for passing test results against 67 different thresholds via a dictionary). In order to do this I have the following:

query_string = ' | '.join([f'{k} > {v}' for k, v in dictionary.items()])
test_passes = df.query(query_string, engine='python')

Where k is the test name and v is the threshold value.
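For illustration, with a hypothetical two-entry dictionary the generated query string looks like this:

dictionary = {'test_a': 0.005, 'test_b': 0.26}  # hypothetical test names and thresholds
query_string = ' | '.join([f'{k} > {v}' for k, v in dictionary.items()])
# query_string == 'test_a > 0.005 | test_b > 0.26'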

This is working nicely and I am able to export the rows with test passes to csv.

I am wondering though if there is a way to also attach a column which counts the number of test passes, so that, for example, a particular row would record anywhere from 1 to 67 test passes.

3 Answers


So I finally 'solved' it with the following, starting after the pandas query originally posted. The original question asked about test passes, but my use case is actually for test failures, hence the variable names below.

test_failures = data.query(query_string, engine='python').copy()

The .copy() is to prevent unintentional data manipulation and the SettingWithCopyWarning messages that chained assignment can trigger.

for k, row in test_failures.iterrows():
    failure_count = 0
    test_count = 0
    # tally every threshold test for this row
    for key, val in threshold_dict.items():
        test_count += 1
        if row[key] > val:
            failure_count += 1
    # .at writes a single scalar back by row label / column name
    test_failures.at[k, 'Test Count'] = test_count
    test_failures.at[k, 'Failure Count'] = failure_count

From what I have read, iterrows() is not the fastest iteration method, but it does provide the index (k) and the row as a Series (row) separately, which I found more useful for these purposes than the named tuple returned by itertuples().

sorted_test_failures = test_failures.sort_values('Failure Count', ascending=False)  

sorted_test_failures.to_csv('failures.csv', encoding='utf8')

A little sorting and saving to finish.

I have tested on a dummy data set of (8000 x 66) - it doesn't provide groundbreaking speed but it does the job. Any improvements would be great!
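One possible improvement (a sketch, untested against the real data, assuming data and threshold_dict as above): the counting can be vectorized by broadcasting the thresholds across the test columns, which removes the Python-level loop entirely.

import pandas as pd

# Series of thresholds indexed by test-column name
thresholds = pd.Series(threshold_dict)
cols = list(threshold_dict)

# Boolean frame: True wherever a value exceeds its threshold (a failure);
# .gt() aligns the Series index with the dataframe's columns
fails = data[cols].gt(thresholds)

# keep rows with at least one failure, then tally per row
test_failures = data.loc[fails.any(axis=1)].copy()
test_failures['Test Count'] = len(threshold_dict)
test_failures['Failure Count'] = fails.loc[test_failures.index].sum(axis=1)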

  • Ok, I think I want to ask: does your dictionary containing the solutions (aka the 'passing values') have to be a dictionary? Because I'm thinking that if it too were a pandas dataframe, things could get easier. – RockAndRoleCoder Mar 24 '19 at 01:50
  • The only reason it's a dictionary is that I like the functionality of key value pairs. I had not considered using another pandas dataframe as the analytical tool. I'll have a look into that to see if it improves performance and analytics. Thanks @RockAndRoleCoder – Mike Fields Mar 24 '19 at 16:36

This was answered here:

https://stackoverflow.com/a/24516612/6815750

But to give an example you could do the following:

new_df = df.apply(pd.Series.value_counts, axis=1)  # where df is your current dataframe holding the pass/fails

df[new_df.columns] = new_df

Example
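A minimal sketch of what that produces, assuming a hypothetical dataframe of 'pass'/'fail' strings:

import pandas as pd

df = pd.DataFrame({'t1': ['pass', 'fail'], 't2': ['pass', 'pass']})
new_df = df.apply(pd.Series.value_counts, axis=1)
# new_df gets one column per distinct value ('pass', 'fail'),
# holding per-row counts, with NaN where a value never occurs
df[new_df.columns] = new_df
print(df)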

RockAndRoleCoder
  • The difficulty is that in my current data frame I don't have a boolean 'pass/fail' but rather values like 0.006. It is these values which I am testing for pass/fail using the external dictionary. – Mike Fields Mar 22 '19 at 17:00
  • So to be clear, your dictionary is holding the key values that determine if your dataframe's row values are passing? – RockAndRoleCoder Mar 22 '19 at 20:12
  • Can you upload a sample row from your dataframe? If the rows are very long, just trim it down to 5 columns, the final resolution should be able to be scaled up to your needs. – RockAndRoleCoder Mar 22 '19 at 20:14
  • You are totally correct - my dictionary determines if the row values in the dataframe are 'passes' or 'fails'. I actually found a solution which I will post below...perhaps you can let me know your thoughts? – Mike Fields Mar 24 '19 at 01:13

You can use the following approach instead:

import pandas as pd

dictionary = {'a': 'b', 'b': 'c'}
data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 2], 'c': [2, 1, 1]})
# one boolean Series per condition; transpose so rows line up with `data`
test_components = pd.DataFrame([data.loc[:, k] > data.loc[:, v] for k, v in dictionary.items()]).T
# now you can inspect which conditions were met in the `test_components` variable
condition = test_components.any(axis=1)
data_filtered = data.loc[condition, :]
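The example above compares columns against other columns; with a threshold dictionary like the one in the question (column name to scalar), the same pattern would compare against scalars instead (a sketch with hypothetical thresholds):

threshold_dict = {'a': 0.5, 'b': 1.5}  # hypothetical scalar thresholds
test_components = pd.DataFrame([data.loc[:, k] > v for k, v in threshold_dict.items()]).T
condition = test_components.any(axis=1)
data_filtered = data.loc[condition, :]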
bubble
  • My dictionary is more like {'a': 0.005, 'b': 0.26} with 65 such entries. My dataframe is thousands of lines with values read from csv. – Mike Fields Mar 22 '19 at 17:18