
My task is to create a function that computes a score for the different DataFrame objects passed into it. I settled on using if/else statements to build the score, but I keep running into ValueError exceptions.

My data is all contained in DataFrames, as I collected it from CSV files and performed analysis on them. I will be using generic data for the purposes of this question, since I can't use the actual data for contract reasons.

df = pd.DataFrame(np.random.randint(0,1000,size=(1000)))

I'm just using a randomly generated DataFrame to see if I can make the idea work.

def generic_function_name(self):
    score=0
    if ((df> 700) and (DF>500) and (DF>300)== True):
        score += 3
        if ((df>500) and (df>300) ==True):
            score+=2
            if ((df>300)== True):
                score += 1
                if ((df>300) ==False):
                    score +=0
            print(score)
            return

This is the function I've created, but I keep getting the following exception:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm sure someone much more competent than me will probably be horrified at my creation, but I would beg that you keep the laughter to a minimum while you explain just how wrong I am.

Edit

OK, so following some of your suggestions I made changes to the function:

def generic_function_name(self):
    score=0
    if ((df> 700))  :
        score += 3
        if ((df>500)):
            score+=2
            if ((df>300)):
                score += 1
                if ((df<300)):
                    score +=0
            print(score)
            return

Then when I call generic_function_name(df), it raises:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So the problem is still there.

df.head(20)

     0
0   822
1   484
2   471
3   866
4   883
5   578
6   986
7   133
8   801
9   126
10  415
11  777
12  956
13  2
14  273
15  281
16  741
17  999
18  699
19  367

I have been informed that I'm doing too many comparisons, so I should explain: in the events that this generic data stands in for, higher values are equivalent to high danger that I need to look out for, and the middle and lower thresholds are equivalent to medium and low danger. That is why I had so many comparisons: I want the score to include low through high dangers, with the higher dangers weighted more heavily than the lower ones.

If there is an easier way, please help me understand, as I struggle to see how to create this score any other way.

James
  • Seeing your first line... Challenge accepted. (Note: my English isn't the best, please be tolerant about that.) Please try these modifications, add them to your question (by editing it), and let's keep trying: 1) the `self` word is usually passed when you're working with an object or class; if that's not the case, please remove it. 2) You need to pass the DataFrame (`df` in this case) to the function as a parameter so you can call it with any other dataset. 3) Python variables are case sensitive, so `df` and `DF` are not the same. 4) What are you trying to do with `df>300`? `df` is the whole DataFrame; what are you looking for? – Ulises Bussi Oct 25 '22 at 17:42
  • Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – Zack Walton Oct 25 '22 at 18:05
  • What does your DataFrame look like? Can you post df.head() so we can understand the problem? – Ulises Bussi Oct 25 '22 at 19:49
  • Last of all, when you write `df>500`, are you hoping it returns true when: a) all values are greater than 500? b) one value is? c) the length of the DataFrame is greater than 500? – Ulises Bussi Oct 26 '22 at 01:35
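
The readings of `df>500` raised in the last comment (every value, at least one value, or the number of rows) can be told apart explicitly. The following is a minimal sketch, assuming a single-column DataFrame of random integers like the one in the question; the fixed seed and the variable names are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, only so the example is repeatable
df = pd.DataFrame(rng.integers(0, 1000, size=1000))

# Element-wise comparison: the result is a DataFrame of booleans, not a single bool,
# which is why `if df > 500:` raises the "truth value is ambiguous" ValueError.
mask = df > 500

all_above = bool(mask.all().all())  # is every value greater than 500?
any_above = bool(mask.any().any())  # is at least one value greater than 500?
long_frame = len(df) > 500          # does the frame have more than 500 rows?

print(mask.shape, all_above, any_above, long_frame)
```

Each of the three names is a plain Python bool, so any of them can be used directly in an `if` statement.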

3 Answers


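Instead of using or or and, use | or & respectively (when working with pandas).

Reason: or and and require a single truth value, whereas the truth value of a pandas object (a Series in your case) is considered ambiguous, so you should use the element-wise operators | (or) and & (and) instead. Note that the combined result is still a Series of booleans, so it has to be reduced with .any() or .all() before it can be used in an if statement.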
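
As a rough illustration of the point above, the sketch below combines two comparisons with `&` and then reduces the result before branching on it. The sample values are the first eight rows of the `df.head(20)` output in the question; the variable names are assumptions made for the example.

```python
import pandas as pd

# First eight values from the question's df.head(20) output
s = pd.Series([822, 484, 471, 866, 883, 578, 986, 133])

# Element-wise AND; the parentheses are required because & binds tighter than >= and <.
# Writing `(s >= 500) and (s < 700)` instead would raise the ValueError from the question.
mid = (s >= 500) & (s < 700)

try:
    if mid:  # still a Series, so it still has no single truth value
        pass
except ValueError as exc:
    print("still ambiguous:", exc)

if mid.any():  # reduced to one bool, so this is fine
    print("at least one value is in [500, 700)")

print(int(mid.sum()), "value(s) in [500, 700)")
```

Summing a boolean Series counts its True entries, which is often what a threshold check is really after.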

Other Pointers:

  1. DF and df are not referencing the same variable. If this is not intended, you could simplify the conditions of your if statements a lot.
  2. It's not very pythonic to check for == True, for example:
if value == True:
  pass

# is the same as
if value:
  pass

Resources

See this question: Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

DataFrame docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Zack Walton
  • OK, I followed the suggestions: replaced `and` with `&`, corrected the `DF` accident (not deliberate, just a brain fart and holding the shift key), and cleaned up the `== True` checks, and now I get ``` TypeError: generic_function_name() takes 0 positional arguments but 1 was given ``` – James Oct 25 '22 at 18:37
  • @James pandas apply() function actually passes the data for you to work with (to your function), see the documentation with examples: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html Please also consider accepting my answer if it fixed your original question :) – Zack Walton Oct 25 '22 at 19:02
  • OK, so I made the changes and removed the `self` like the other comment said, but I got the new ValueError; once I put the `self` back I got the original ValueError. So I don't feel this answered my original question, either that or I'm too dumb to realise how it solves it. – James Oct 25 '22 at 19:11
  • The reality is this code is riddled with errors and I can't really go through each one like this. 1. You are using far more comparisons than you really need. 2. You should really read the documentation about DataFrame objects at the link above. For example, do you want to use `.iloc[]` to check the value at a position in your one-dimensional DataFrame (more info in the documentation)? Once you get that down, your original issue is fixed. You can't use `if (DataFrame)` (which is why you're getting an error), which is also explained in the duplicate question that I linked. – Zack Walton Oct 25 '22 at 19:47
  • I'm sorry Zack, I genuinely don't understand what I'm meant to do. You tell me I use an excessive number of comparisons, but I genuinely have no idea how to cut them down. Also, how would I use the `.iloc` indexer you suggested? I'm struggling to see how it would let me count the number of values above a certain threshold. – James Oct 25 '22 at 20:35
  • That's fine @James :) I am getting downvoted now, ahhh. 1. Just one example for the comparisons: why would you check if a value is greater than 700, 500, and 300? Just check if it's greater than 700; the others are then true as well. 2. If you want to count the number over a certain threshold, use this: `df[(df > threshold_integer).any(axis=1)]`. It keeps only the rows that have a value over that threshold; count the rows that remain to get how many there are. That is a one-liner by the way, so you don't need to use apply() or a function at all. – Zack Walton Oct 25 '22 at 20:54
  • resource: https://stackoverflow.com/questions/42613467/how-to-select-all-rows-which-contain-values-greater-than-a-threshold – Zack Walton Oct 25 '22 at 20:55
  • OK, so I think this may be due to my failure to communicate the problem properly; I will add this to the question at the top. My task is to create a score from the data I have been given, to check how badly people are doing at something: the higher the score, the worse they are doing. I made the three thresholds because I needed a way to differentiate the worst from the best, so the lowest threshold adds the fewest points to the score and the highest threshold adds the most. That's why I had the three values; I'm trying to create a function I can use to compute a score. – James Oct 25 '22 at 21:03
  • So you need a score for every entry in the frame, i.e. every person? – Ulises Bussi Oct 26 '22 at 13:34
  • Each DataFrame is supposed to be an individual day, but I managed to figure out a separate way to make a score that doesn't use if/else statements. I will write it up in a following answer. Thank you Ulises and Zack for helping me work out where I went wrong in my approach to the task. – James Oct 26 '22 at 16:20

I managed to work out a separate way to get to the core of my question, which was making a scoring system for DataFrames using thresholds. It's probably pretty ugly but it works, which is what matters in the end.

def Rough_function(df):
    df1 = df[(df >= 700)]
    df2= df[(df<700)& (df>=500)]
    df3= df[(df<500)& (df>=300)]
    a=df1.value_counts().sum()
    b=df2.value_counts().sum()
    c=df3.value_counts().sum()
    score=(a*3)+(b*2)+c
    print(score)

Using the previously mentioned random DataFrame line to create different DataFrames results in different scores being printed, and when I tried it on the actual data it works as well. Thank you Zack and Ulises for your help; I would have been stuck trying to force my initial idea to work without your input.
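
The same counts can also be obtained by summing boolean masks directly, without building three filtered frames. This is only a sketch under the assumption that the data is purely numeric, as in the question; `threshold_score` is a hypothetical name, and unlike `Rough_function` it returns the score instead of printing it.

```python
import pandas as pd

def threshold_score(df):
    """Hypothetical variant of Rough_function: 3, 2 or 1 points per value by threshold."""
    values = df.to_numpy().ravel()  # flatten the frame into a 1-D NumPy array
    high = (values >= 700).sum()                    # True counts as 1 when summed
    mid = ((values >= 500) & (values < 700)).sum()
    low = ((values >= 300) & (values < 500)).sum()
    return int(3 * high + 2 * mid + low)

# First ten values from the question's df.head(20) output
sample = pd.DataFrame([822, 484, 471, 866, 883, 578, 986, 133, 801, 126])
print(threshold_score(sample))  # 19
```

Returning the score rather than printing it lets the caller store or compare scores across several DataFrames.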

James

Given your solution, I have a proposal that may be more useful:

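import numpy as np
import pandas as pd

def rough_function(df):
    # bins are 0<=x<300, 300<=x<500, 500<=x<700, 700<=x<inf
    bins = [0, 300, 500, 700, np.inf]
    # score awarded for every value that falls in each bin
    values = np.array([0, 1, 2, 3])
    # observed=False keeps empty bins, so counts always lines up with values
    grouped = df.groupby(pd.cut(df[0], bins, right=False), observed=False)
    counts = grouped.count().values[:, 0]
    score = np.sum(counts * values)
    return score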

This has some advantages:

  1. It is a little faster:
from timeit import timeit

print(timeit(lambda: Rough_function(df), number=1000))
print(timeit(lambda: rough_function(df), number=1000))

which prints (after changing print to return in both, to be fair):

4.102351880981587
1.376866523991339
  2. It's customizable: you can easily change the thresholds that separate the scores (by changing bins) or the value for every bin, and you can also change the number of bins (by adding entries to both bins and values) without adding more lines.
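
To make the customization point concrete, here is a hedged sketch in which the bins and values are passed in as arguments. `binned_score` and its parameters are hypothetical names, and it swaps the `groupby` count for `pd.cut(..., labels=False)`, which returns the integer index of the bin each value falls into. It assumes every value lies inside the given bins.

```python
import numpy as np
import pandas as pd

def binned_score(df, bins, values, column=0):
    """Hypothetical generalisation: sum the per-bin score of every entry in one column."""
    # labels=False makes pd.cut return the integer position of each value's bin
    codes = pd.cut(df[column], bins, right=False, labels=False)
    # Look up the score for each bin index and add everything up
    return int(np.asarray(values)[codes.to_numpy()].sum())

# First ten values from the question's df.head(20) output
sample = pd.DataFrame([822, 484, 471, 866, 883, 578, 986, 133, 801, 126])

# Same thresholds and weights as the function above
print(binned_score(sample, [0, 300, 500, 700, np.inf], [0, 1, 2, 3]))         # 19
# One extra bin: values of 900 and above are worth 5 points each
print(binned_score(sample, [0, 300, 500, 700, 900, np.inf], [0, 1, 2, 3, 5]))  # 21
```

Adding a danger level then only means adding one edge to `bins` and one weight to `values`.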
Ulises Bussi