
I've got this table:

(Image: a DataFrame table created in a Jupyter Notebook.)

This is actually only part of the table.

The complete table comes from a .csv file; the .head() method shows only the first five rows.

I need to write a function that returns and prints the maximum of all the values in the second column, whose label is 'Gold'.
The function should return a single string value.

I looked at several sources before writing my question and tried many ways to solve my problem.

It seems there should be a very easy solution, but unfortunately I did not manage to find it.
(Are there perhaps several possible solutions to this problem?)

Please help me, I'm totally confused.
Thanks!

Here are all the sources:

And here are all the ways I've tried to solve the problem; some of them had syntax errors:

1.a: The traditional algorithm to find out the maximum value, like in C language: a 'for' loop.

def answer_one():
    row=1
    max_gold = df['Gold'].row  # Setting the initial maximum.
    for col in df.columns:
        if col[:2]=='Gold': # finding the column.
            # now iterating through all the rows, finding finally the absolute maximum:
            for row in df.itertuples():  # I also tried: for row=2 in df.rows:
                if(df['Gold'].row > max_gold)  # I also tried: if(row.Gold > max_gold)
                    max_gold = df['Gold'].row  #  I also tried: max_gold = row.Gold
    return df.max_gold

I had trouble merging the printing into the code above, so I added it separately:

1.b:

for row in df.itertuples():
    print(row.Gold)         # or: print(max_gold)

1.c:

for col in df.columns:
    if col[:2]=='Gold':
        df[df['Gold'].max()]

2.

def answer_one():
    df = pd.DataFrame(columns=['Gold']) # syntax error.
    for row in df.itertuples():    # The same as the separate code section above.
        print(row.Gold)

3.

def answer_one():
    print(df[['Gold']][df.Value == df.Value.max()]) # I don't know if "Value" is a keyword or not.

4.

def answer_one():
    return df['Gold'].max() # right syntax, wrong result (not the max value).

5.

def answer_one():
    s=data.max()
    print '%s' % (s['Gold']) # syntax error.

6.a:

def answer_one():
    df.loc[df['Gold'].idxmax()] # right syntax, wrong output (all the column indexes of the table are shown in a column)

6.b:

def answer_one():
    df.loc[:,['Gold']]  # or: df.loc['Gold']
    df['Gold'].max()
Yoel Zajac

2 Answers


Great first question. I assume you're doing the Python for data science course on Coursera?

As already pointed out, df['Gold'].max() is correct. However, if the column has the wrong data type, it will not return the expected result, so the first thing is to make sure the column is numeric. You can check this by running df['Gold'].dtype. If the output isn't int64 for this dataset, you can likely correct it by running df.loc[:,'Gold'] = df.loc[:,'Gold'].str.replace(',','').astype(int). After that, df['Gold'].max() will return 1022.
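To illustrate why the data type matters, here is a minimal sketch on made-up data (the three values and the index labels are assumptions, not your real table). When the column holds text, max() compares the strings character by character, so '976' beats '1,022'; after the conversion the comparison is numeric.

```python
import pandas as pd

# Hypothetical sample: 'Gold' was read as text because of the thousands separator.
df = pd.DataFrame({"Gold": ["976", "1,022", "83"]}, index=["A", "B", "C"])

print(df["Gold"].dtype)        # a text dtype, not int64

# On text, max() is lexicographic: '9' sorts after '8' and '1'.
text_max = df["Gold"].max()

# Strip the separators and convert to integers, then take the maximum again.
df["Gold"] = df["Gold"].str.replace(",", "", regex=False).astype(int)
numeric_max = df["Gold"].max()

print(text_max, numeric_max)
```

If df['Gold'].dtype already reports an integer type, this conversion is unnecessary and the .str accessor will not be available.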

When it comes to the for loop, in this case you can iterate over the values of the Gold series directly, instead of iterating over all the columns and all the rows. Note that Python uses zero-based indexing, so if you use row 1 as the starting point you will get the wrong result whenever the largest value is in the first row (row 0). Also, you index with square brackets, [index], and not with attribute access, .index. So the for loop could look like this:

# Start from the first value by position; .iloc[0] also works when the index holds labels.
CurrentMax = df['Gold'].iloc[0]
for value in df['Gold']:
    if value > CurrentMax:
        CurrentMax = value
print(CurrentMax)

Wrapped as a function:

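def rowbyrow(df=df):  # note: the default df is bound when the function is defined
    # Start from the first value by position, then keep the largest value seen.
    CurrentMax = df['Gold'].iloc[0]
    for value in df['Gold']:
        if value > CurrentMax:
            CurrentMax = value
    #print(CurrentMax) if you want to print the result when running
    return CurrentMax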

Regarding point 3: I believe what you're after is the expression below. It filters Gold down to the rows where the value of Gold equals the maximum. Because you used two brackets around Gold, it returns a DataFrame and not just the value: df[['Gold']][df.Gold == df.Gold.max()]. With one bracket it returns a Series: df['Gold'][df.Gold == df.Gold.max()]
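A small sketch of the difference between the two bracket styles, using made-up data (the values and index labels are assumptions):

```python
import pandas as pd

# Hypothetical data standing in for the real table.
df = pd.DataFrame({"Gold": [10, 42, 7]}, index=["X", "Y", "Z"])

# Boolean mask that is True only where Gold equals the column maximum.
mask = df["Gold"] == df["Gold"].max()

as_frame = df[["Gold"]][mask]   # double brackets: a one-column DataFrame
as_series = df["Gold"][mask]    # single brackets: a Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series.iloc[0])        # the bare value, taken by position
```

Either way you get a container back, so you still need something like .iloc[0] to pull out the single value.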

Regarding point 5: the syntax error is probably caused by running Python 3, where print is a function and needs parentheses. So the code below (using df rather than data) should work:

s=df.max()
print('%s' % (s['Gold']))
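Since the function is supposed to return a string, here is a short sketch of a few equivalent ways to turn the maximum into one in Python 3. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical numeric table.
df = pd.DataFrame({"Gold": [10, 42, 7], "Silver": [3, 1, 8]})

s = df.max()                          # column-wise maxima as a Series
old_style = '%s' % (s['Gold'])        # %-formatting, as in the snippet above
as_str = str(df['Gold'].max())        # plain conversion with str()
f_string = f"{df['Gold'].max()}"      # f-string (Python 3.6+)

print(old_style, as_str, f_string)
```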

Regarding point 6.a: if you want to output only a specific column, pass the column label(s) after the row selector, separated by a comma, like below:

df.loc[df['Gold'].idxmax(),'Gold']

If you want to return several columns, you can pass a list, e.g.:

df.loc[df['Gold'].idxmax(),['Country','Gold']]
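As a rough sketch of what these calls give back: idxmax() returns the index label of the row holding the maximum, and .loc then looks that label up. The country names below are invented, and using them as the index is an assumption about how your table is set up.

```python
import pandas as pd

# Hypothetical table; the index holds the (made-up) country names.
df = pd.DataFrame(
    {"Country": ["Aland", "Borduria", "Cadia"], "Gold": [10, 42, 7]},
    index=["Aland", "Borduria", "Cadia"],
)

label = df["Gold"].idxmax()              # index label of the row with the maximum
value = df.loc[label, "Gold"]            # a single cell
row = df.loc[label, ["Country", "Gold"]] # several columns of that row, as a Series

print(label, value)
```

If the index holds names, the label on its own is already a single string, which may be closer to what the exercise asks for than the number.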

Regarding point 1.c: col[:2] returns only the first two characters of the column name, so the comparison with the four-letter word 'Gold' is always False.
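A quick sketch of the slicing behaviour with some made-up column labels (the list is an assumption, not your actual columns):

```python
# Hypothetical column labels.
columns = ["Gold", "Silver", "Gold.1", "Total"]

# A two-character slice such as 'Go' can never equal the four-character 'Gold'.
two_chars = [c for c in columns if c[:2] == "Gold"]

# Compare the whole label, or use startswith() for a prefix match.
exact = [c for c in columns if c == "Gold"]
prefixed = [c for c in columns if c.startswith("Gold")]

print(two_chars, exact, prefixed)
```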

Some performance comparisons:

1.

%%timeit
df.loc[df['Gold'].idxmax(),'Gold']
10000 loops, best of 3: 76.6 µs per loop

2.

%%timeit
s=df.max()
'%s' % (s['Gold'])
1000 loops, best of 3: 733 µs per loop

3.

%%timeit
rowbyrow()
10000 loops, best of 3: 71 µs per loop

4.

%%timeit
df['Gold'].max()
10000 loops, best of 3: 106 µs per loop

I was surprised to see that the function rowbyrow() had the fastest result.

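After creating a series with 10k random values, rowbyrow() still appeared to be the fastest. Note, though, that rowbyrow() is called below without an argument, so unless the function was redefined after df was reassigned, its default df=df (bound when the function was defined) still points at the original table, and the second timing does not measure the 10k-row series.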

Look here:

df = pd.DataFrame((np.random.rand(10000, 1)), columns=['Gold']) 

%%timeit  # no. 1
df['Gold'].max()

The slowest run took 10.30 times longer than the fastest.   
10000 loops, best of 3: 127 µs per loop


%%timeit  # no. 2
rowbyrow()

The slowest run took 8.12 times longer than the fastest.   
10000 loops, best of 3: 72.7 µs per loop
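One caveat about the comparison above, sketched with made-up numbers: a default argument such as df=df is evaluated once, when the function is defined. If df is reassigned afterwards and the function is called without arguments, it still works on the old frame, so it is safer to pass the frame explicitly when timing.

```python
import pandas as pd

df = pd.DataFrame({"Gold": [1, 5, 3]})  # hypothetical original table

def rowbyrow(df=df):  # the default is captured here, at definition time
    current_max = df["Gold"].iloc[0]
    for value in df["Gold"]:
        if value > current_max:
            current_max = value
    return current_max

# Rebind the name df to a different frame, as in the 10k-row experiment.
df = pd.DataFrame({"Gold": [100, 200, 300]})

stale = rowbyrow()    # still iterates over the original three rows
fresh = rowbyrow(df)  # iterates over the new frame

print(stale, fresh)
```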
Yoel Zajac
Pureluck
Thank you so much for your comprehensive answer! And yes, this is exactly a question from the Coursera course that you mentioned! I will check all the solutions you suggested again in my Jupyter Notebook. Have a great day, take care. – Yoel Zajac Dec 04 '18 at 18:51

Well, I checked all the solutions suggested above, and all of them return the same value: 976.

None of them returns 1022 (the right answer).

Look here:

here:

and also here:

The last picture shows that the returned value is already of type 'int64', and NOT of type 'str', whether I check the type with the dtype attribute before the following snippet:

def answer_one():
    return df['Gold'].max()

answer_one()

or after it.

Regarding the code line:

df.loc[:,'Gold'] = df.loc[:,'Gold'].str.replace(',','').astype(int)

which was proposed above to convert the column from 'str' (a string) to 'int64' (a number): it raises an error for me, since the column is not of type 'str' in the first place.
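For reference, a minimal sketch of that behaviour on made-up integer data: the .str accessor is only available on text columns, so the conversion can be guarded with a dtype check. The helper is_numeric_dtype comes from pandas.api.types; the sample values are assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical column that is already numeric.
df = pd.DataFrame({"Gold": [976, 1022, 83]})

# Using .str on a numeric column raises AttributeError.
try:
    df["Gold"].str.replace(",", "")
    raised = False
except AttributeError:
    raised = True

# Convert only when the column is not numeric yet.
if not is_numeric_dtype(df["Gold"]):
    df["Gold"] = df["Gold"].str.replace(",", "", regex=False).astype(int)

result = df["Gold"].max()
print(raised, result)
```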

Can anyone tell me why I don't get the right answer (976 instead of 1022)?
Is it a problem with my Jupyter Notebook, or maybe something else?

Thanks!

Yoel Zajac