Filtering data out of several columns

Question

I have this data frame:

import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()

The olympics.csv file is actually this list.

Here is a picture of the DataFrame (The first five countries):

Here is the code of the first ten countries:

{'# Summer': {'Afghanistan': 13,
  'Algeria': 12,
  'Argentina': 23,
  'Armenia': 5,
  'Australasia': 2,
  'Australia': 25,
  'Austria': 26,
  'Azerbaijan': 5,
  'Bahamas': 15,
  'Bahrain': 8},
 'Gold': {'Afghanistan': 0,
  'Algeria': 5,
  'Argentina': 18,
  'Armenia': 1,
  'Australasia': 3,
  'Australia': 139,
  'Austria': 18,
  'Azerbaijan': 6,
  'Bahamas': 5,
  'Bahrain': 0},
 'Silver': {'Afghanistan': 0,
  'Algeria': 2,
  'Argentina': 24,
  'Armenia': 2,
  'Australasia': 4,
  'Australia': 152,
  'Austria': 33,
  'Azerbaijan': 5,
  'Bahamas': 2,
  'Bahrain': 0},
 'Bronze': {'Afghanistan': 2,
  'Algeria': 8,
  'Argentina': 28,
  'Armenia': 9,
  'Australasia': 5,
  'Australia': 177,
  'Austria': 35,
  'Azerbaijan': 15,
  'Bahamas': 5,
  'Bahrain': 1},
 'Total': {'Afghanistan': 2,
  'Algeria': 15,
  'Argentina': 70,
  'Armenia': 12,
  'Australasia': 12,
  'Australia': 468,
  'Austria': 86,
  'Azerbaijan': 26,
  'Bahamas': 12,
  'Bahrain': 1},
 '# Winter': {'Afghanistan': 0,
  'Algeria': 3,
  'Argentina': 18,
  'Armenia': 6,
  'Australasia': 0,
  'Australia': 18,
  'Austria': 22,
  'Azerbaijan': 5,
  'Bahamas': 0,
  'Bahrain': 0},
 'Gold.1': {'Afghanistan': 0,
  'Algeria': 0,
  'Argentina': 0,
  'Armenia': 0,
  'Australasia': 0,
  'Australia': 5,
  'Austria': 59,
  'Azerbaijan': 0,
  'Bahamas': 0,
  'Bahrain': 0},
 'Silver.1': {'Afghanistan': 0,
  'Algeria': 0,
  'Argentina': 0,
  'Armenia': 0,
  'Australasia': 0,
  'Australia': 3,
  'Austria': 78,
  'Azerbaijan': 0,
  'Bahamas': 0,
  'Bahrain': 0},
 'Bronze.1': {'Afghanistan': 0,
  'Algeria': 0,
  'Argentina': 0,
  'Armenia': 0,
  'Australasia': 0,
  'Australia': 4,
  'Austria': 81,
  'Azerbaijan': 0,
  'Bahamas': 0,
  'Bahrain': 0},
 'Total.1': {'Afghanistan': 0,
  'Algeria': 0,
  'Argentina': 0,
  'Armenia': 0,
  'Australasia': 0,
  'Australia': 12,
  'Austria': 218,
  'Azerbaijan': 0,
  'Bahamas': 0,
  'Bahrain': 0},
 '# Games': {'Afghanistan': 13,
  'Algeria': 15,
  'Argentina': 41,
  'Armenia': 11,
  'Australasia': 2,
  'Australia': 43,
  'Austria': 48,
  'Azerbaijan': 10,
  'Bahamas': 15,
  'Bahrain': 8},
 'Gold.2': {'Afghanistan': 0,
  'Algeria': 5,
  'Argentina': 18,
  'Armenia': 1,
  'Australasia': 3,
  'Australia': 144,
  'Austria': 77,
  'Azerbaijan': 6,
  'Bahamas': 5,
  'Bahrain': 0},
 'Silver.2': {'Afghanistan': 0,
  'Algeria': 2,
  'Argentina': 24,
  'Armenia': 2,
  'Australasia': 4,
  'Australia': 155,
  'Austria': 111,
  'Azerbaijan': 5,
  'Bahamas': 2,
  'Bahrain': 0},
 'Bronze.2': {'Afghanistan': 2,
  'Algeria': 8,
  'Argentina': 28,
  'Armenia': 9,
  'Australasia': 5,
  'Australia': 181,
  'Austria': 116,
  'Azerbaijan': 15,
  'Bahamas': 5,
  'Bahrain': 1},
 'Combined total': {'Afghanistan': 2,
  'Algeria': 15,
  'Argentina': 70,
  'Armenia': 12,
  'Australasia': 12,
  'Australia': 480,
  'Austria': 304,
  'Azerbaijan': 26,
  'Bahamas': 12,
  'Bahrain': 1},
 'ID': {'Afghanistan': 'AFG',
  'Algeria': 'ALG',
  'Argentina': 'ARG',
  'Armenia': 'ARM',
  'Australasia': 'ANZ',
  'Australia': 'AUS',
  'Austria': 'AUT',
  'Azerbaijan': 'AZE',
  'Bahamas': 'BAH',
  'Bahrain': 'BRN'}}

Gold means: Number of summer gold medals.
Gold.1 means: Number of winter gold medals.
Gold.2 means: Number of ALL the gold medals.

The question is:

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?

To make it clearer:

$\frac{Summer\text{ }Gold - Winter\text{ }Gold}{Total\text{ }Gold} = maximum\text{ } value$

(Which means: finding the country which had the maximum number of summer gold medals AND minimum number of winter gold medals AND minimum number of total gold medals).

Only include countries that have won at least 1 gold in both summer and winter.

This function should return a single string value.

My solution:

I tried to solve this problem using two steps:

First:
Filtering only the countries which have gold medals (both in summer and winter).

This is the closest code I've reached:

df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]

But it returns a DataFrame, not a list.

Trying this code:

(df['Gold'] > 0) & (df['Gold.1'] > 0)

I've got a list of ALL the countries, and beside them True or False.
So it actually doesn't filter anything.

and also:

print(df.loc[df['Gold'] == True])

which seemed to give a wrong answer.

This is what I've got:

(Note that this list doesn't contain United States, for example).

Second:
Finding the highest value of the fraction:

Well, the codes:

(df['Gold'] - df['Gold.1']).idxmax()

and

df['Gold.2'].idxmin()

are interpreted just fine,

BUT:

(df['Gold'] - df['Gold.1']).idxmax() & df['Gold.2'].idxmin()

Gives me an error:

TypeError: unsupported operand type(s) for &: 'str' and 'str'

I've tried also this code:

df.loc[df['Gold'] > 0 & (df['Gold'] - df['Gold.1']).idxmax() & df['Gold.2'].idxmin()]

which gave me the same error:

TypeError: unsupported operand type(s) for &: 'int' and 'str'

Another problem which I thought about, but didn't get to this step because of the errors, is:

How is the fraction should be presented: a float number? something else?

I looked also here, here, here, here and here.

I found the following functions relevant:

where() , mask() , query() , isin() , all() etc.

But it seems to be a tough question, which non of the above links truly helped me to solve it.

Thanks a lot.

Can you show an example of your DataFrame - linking to wiki doesn't help as it's not easy for any one else to get sample data and how they may extract that data if they wanted to could well be different than the CSV file you've ended up with... — Jon Clements, Dec 15 '18 at 22:42
@JonClements I've added relevant pictures to my explanation. — Yoel Zajac, Dec 15 '18 at 22:53
Pictures don't help that much as it doesn't help others use the data... please consider putting the result of `df.head(10).to_dict()` as a code block in your post, that way people can copy/paste it into their interpreter and load it as a dataframe easily... :) — Jon Clements, Dec 15 '18 at 22:56
To be completely honest with you @YoelZajac, pictures help a little but not too much. The reason being that it's hard to convert them to the textual form and play around with the data a bit to be able to answer your question quickly and efficiently since few people can do these kinds of calculations in their heads without actually trying them out. — ayorgo, Dec 15 '18 at 22:57
@YoelZajac thanks - that gives people something to work with :) — Jon Clements, Dec 15 '18 at 23:14

CJR · Answer 1 · 2018-12-15T23:18:36.897

0

Based on the (first) picture you provided I'd do this

# Rename the columns to make more sense to me
df.columns = ['S-Participation', 'S-Gold', 'S-Silver', 'S-Bronze', 'S-Total', 'W-Participation', 'W-Gold', 'W-Silver', 'W-Bronze', 'W-Total', 'Combined-Participation', 'Combined-Gold', 'Combined-Silver', 'Combined-Bronze', 'Combined-Total', 'ID']

# Get the rows with medals in both
df = df.loc[(df['S-Gold'] > 0) & (df['W-Gold'] > 0), :]

# Calculate difference
df['Diff-Gold'] = (df['S-Gold'] - df['W-Gold']).abs() / (df['S-Gold'] + df['W-Gold'])

# Sort by difference and then take the top value
df = df.sort_values('Diff-Gold', ascending=False)
df['ID'].iloc[0]

I'm not gonna test it cause there's no data to do so.

edited Dec 15 '18 at 23:18

answered Dec 15 '18 at 23:01

CJR

3,916
2
10
23

Tell me what is missing and I'll add it to my question. :-) – Yoel Zajac Dec 15 '18 at 23:09
I've gotta go to sleep... working tomorrow :-( I'll try it tomorrow or so. Thank you so much anyway! :-D – Yoel Zajac Dec 15 '18 at 23:20
Your answer gives me: `'BUL'`, which means `Bulgaria`. Out of the fact that the code is interpreted (which is a good beginning...) - how can I know if the answer is true? It's totally obvious that one can't calculate all the rows manually! If one had to do it - there's no need for the computer... :-P **I'd be glad to know how you check the correctness of your answers**: in this question specifically, and more important: **In general**. Thanks a lot! :-) – Yoel Zajac Dec 29 '18 at 18:26
I assume that you got this to work and I shouldn't reply to your other answer. The best way to test code like this is with some type of [unit test](https://en.wikipedia.org/wiki/Unit_testing). Make a test data set that you know what the correct answer should be, and make sure that your code outputs the same answer. Your test data in the question has 10 countries; 2 of them remain after filtering, and this code yields the correct answer of those two. – CJR Jan 02 '19 at 01:29
Another possibility is to look at the top 4-5 rows after sorting by `Diff-Gold` and check and make sure that they're correctly calculated. – CJR Jan 02 '19 at 01:30

Yoel Zajac · Answer 2 · 2018-12-29T12:48:10.697

I'm trying to see whether the answer which is suggested above is good.

First of all, I'm trying to separate the code into each one of the steps, thus making it easier to analyse and understand.

The first code line, which is:

# Rename the columns to make more sense to me
df.columns = ['S-Participation', 'S-Gold', 'S-Silver', 'S-Bronze', 'S-Total', 'W-Participation', 'W-Gold', 'W-Silver', 'W-Bronze', 'W-Total', 'Combined-Participation', 'Combined-Gold', 'Combined-Silver', 'Combined-Bronze', 'Combined-Total', 'ID']

gives me an error:

Length mismatch: Expected axis has 17 elements, new values have 16 elements

First question:

Having this error, I don't understand why does the interpreter continue interpreting the rest of the code...

Second question:

Following this answer and this answer, I'm still feeling confused, and don't know which one of the following:

index_col=0 , index_col=False or index_col=None

might solve the error in the code line above, and how to embed the right command (out of the three) in it.

Answers to both questions will be greatly appreciated.

Thanks!

Well, I'm actually hesitating if it was the right place asking separately about each line... May I write it as a different question? Or complete all my questions regarding @CJ59 answer? I just thought it is the right place to write my feedback, because it is all related to the same question (all related questions are in one place, instead of messing up the site with many sub-questions, which are all actually related to this question). What would you recommend for me to do guys? — Yoel Zajac, Dec 29 '18 at 15:21

score 0 · Answer 3 · answered Jun 27 '19 at 21:56

import pandas as pd
import numpy as np

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()
def answer_one():
    Diff = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((Diff['Gold'] - Diff['Gold.1']) / Diff['Gold.2']).idxmax()
answer_one()

Filtering data out of several columns

3 Answers3