I am starting a new practice module in pandas that covers indexing and filtering data. I have come across a form of chaining that was not explained in the course, and I was wondering if anyone could help me make sense of it. The dataset is the Fortune 500 company listings.

import pandas as pd
df = pd.read_csv('f500.csv', index_col=0)

The issue is that we have been taught to use boolean indexing by passing the boolean condition to the DataFrame, like so:

motor_bool = df["industry"] == "Motor Vehicles and Parts"
motor_countries = df.loc[motor_bool, "country"]

The above code finds the countries of the companies whose industry is "Motor Vehicles and Parts". The last exercise in the module asks us to

"Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA."

And the answer code is

industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)

I don't understand how we can suddenly use df[col][df[col] == value] back to back. Am I not supposed to pass the boolean condition first and then specify which column I want using .loc? The chaining used here is very different from what we have practiced.

Please help. I am truly confused.

As always, thank you, Stack community.

1 Answer

I think the last solution is not recommended. It is better to use DataFrame.loc, as in your second snippet, to select the industry column by the mask and then get the counts:

industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)
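As for why the chained form works at all: f500["industry"] returns a Series, and a Series can itself be indexed with a boolean mask, so the second pair of brackets filters the rows of that Series. Here is a minimal sketch of the comparison, assuming a small made-up DataFrame in place of f500.csv (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for f500.csv; the rows are invented for illustration.
f500 = pd.DataFrame(
    {
        "industry": ["Banks", "Retail", "Banks",
                     "Motor Vehicles and Parts", "Retail", "Banks"],
        "country": ["USA", "USA", "USA", "Japan", "China", "USA"],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Boolean mask: True for rows whose country is "USA".
usa_mask = f500["country"] == "USA"

# Chained form: first pick the column (a Series), then filter it by the mask.
chained = f500["industry"][usa_mask]

# Single-step form: rows and column are selected in one .loc call.
via_loc = f500.loc[usa_mask, "industry"]

# Both selections contain the same rows and values.
same_selection = chained.equals(via_loc)

# Counts of the two most common industries among the USA rows.
industry_usa = via_loc.value_counts().head(2)
print(same_selection)
print(industry_usa)
```

For reading data, both forms give the same result. The .loc form is usually the safer habit, because chained indexing can behave unexpectedly when you assign values through it.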

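Another solution uses Series.nlargest on the counts (nlargest needs numeric values, so it is applied after value_counts rather than to the industry strings directly):

industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().nlargest(2)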
jezrael
  • Thank you so much. It worked perfectly. Also, thank you for consistently helping me learn Python. I have learned a lot from your comments. You are a pandas guru! – Oscar Agbor May 04 '20 at 10:28
  • For some reason, when I tried the same thing with another mask, it doesn't work: `boul = df['previous_rank'].isnull()` followed by `df.loc[df['previous_rank'].isnull()]`. The code above only returns the column labels, i.e. a blank DataFrame. – Oscar Agbor May 04 '20 at 15:47
  • @Oscar Agbor Hard question. Is it the same DataFrame? Is it possible that there is no missing data? Or are the NaNs stored as strings, so you need `df.loc[df['previous_rank'] == 'NaN']`? – jezrael May 04 '20 at 16:59
  • Yes, it is the same DataFrame. Some rows have missing values: `df.isnull().sum()` – Oscar Agbor May 04 '20 at 19:41
  • @Oscar Agbor Can you check for missing values in the `previous_rank` column with `df['previous_rank'].isnull().sum()`? `df.isnull().sum()` tests all columns, so it is possible that there are no missing values in the `previous_rank` column. – jezrael May 05 '20 at 01:00
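The checks suggested in these comments can be sketched as follows. This assumes a small made-up DataFrame, not the real f500 data; the `as_text` column is hypothetical and only shows what happens when missing values are stored as the string 'NaN':

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column with real missing values (np.nan) and one
# column where "missing" is stored as the literal string "NaN".
df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "previous_rank": [1.0, np.nan, 3.0, np.nan],
        "as_text": ["1", "NaN", "3", "NaN"],
    }
)

# Counts missing values in every column, so a nonzero total does not
# tell you which column they are in.
missing_per_column = df.isnull().sum()

# Counts missing values in the one column of interest.
missing_in_previous_rank = df["previous_rank"].isnull().sum()

# Rows where previous_rank is genuinely missing.
rows_with_missing_rank = df.loc[df["previous_rank"].isnull()]

# String "NaN" values are ordinary strings, so isnull() does not flag them.
text_nan_detected = df["as_text"].isnull().sum()

# They have to be matched by comparing against the string instead.
text_nan_rows = df.loc[df["as_text"] == "NaN"]

print(missing_per_column)
print(rows_with_missing_rank)
print(text_nan_rows)
```

If the isnull() mask on the real column selects no rows, the per-column count shows whether that column has no missing values at all, and the string comparison shows whether they were read in as text.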