15
import pandas as pd

businesses = pd.read_json(businesses_filepath, lines=True, encoding='utf_8')
restaurantes = businesses['Restaurants' in businesses['categories']]

I would like to remove the rows that do not have Restaurants in the categories column (this column contains lists), but the code above raises the error 'KeyError: False'. I would like to understand why, and how to fix it.

panchester
  • 325
  • 1
  • 4
  • 13
  • 3
    `'Restaurants' in businesses['categories']` is a native python expression that evaluates to the scalar `False` (or `True`). Is `'Restaurants'` not simply one of the categories? In that case you need `businesses[businesses.categories == 'Restaurants']`. – Andras Deak -- Слава Україні Jul 02 '17 at 21:33

5 Answers

27

The expression `'Restaurants' in businesses['categories']` returns the single boolean value `False` (for a Series, `in` tests membership in the index, not in the values). That boolean is then passed to the brackets indexing operator of the DataFrame `businesses`, which has no column named `False` and therefore raises a `KeyError`.

What you are looking for is something called boolean indexing, which works like this:

businesses[businesses['categories'] == 'Restaurants']
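Since the asker's categories column holds lists rather than plain strings, the `==` comparison may not match them directly; a hedged sketch of the same boolean-indexing idea with an element-wise membership test (toy data, made-up column values):

```python
import pandas as pd

# Toy frame standing in for the question's data: 'categories' holds lists
businesses = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'categories': [['Restaurants', 'Bars'], ['Shopping'], ['Restaurants']],
})

# Element-wise membership test; the resulting boolean Series is what
# the brackets operator expects for row filtering
mask = businesses['categories'].apply(lambda cats: 'Restaurants' in cats)
restaurantes = businesses[mask]
print(restaurantes['name'].tolist())  # ['A', 'C']
```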
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • I'm experiencing the same issue in my code: ```df.loc[('Error:' in df['Quick Insights'])``` Problem is that I need the ```in``` part, as the word "Error:" is just a substring of ```quick insights```. How would you address that? Is there any operator that acts like SQL's ```like```? – goidelg Aug 19 '21 at 06:49
  • 2
    solved using ```df[df['Quick Insights'].str.match('(Error:)', na=False)]``` – goidelg Aug 19 '21 at 06:56
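On the substring question in the comments: `Series.str.contains` is the closest pandas analog to SQL's LIKE, whereas `str.match` only anchors at the start of the string. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Quick Insights': ['Error: disk full', 'all good', None]})

# regex=False treats the pattern as a literal substring; na=False counts
# missing values as non-matches instead of propagating NaN
errors = df[df['Quick Insights'].str.contains('Error:', regex=False, na=False)]
print(errors['Quick Insights'].tolist())  # ['Error: disk full']
```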
6

The reason for this is that the `Series` class implements a custom `in` operator: it tests membership in the index and returns a single boolean, rather than the element-wise result that `==` produces. Here's a workaround:

businesses[['Restaurants' in c for c in list(businesses['categories'])]]

Hopefully this helps anyone looking for a substring match in the column rather than an exact match.

iggy12345
  • 1,233
  • 12
  • 31
  • I'm considering just regex matching what I'm looking for, turning it into a new column and indexing using the new column – dang Aug 08 '21 at 23:17
5

If you find that your data contains spelling variations or alternative restaurant-related terms, the following may be of benefit. Essentially you put your restaurant-related terms in `restaurant_lst`. The lambda function returns `True` if any of the items in `restaurant_lst` is contained within a given row of the categories column. The `.loc` indexer then filters out the rows for which the lambda returned `False`.

restaurant_lst = ['Restaurant','restaurantes','diner','bistro']
restaurant = businesses.loc[businesses['categories'].apply(lambda x: any(restaurant_str in x for restaurant_str in restaurant_lst))]
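A minimal, self-contained sketch of this any()-over-terms filter on toy data (note that the apply runs on the categories column itself; the column values here are invented):

```python
import pandas as pd

# Toy data; column names and values are assumptions, not the asker's file
businesses = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'categories': ['Mexican Restaurant', 'bistro cafe', 'Shopping'],
})

restaurant_lst = ['Restaurant', 'restaurantes', 'diner', 'bistro']

# True when any search term occurs in the row's categories value
mask = businesses['categories'].apply(
    lambda x: any(term in x for term in restaurant_lst))
restaurant = businesses.loc[mask]
print(restaurant['name'].tolist())  # ['A', 'B']
```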
Joe
  • 51
  • 1
  • 1
1

I think what you meant was:

businesses = businesses.loc[businesses['categories'] == 'Restaurants']

That will keep only the rows whose category is 'Restaurants'.

Rayhane Mama
  • 2,374
  • 11
  • 20
0

None of the answers here actually worked for me.

businesses[businesses['categories'] == 'Restaurants']

This obviously won't work, since the value in 'categories' is not a string but a list, so the comparison will always fail.

What does work, however, is converting the lists in the column into tuples:

businesses['categories'] = businesses['categories'].apply(tuple)

That allows you to use the standard .loc comparison:

businesses.loc[businesses['categories'] == ('Restaurants',)]
RedAero
  • 49
  • 3