3

I'm new to Python and am still trying to tear myself away from C++ coding techniques while in Python, so please forgive me if this is a trivial question. I can't seem to find the most Pythonic way of doing this.

I have two lists of dicts. The individual dicts in both lists may contain nested dicts. (It's actually some Yelp data, if you're curious.) The first list of dicts contains entries like this:

{'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ',
 'categories': ['Restaurants'],
 'type': 'business',
 ...}

The second list of dicts contains entries like this:

{'business_id': 'vcNAWiLM4dR7D2nwwJ7nCA',
 'date': '2010-03-22',
 'review_id': 'RF6UnRTtG7tWMcrO2GEoAg',
 'stars': 2,
 'text': "This is a basic review",
 ...}

What I would like to do is extract all the entries in the second list that match specific categories in the first list. For example, if I'm interested in restaurants, I only want the entries in the second list whose business_id matches the business_id of an entry in the first list whose categories list contains Restaurants.

If I had these two lists as tables in SQL, I'd do a join on the business_id attribute then just a simple filter to get the rows I want (where Restaurants IN categories, or something similar).

These two lists are extremely large, so I'm running into both efficiency and memory space issues. Before I go and shove all of this into a SQL database, can anyone give me some pointers? I've messed around with Pandas some, so I do have some limited experience with that. I was having trouble with the merge process.

TheOriginalBMan

6 Answers

2

Suppose your lists are called l1 and l2:

All elements from l1:

[each for each in l1]

All elements from l1 with the Restaurant category:

[each for each in l1
      if 'Restaurants' in each['categories']]

All elements from l2 matching id with elements from l1 with the Restaurant category:

[x for each in l1 for x in l2 
   if 'Restaurants' in each['categories']
   and x['business_id'] == each['business_id'] ]
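
To try these comprehensions out, here is a minimal sketch on made-up data. The ids, star ratings, and review texts below are assumptions for illustration, not real Yelp records.

```python
# Hypothetical sample data; the real records carry many more keys.
l1 = [
    {'business_id': 'a1', 'categories': ['Restaurants'], 'type': 'business'},
    {'business_id': 'b2', 'categories': ['Printing'], 'type': 'business'},
]
l2 = [
    {'business_id': 'a1', 'stars': 2, 'text': 'This is a basic review'},
    {'business_id': 'b2', 'stars': 4, 'text': 'Quick turnaround'},
    {'business_id': 'a1', 'stars': 5, 'text': 'Great food'},
]

# Businesses in the Restaurants category.
restaurants = [each for each in l1 if 'Restaurants' in each['categories']]

# Reviews whose business_id matches one of those businesses.
restaurant_reviews = [x for each in l1 for x in l2
                      if 'Restaurants' in each['categories']
                      and x['business_id'] == each['business_id']]

print(len(restaurants))                          # 1
print([x['stars'] for x in restaurant_reviews])  # [2, 5]
```

Note that the last comprehension compares every element of l1 with every element of l2, so it may be slow on very large lists.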
elyase
  • Thanks for this! I really like how you broke down the list comprehensions. This is one thing that's taking me a while to fully comprehend in Python. – TheOriginalBMan Jan 22 '15 at 02:36
2

Let's define sample lists of dictionaries:

first = [
        {'business_id':100, 'categories':['Restaurants']},
        {'business_id':101, 'categories':['Printer']},
        {'business_id':102, 'categories':['Restaurants']},
        ]

second = [
        {'business_id':100, 'stars':5},
        {'business_id':101, 'stars':4},
        {'business_id':102, 'stars':3},
        ]

We can extract the items of interest in two steps. The first step is to collect the list of business ids that belong to restaurants:

ids = [d['business_id'] for d in first if 'Restaurants' in d['categories']]

The second step is to get the dicts that correspond to those ids:

[d for d in second if d['business_id'] in ids]

This results in:

[{'business_id': 100, 'stars': 5}, {'business_id': 102, 'stars': 3}]
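
If the goal is closer to the SQL join mentioned in the question, one possible variation is to index the first list by business_id and merge each review with its business. This is only a sketch: the name by_id is an assumption, and the {**a, **b} merge syntax needs Python 3.5 or later.

```python
first = [
    {'business_id': 100, 'categories': ['Restaurants']},
    {'business_id': 101, 'categories': ['Printer']},
    {'business_id': 102, 'categories': ['Restaurants']},
]
second = [
    {'business_id': 100, 'stars': 5},
    {'business_id': 101, 'stars': 4},
    {'business_id': 102, 'stars': 3},
]

# Index the businesses by id so each lookup is a single dict access.
by_id = {d['business_id']: d for d in first}

# "Join" each review to its business, keeping only restaurants.
joined = [
    {**by_id[r['business_id']], **r}
    for r in second
    if r['business_id'] in by_id
    and 'Restaurants' in by_id[r['business_id']]['categories']
]

print(joined)
# [{'business_id': 100, 'categories': ['Restaurants'], 'stars': 5},
#  {'business_id': 102, 'categories': ['Restaurants'], 'stars': 3}]
```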
John1024
1

This is pretty tricky, and I had fun with it. This is what I'd do:

def match_fields(business, review):
    return business['business_id'] == review['business_id'] and 'Restaurants' in business['categories']

def search_businesses(review):
    # True if at least one business matches the given review
    return any(match_fields(business, review) for business in business_list)

# filter() is lazy in Python 3, so wrap it in list() to get the results
answer = list(filter(search_businesses, review_list))

This is the most readable way I found. I'm not terribly fond of list comprehensions that go past one line, and three lines is really pushing it. If you want this to look more terse, just use shorter variable names. I favor long ones for clarity.

I defined a function that returns true if an entry can be matched between lists, and a second function that helps me search through the review list. I then can say: get rid of any review that doesn't have a matching entry in the businesses list. This pattern works well with arbitrary checks between lists.
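
As a rough sketch of how that pattern might generalize to arbitrary checks, the matching rule can be passed in as an argument. The helper name keep_if_any_match and the sample data are assumptions for illustration only.

```python
def keep_if_any_match(items, candidates, matches):
    """Return the items for which matches(candidate, item) is true for some candidate."""
    return [item for item in items
            if any(matches(candidate, item) for candidate in candidates)]

def is_restaurant_review(business, review):
    return (business['business_id'] == review['business_id']
            and 'Restaurants' in business['categories'])

# Hypothetical sample data.
business_list = [
    {'business_id': 100, 'categories': ['Restaurants']},
    {'business_id': 101, 'categories': ['Printer']},
]
review_list = [
    {'business_id': 100, 'stars': 5},
    {'business_id': 101, 'stars': 4},
]

answer = keep_if_any_match(review_list, business_list, is_restaurant_review)
print(answer)  # [{'business_id': 100, 'stars': 5}]
```

Like the original, this still checks each review against every business in the worst case, so it trades speed for flexibility.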

jack
  • I like this as well. Coming from OOP and functional programming background, this is definitely easy to understand. Thanks for this! – TheOriginalBMan Jan 22 '15 at 02:37
  • @TheOriginalBMan, just so you know in Python list comprehensions [are preferred](http://stackoverflow.com/questions/1247486/python-list-comprehension-vs-map) to map, one could say it is the Pythonic way of doing functional style. Of course this is subjective and it might be justified in some cases. – elyase Jan 22 '15 at 02:43
  • @elyase No one language feature in Python is preferred in all circumstances. In this case, a list comprehension would need to do a lot of logic, and doesn't end up being very readable, as we see in the above answers. 'Practicality beats purity. Readability counts.' -the Zen – jack Jan 22 '15 at 02:44
  • @jack, agree with your first point, that is why I wrote "it might be justified in some cases". I don't agree that my solution has more logic and about what is more readable, this is subjective and I respect that you see it differently but I have just shown both solutions to my GF (non dev, literature background) and she just told me she has no idea what your solution does while mine reads as sentence. That is exactly how I see it. – elyase Jan 22 '15 at 02:54
  • @elyase I'm not saying that your solution needs any more logic than mine, only that putting that same logic in a list comprehension gets really cramped. Also: showing your code to a non-dev means little, since you're essentially just comparing the look of two language features regardless of the application. 'filter' and 'any' are built-in for good reason, and another dev can reasonably be expected to know them. – jack Jan 22 '15 at 02:58
1

Python programmers like using list comprehensions as a way to do both their logic and their design.

List comprehensions lead to terser and more compact expression. You're right to think of it quite a lot like a query language.

x = [comparison(a, b) for (a, b) in zip(A, B)] 
x = [comparison(a, b) for (a, b) in itertools.product(A, B)] 
x = [comparison(a, b) for a in A for b in B if test(a, b)]
x = [comparison(a, b) for X in Y for (a, b) in X if test(a, b, X)]

...are all patterns that I use.

ramsey0
1

As a variation on the list-comprehension-only approaches, it may be more efficient to use a set built from a generator expression. This is especially true if your first list is very large or if the total number of restaurants is very large.

restaurant_ids = set(biz['business_id'] for biz in first if 'Restaurants' in biz['categories'])
restaurant_data = [rest for rest in second if rest['business_id'] in restaurant_ids]

Note that the brute-force list comprehension approach is O(len(first)*len(second)) but uses no additional memory, whereas this approach is O(len(first)+len(second)) and uses O(number_of_restaurants) extra memory for the set.
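
If memory is the main concern, one option is to yield matching reviews one at a time instead of building the whole result list. The sketch below assumes the business_id and categories keys from the question; the function name restaurant_reviews and its category parameter are made up for illustration.

```python
def restaurant_reviews(businesses, reviews, category='Restaurants'):
    """Yield each review whose business_id belongs to a business in `category`."""
    # The set gives average O(1) membership tests.
    wanted_ids = {biz['business_id'] for biz in businesses
                  if category in biz['categories']}
    for review in reviews:
        if review['business_id'] in wanted_ids:
            yield review

# Hypothetical sample data.
first = [
    {'business_id': 100, 'categories': ['Restaurants']},
    {'business_id': 101, 'categories': ['Printer']},
    {'business_id': 102, 'categories': ['Restaurants']},
]
second = [
    {'business_id': 100, 'stars': 5},
    {'business_id': 101, 'stars': 4},
    {'business_id': 102, 'stars': 3},
]

# reviews can be any iterable, so it could be read lazily from a file.
for review in restaurant_reviews(first, iter(second)):
    print(review)
# {'business_id': 100, 'stars': 5}
# {'business_id': 102, 'stars': 3}
```

Only the set of matching ids is held in memory; the reviews themselves are processed one by one.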

b4hand
0

You could do:

restaurant_ids = [biz['business_id'] for biz in list1 if 'Restaurants' in biz['categories']]
restaurant_data = [rest for rest in list2 if rest['business_id'] in restaurant_ids]

Then restaurant_data would contain all of the dictionaries from list2 that contain restaurant data.

b4hand
mway