
I have defined my dictionary as follows:

grocery_dict={"apple":"fruite", "pepper": "veg", "spaghetthi":"pasta", "banana":"fruite", "tomato":"fruite"}

and my list is grocery_list=["apple","bananas","pizza","pepper"]. I have written code that looks up each item and returns its category:

gl=[] 
for item in grocery_list:
    if item in grocery_dict:
        x=grocery_dict[item]
        gl.append(x)
    else:
        x='other'
        gl.append(x)
print(gl)

Next I can calculate how many times each category occurs. My issue is how to check whether part of a word exists in the dictionary, for example when I have items such as "Mexican Pepper" or "tomatto", and how to ignore capital letters in a string.

Another question: is it possible to use PySpark for such cases?

Thank you in advance

bruno desthuilliers
A.Dorra
  • See: [Case insensitive dictionary](https://stackoverflow.com/q/2082152/1782792), [fastest way to search python dict with partial keyword](https://stackoverflow.com/q/18066603/1782792). – jdehesa Mar 14 '18 at 16:26
  • For partial string matches, you can use things like `[key for key in grocery_dict if item in key or key in item]` to get a list of partially matching keys. For misspellings and similar words, you're going to want to use a library like `fuzzywuzzy`. Here's a brief introduction: https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/ – Patrick Haugh Mar 14 '18 at 16:28

1 Answer


This actually has very little to do with dicts and is mostly about string manipulation and natural language processing.

With regard to capitalisation and upper/lower case, the solution is simple: only use all-lowercase strings as keys in your dict and apply the .lower() method to all strings in your list, i.e.:

grocery_list = ["apple","bananas","pizza","pepper"]
normalized_list = [word.lower() for word in grocery_list]

Handling terms like "Mexican Pepper" will be harder. You can of course split the string and look up each part, but if you have something like "Apple Tomato" in your list, there is no way to tell whether you want "apple" or "tomato". Handling spelling mistakes will require something like a spellchecker, and here again you can't be sure you'll get a failsafe, unambiguous answer.
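
To make the splitting idea concrete, here is a minimal sketch using only the standard library. The function name categorize and the cutoff of 0.8 are my own assumptions, and difflib.get_close_matches stands in for a real spellchecker. It tries exact word matches first, then falls back to approximate matches. The first matching word wins, which is an arbitrary choice and does not resolve the "Apple Tomato" ambiguity.

```python
import difflib

grocery_dict = {"apple": "fruite", "pepper": "veg", "spaghetthi": "pasta",
                "banana": "fruite", "tomato": "fruite"}


def categorize(item, mapping, cutoff=0.8):
    """Return the category for item, or "other" if no word matches.

    Hypothetical helper: cutoff is an assumed similarity threshold
    (0 to 1) that you would need to tune for your own data.
    """
    words = item.lower().split()

    # Pass 1: exact match on any single word of the item.
    for word in words:
        if word in mapping:
            return mapping[word]

    # Pass 2: approximate match, to tolerate small spelling mistakes.
    for word in words:
        close = difflib.get_close_matches(word, list(mapping), n=1, cutoff=cutoff)
        if close:
            return mapping[close[0]]

    return "other"


for name in ["Mexican Pepper", "tomatto", "bananas", "pizza"]:
    print(name, "->", categorize(name, grocery_dict))
```

A lower cutoff catches more misspellings but also produces more false matches, so the results should be checked against a sample of real data.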

As a side note: your current code can be vastly simplified:

gl = [grocery_dict.get(name, "other") for name in grocery_list]
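
Combining this with the lowercasing above, counting the categories could look like the sketch below. The mixed-case grocery_list is a made-up variant of the one in the question, and collections.Counter is just one convenient way to do the count.

```python
from collections import Counter

grocery_dict = {"apple": "fruite", "pepper": "veg", "spaghetthi": "pasta",
                "banana": "fruite", "tomato": "fruite"}

# Hypothetical input with mixed capitalisation.
grocery_list = ["Apple", "bananas", "pizza", "PEPPER"]

# Lowercase each name before the lookup; unknown names fall back to "other".
categories = [grocery_dict.get(name.lower(), "other") for name in grocery_list]

# Count how many items fall into each category.
counts = Counter(categories)
print(counts)
```

Note that "bananas" still ends up in "other" here, since only exact keys match.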
bruno desthuilliers