0

I want to replace the substrings of words in the below column with a defined category.

My dataframe contains a column 'amenities' and there are 14K rows in the dataset. For each cell of 'amenities' there is long list of substrings. Please see the picture attached.

enter image description here

Sample Data:

Listing ID Amenities
01 "TV", "Machine", "Refrigerator", "hair dryer", "micro", "parking lot", "shower", "body soap"
02 "HDTV", "washing machine", "fridge", "microwave", "parking", "soap"
03 "TV with netflix", "free parking", "soap"
04 "TV with Amazon Prime", "parking area", "Refrigerator", "dryer for hair", "microwave"

As you can see that each amenity is mentioned in different styles in different cells. For Example, TV is mentioned as TV, HDTV, TV with netflix, 22 inch TV.

Basically, i want to replace the related amenities and put them in one category, so for anything that contains TV, will be replaced with word TV, but as there are other amenities written as substring in "" in the same cell, that is making it complicated for me.

Can anyone please help me.

TWG
  • 33
  • 5
  • 3
    Can you post your data in a [reproducible format](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)? It makes the life of people helping you easier. – Nick ODell Dec 04 '21 at 01:02
  • Hi Nick, i have added sample data in reproducible format. Can you please help me now. Thanks a lot. – TWG Dec 04 '21 at 16:14

1 Answers1

0

Is this what you are looking for? Loop through each word and replace with "TV"? It'll take the list of amenities (split by ', '), then iterate.

df['Amenities'].apply(lambda x: ['"TV"' if 'TV' in w else w for w in x.split(', ')])


0    ["TV", "Machine", "Refrigerator", "hair dryer", "micro", "parking lot", "shower", "body soap"]
1                               ["TV", "washing machine", "fridge", "microwave", "parking", "soap"]
2                                                                    ["TV", "free parking", "soap"]
3                             ["TV", "parking area", "Refrigerator", "dryer for hair", "microwave"]

If you have multiple conditions, you can create a function instead of using lambda. then pass the function via apply()

def replaceword(x):
    nlist = []
    for w in x.split(', '):
        if 'TV' in w:
            nlist.append('"TV"')
        #elif:
            #other conditions
        else:
            nlist.append(w)
    return nlist

df['Amenities'].apply(replaceword)
Jonathan Leon
  • 5,440
  • 2
  • 6
  • 14
  • And how can i check if the code has effected the whole 14K rows? – TWG Dec 04 '21 at 17:11
  • I have checked manually, its not working... – TWG Dec 04 '21 at 17:36
  • What’s not working? It works for the example provided. Errors? Output incorrect? – Jonathan Leon Dec 04 '21 at 19:10
  • I applied it to my original dataframe. The output is incorrect. I could still see TV mentioned differently. I tried only with TV condition to see if it works correctly. But it didn't. I checked manually. – TWG Dec 04 '21 at 22:53
  • if you are still needing help, you can modify your sample data with the data that isn't producing the correct results and show the incorrect results produced vs expected – Jonathan Leon Dec 05 '21 at 02:50