Count unique values, ignore the spelling

Question

I have a problem. I have the following dataframe. I want to count all the unique values. As you can see the problem is, that some of the words are uppercase or lowercase but are compleately the same thing i want to count. So in my case "Wifi" and "wifi" should be counted as 2. Same for the others. Is there a way i can do that by for example ignore the upper and lower case? And as you can see there are different writings for wifi (for example "Wifi 230 mb/s") is there a way to count the wifis when wifi is in the string?

d = {'host_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
     'value': ['Hot Water', 'Wifi', 'Kitchen',
               'Wifi', 'Hot Water',
               'Coffe Maker', 'wifi', 'hot Water', 'Wifi 230 mb/s']}
df = pd.DataFrame(data=d)

print(df)


print(len(df[df['value'].str.contains("Wifi", case=False)]))
print(df['value'].unique())
print(len(df['value'].unique()))

[out]
   host_id        value
0        1    Hot Water
1        1         Wifi
2        1      Kitchen
3        2         Wifi
4        2    Hot Water
5        3  Coffe Maker
6        3         wifi
7        3    hot Water
8        3    Wifi 230 mb/s

4 # count wifi

['Hot Water' 'Wifi' 'Kitchen' 'Coffe Maker' 'wifi' 'hot Water'] # unique values
6 # len unique values

What [out] should look like:

        value     count
0   Hot Water         3
1        Wifi         4
2     Kitchen         1
3 Coffe Maker         1

jezrael · Accepted Answer · 2021-11-15T10:28:25.887

2

If there is problem only with wifi - possible another substrings use:

df['value'] = (df['value'].mask(df['value'].str.contains("Wifi", case=False), 'wifi')
                          .str.title())
print (df)
   host_id        value
0        1    Hot Water
1        1         Wifi
2        1      Kitchen
3        2         Wifi
4        2    Hot Water
5        3  Coffe Maker
6        3         Wifi
7        3    Hot Water
8        3         Wifi

print(df['value'].value_counts())
Wifi           4
Hot Water      3
Kitchen        1
Coffe Maker    1
Name: value, dtype: int64


print(df.groupby('value', sort=False).size().reset_index(name='count'))
         value  count
0    Hot Water      3
1         Wifi      4
2      Kitchen      1
3  Coffe Maker      1

EDIT:

#counts original values wit hconvert to uppercase first latters
s = df['value'].str.title().value_counts()
print (s)
Wifi             3
Hot Water        3
Wifi 230 Mb/S    1
Kitchen          1
Coffe Maker      1
Name: value, dtype: int64

#filter if counts greater like N
N = 2
good = s.index[s.gt(N)]       
print (good)
Index(['Wifi', 'Hot Water'], dtype='object')

#extract values by list good

import re

pat = '|'.join(r"\b{}\b".format(x) for x in good)
df['new'] = df['value'].str.extract(rf'({pat})', expand=False, flags=re.I).str.title()
print (df)
   host_id          value        new
0        1      Hot Water  Hot Water
1        1           Wifi       Wifi
2        1        Kitchen        NaN
3        2           Wifi       Wifi
4        2      Hot Water  Hot Water
5        3    Coffe Maker        NaN
6        3           wifi       Wifi
7        3      hot Water  Hot Water
8        3  Wifi 230 mb/s       Wifi

df1 = df.groupby('new', sort=False).size().reset_index(name='count')
print (df1)
         new  count
0  Hot Water      3
1       Wifi      4


#get values not matched to good list (working if no NaNs in original column)    
df2 = df[df['new'].isna()].groupby('value', sort=False).size().reset_index(name='count')
print (df2)
         value  count
0      Kitchen      1
1  Coffe Maker      1

If need both:

df = pd.concat([df1, df2], ignore_index=True)

edited Nov 15 '21 at 10:28

answered Nov 15 '21 at 09:51

jezrael

822,522
95
1,334
1,252

Thank you very much for your effort. The problem is that there is problem with also the other values and not only wifi. – Test Nov 15 '21 at 09:56
1

@Test - iwas worried aboit it. Is possible specified values for counts? How is possible identify `good substrings` for count and `bad substrings` for remove, replace to space? – jezrael Nov 15 '21 at 09:57
For example you can see this line of code `(df.groupby('value', sort=False).size().reset_index(name='count')` this give us back the values, and lets say if a value is counted more than 3 time than it is good substrig. And all below sould be ignored. – Test Nov 15 '21 at 10:04
1

@Test - answer was edited. – jezrael Nov 15 '21 at 10:14
Wow! Thank you very much that is perfect for my project. The cherry on the cake would be to write out the disselected values. So in that case we get for example a second list with the values which havent made it into the good substring. I would like the final count to include the values that are not taken into account by the substring. – Test Nov 15 '21 at 10:22
What I mean `0 Hot Water 3 1 Wifi 4 2 Kitchen 1 3 Coffe Maker 1` but with your perfect code with the substrings – Test Nov 15 '21 at 10:23
1

@Test - hmmm, finally are only 2 values, `Hot Water` and `Wifi`, so `Kitchen` and `Coffe` not possible processing. – jezrael Nov 15 '21 at 10:26
1

@Test - answer was edited. with column `new` and counts. – jezrael Nov 15 '21 at 10:30
Thank you!! That is what I want, but `df2` is empty when I run the code :( – Test Nov 15 '21 at 10:31
1

@Test - hmm, not some typo? – jezrael Nov 15 '21 at 10:32
`Empty DataFrame Columns: [value, count] Index: []` What I got, sorry for that. I copied every line of your code – Test Nov 15 '21 at 10:34
1

@Test - in sample data? Do you use column `new` ? Is correct `new` and then `value` in `df2 = df[df['new'].isna()].groupby('value', sort=False).size().reset_index(name='count')` ? – jezrael Nov 15 '21 at 10:36
For the short time i uploaded a picture of your code (see in the question). So you can see my outputs. Hope that helps ! – Test Nov 15 '21 at 10:38
@Test - Problem is in your code is necessary replace `df['value'] = df['value'].str.extract(rf'({pat})', expand=False, flags=re.I).str.title()` to `df['new'] = df['value'].str.extract(rf'({pat})', expand=False, flags=re.I).str.title()` - check `value` column has no NaNs like in your ouptut. – jezrael Nov 15 '21 at 10:41

score -1 · Answer 2 · answered Nov 15 '21 at 09:48

-1

Use str.lower() or str.upper() method on your list before comparing them. That should eliminate duplicates. If you would like to eliminate typos or other similar strings you can use python-Levenshtein to calculate distance and set 'cut off point' https://pypi.org/project/python-Levenshtein/

answered Nov 15 '21 at 09:48

Edds

1
2

Thank you. And how could I count the `value`s? – Test Nov 15 '21 at 09:50
This post will answer your question in details. https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item – Edds Nov 15 '21 at 09:53

Count unique values, ignore the spelling

2 Answers2