How to standardize the format of elements in a list from big data

Question

I am trying to count the unique values in the following list without using `collections`:

('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')

The output I need is:

('TOILET':2,'AIR CONDITIONiNGS':3)

My current code is:

for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print number

Is it possible to get the output?

Assuming that `number` is a dictionary prior to the loop that should be fine... The output you expect isn't valid syntax... what isn't working/what are you getting instead? (Also - your `tuple` example isn't valid syntax either - and somehow your `i` has become lowercase in the expected results...) — Jon Clements, Oct 14 '17 at 15:16
With my current code the result is ('TOILET':1,'TOILETS':1,'AIR CONDITIONING':1,'AIR-CONDITIONINGS':1,'AIR-CONDITIONING':1) – Gaming, Oct 14 '17 at 15:19
Which is to be expected: TOILET and TOILETS aren't the same string, and nor are AIR CONDITIONING, AIR-CONDITIONINGS and AIR-CONDITIONING... Your issue isn't with counting the frequency of the data. You need to standardise your data somehow first... – Jon Clements, Oct 14 '17 at 15:21
@Gaming. Then it's not unique elements that you are trying to count. You have to explain in excruciating detail what it means for two items to be the same in that case. — Mad Physicist, Oct 14 '17 at 15:23
Oh yes, standardize the data. Is there any way to deal with this on big data? – Gaming, Oct 14 '17 at 15:27
Maybe use string similarity as explored in [this SO Q&A](https://stackoverflow.com/q/17388213/2823755) - you will need to determine *how similar* they must be to be the same. But it might get messy comparing all the combinations. — wwii, Oct 14 '17 at 15:49

MarianD · Answer 1 · 2017-10-14T17:16:13.380

original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING', 
            'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}

First, making a set from the original list (or tuple) gives you all of its values without duplicates.

Then you create a dictionary whose keys come from that set and whose values are the number of occurrences of each key in the original list (or tuple), obtained with the count() method.
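As a minimal sketch of what this approach produces, assuming the data from the question (with 'TOILET' as the first element) rather than the tuple above: set() and count() compare exact strings, so each spelling variant keeps its own entry.

```python
# Data as given in the question; every variant is a distinct string.
original = ('TOILET', 'TOILETS', 'AIR CONDITIONING',
            'AIR-CONDITIONINGS', 'AIR-CONDITIONING')

# One dictionary entry per distinct string, counted with tuple.count().
result_dict = {element: original.count(element) for element in set(original)}

# Print in sorted key order, since set iteration order is arbitrary.
for key in sorted(result_dict):
    print(key, result_dict[key])
```

Each count() call scans the whole sequence, so on large inputs a single pass that updates a dictionary is likely to be cheaper.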

Ajax1234 · Answer 2 · 2017-10-14T15:32:10.643


You can try this:

import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
# Replace each run of non-word characters (e.g. '-') with a single space.
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print(new_data)
final_data = {}
for i in new_data:
    # Existing keys that the current item starts with.
    s = [b for b in final_data if i.startswith(b)]
    if s:
        final_data[s[0]] += 1
    else:
        final_data[i] = 1

print(final_data)

Output:

{'TOILETS': 2, 'AIR CONDITIONING': 3}

answered Oct 14 '17 at 15:23 · edited Oct 14 '17 at 15:32

You're building a list to check it's not empty, then building the list again to take the first element... huh? – Jon Clements Oct 14 '17 at 15:27
Okay... now try with `data = ('T', 'TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')`... – Jon Clements Oct 14 '17 at 15:29
Perhaps... although the OP needs to work out what `if i.startswith(b)` should be for their purposes given their data... – Jon Clements Oct 14 '17 at 15:34
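The counterexample from the comments can be reproduced with a short sketch. The function name `prefix_count` is hypothetical; it only wraps the prefix-matching logic of this answer so it can be run on different inputs.

```python
import re

def prefix_count(data):
    """Count items, merging each item into the first existing key it starts with."""
    # Replace each run of non-word characters with a single space.
    cleaned = [re.sub(r"\W+", " ", item) for item in data]
    counts = {}
    for item in cleaned:
        matches = [key for key in counts if item.startswith(key)]
        if matches:
            counts[matches[0]] += 1
        else:
            counts[item] = 1
    return counts

data = ('T', 'TOILETS', 'TOILETS', 'AIR CONDITIONING',
        'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
print(prefix_count(data))  # both 'TOILETS' entries are absorbed by 'T'
```

The result also depends on input order: with ('TOILETS', 'TOILET') the shorter form arrives second, does not start with 'TOILETS', and gets its own key.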

score 0 · Answer 3 · answered Oct 14 '17 at 15:28

I don't believe the Python list has an easy built-in way to do what you are asking. It does, however, have a count method that tells you how many times a specific element occurs in the list. Example:

some_list = ['a', 'a', 'b', 'c']
some_list.count('a')  #=> 2

Usually the way to get what you want is to build up a dictionary of counts by taking advantage of the dict.get(key, default) method:

some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    # get() returns 0 for an element that has not been seen yet.
    counts[el] = counts.get(el, 0) + 1
counts  #=> {'a': 2, 'b': 1, 'c': 1}

score 0 · Answer 4 · edited Oct 14 '17 at 17:21


a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}

for i in a:
    b.setdefault(i,0)
    b[i] += 1

You can use this code, but as Jon Clements said, TOILET and TOILETS aren't the same string, so you must standardize them first.
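One way to make such variants compare equal before counting is to normalize every string first. The sketch below is only an illustration under assumptions: the helper `normalize` is hypothetical, and its rules (treat punctuation as a space, drop a single trailing 'S') are guesses that happen to fit the sample data from the question.

```python
import re

def normalize(label):
    """Hypothetical normalization: unify separators and drop one trailing 'S'."""
    # Treat any run of punctuation or whitespace as a single space.
    label = re.sub(r"[\W_]+", " ", label.upper()).strip()
    # Naive plural handling (an assumption, not a general rule).
    if label.endswith("S"):
        label = label[:-1]
    return label

a = ['TOILET', 'TOILETS', 'AIR CONDITIONING',
     'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    key = normalize(i)
    b[key] = b.get(key, 0) + 1

print(b)
```

Because each element is normalized and counted once, this stays a single pass over the data. The trailing-'S' rule would also mangle words such as 'GLASS' or 'BUS', so the rules need to be adapted to the real data.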

answered Oct 14 '17 at 15:32 by Nick.Tao · edited Oct 14 '17 at 17:21 by Josh Karpel

wwii · Accepted Answer · 2017-10-14T17:02:32.440

Using difflib.get_close_matches to help determine uniqueness

import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1

The actual keys in the dictionary will depend on the order of the words in the list.
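To illustrate the order dependence, here is a small sketch that wraps the loop in a function (the name `fuzzy_count` is hypothetical) and runs it on the same words in original and in reversed order.

```python
import difflib

def fuzzy_count(words, cutoff=0.6):
    """Count words, merging each one into the closest existing key, if any."""
    counts = {}
    for word in words:
        similar = difflib.get_close_matches(word, list(counts), n=1, cutoff=cutoff)
        if similar:
            counts[similar[0]] += 1
        else:
            counts[word] = 1
    return counts

a = ('TOILET', 'TOILETS', 'AIR CONDITIONING',
     'AIR-CONDITIONINGS', 'AIR-CONDITIONING')

forward = fuzzy_count(a)
backward = fuzzy_count(tuple(reversed(a)))
print(forward)   # keys are the first spelling seen in the original order
print(backward)  # same counts, but keyed by the last spellings instead
```

The counts agree in both directions for this data; only the spelling that ends up as the key changes.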

difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
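To get a feel for the ratio behind the cutoff, the following sketch computes it directly for a few words against one key. The dictionary name `ratios` is just for illustration.

```python
import difflib

key = 'AIR CONDITIONING'
ratios = {}
for word in ('AIR-CONDITIONINGS', 'AIR-CONDITIONING', 'TOILET'):
    # ratio() is 2*M/T: M matched characters, T total length of both strings.
    ratios[word] = difflib.SequenceMatcher(None, word, key).ratio()

for word, ratio in ratios.items():
    print(word, round(ratio, 3), ratio >= 0.6)
```

The two air-conditioning variants score well above 0.6, while 'TOILET' cannot reach it: even if all six of its characters matched, the ratio would be at most 12/22.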

If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.

import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1

Results:

>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>

I imagine there are Python packages that do this sort of thing and may be better optimized.
