Find most common string in a 2D list

Question

I have a 2D list:

arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

I want to find the most frequent element in the 2D list. In the above example, the most common string is 'Mohit'.

I know I can use brute force using two for loops and a dictionary to do this, but is there a more efficient way using numpy or any other library?

The nested lists could be of different lengths

Can someone also add the time of their methods? To find the fasted method. Also the caveats at which it might not be very efficient.

Edit

These are the timings of different methods on my system:

#timegb
%%timeit
collections.Counter(chain.from_iterable(arr)).most_common(1)[0][0]
5.91 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Kevin Fang and Curious Mind
%%timeit
flat_list = [item for sublist in arr for item in sublist]
collections.Counter(flat_list).most_common(1)[0]
6.42 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
c = collections.Counter(item for sublist in arr for item in sublist).most_common(1)c[0][0]
6.79 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Mayank Porwal
def most_common(lst):
    return max(set(lst), key=lst.count)
%%timeit
ls = list(chain.from_iterable(arr))
most_common(ls)
2.33 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#U9-Forward
%%timeit
l=[x for i in arr for x in i]
max(l,key=l.count)
2.6 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Mayank Porwal's method runs the fastest on my system.

How do you define 'frequent`, occurs most number of times or occurs in most number of sub-lists/rows? — Kevin Fang, Nov 23 '18 at 05:46
Any constraints on number of elements in the nested array (n) vs. number of nested array (m). i.e. m >> n or n << m? — bigdata2, Nov 23 '18 at 05:53
@bigdata2 not really. The 2D list is unlikely to be very big. Even the elements within the list. — Mohit Motwani, Nov 23 '18 at 05:57
I advise against the name `arr`, you have a list here, while the name `arr` suggests an `array.array` or a `numpy.array` (for example). — timgeb, Nov 23 '18 at 06:04
@timgeb Thank you. Will keep that in mind. Was just using it as an example. — Mohit Motwani, Nov 23 '18 at 06:05
@MohitMotwani the timings kinda depend on the length of the list and the number of unique elements in it. The max(set... solution is fast for lists with few unique elements — timgeb, Nov 23 '18 at 06:31
@timegb yes. But in my use case, I'll have few unique elements because many will be filtered out because of regex conditioning. — Mohit Motwani, Nov 23 '18 at 06:36
@MohitMotwani if your list is small, like 10 elements as shown above, you can throw all timings out of the window anyway. Unless you need to perform that operation millions of times. — timgeb, Nov 23 '18 at 06:44
@MohitMotwani Please check my edited answer which includes timings. I created the array 100 times bigger than yours, and the code runs pretty fast. — Mayank Porwal, Nov 23 '18 at 06:46

Kevin Fang · Answer 1 · 2018-11-23T06:12:28.847

4

I'd suggest flatten out the 2D Array and then use a counter to find out the most frequent element.

flat_list = [item for sublist in arr for item in sublist]
from collections import Counter
Counter(flat_list).most_common(1)[0]
# ('Mohit', 3)
Counter(flat_list).most_common(1)[0][0]
# 'Mohit'

Not sure if it is the fastest approach though.

Edit:

@timgeb's answer has a faster way to flatten the list using itertools.chain

A more space efficient way suggested by @schwobaseggl:

from collections import Counter
c = Counter(item for sublist in arr for item in sublist).most_common(1)
# [('Mohit', 3)]
c[0][0]
# 'Mohit'

edited Nov 23 '18 at 06:12

answered Nov 23 '18 at 05:52

Kevin Fang

1,966
2
16
31

1

You might also be more space-efficient by using a nested generator expression instead of a list. Rather store the counter in a variable as you need it either way, while you do not need the list. – user2390182 Nov 23 '18 at 06:02
@KevinFang Thank you. This works. Can you also add the time taken by your method? It'd be easier to check the fastest method. – Mohit Motwani Nov 23 '18 at 06:07

timgeb · Accepted Answer · 2018-11-23T06:36:16.590

Flatten the list with itertools.chain.from_iterable
Apply a Counter.

Demo:

>>> from itertools import chain
>>> from collections import Counter
>>> 
>>> lst = [['Mohit', 'shini','Manoj','Mot'],
...:      ['Mohit', 'shini','Manoj'],
...:      ['Mohit', 'Vis', 'Nusrath']]
...:      
>>> Counter(chain.from_iterable(lst)).most_common(1)[0][0]
'Mohit'

Details:

>>> list(chain.from_iterable(lst))
['Mohit',
 'shini',
 'Manoj',
 'Mot',
 'Mohit',
 'shini',
 'Manoj',
 'Mohit',
 'Vis',
 'Nusrath']
>>> Counter(chain.from_iterable(lst))
Counter({'Manoj': 2, 'Mohit': 3, 'Mot': 1, 'Nusrath': 1, 'Vis': 1, 'shini': 2})
>>> Counter(chain.from_iterable(lst)).most_common(1)
[('Mohit', 3)]

Some timings:

>>> lst = lst*100
>>> %timeit Counter(chain.from_iterable(lst)).most_common(1)[0][0] # timgeb
53.7 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit max([x for i in lst for x in i], key=l.count) # U9-Forward
207 µs ± 389 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit Counter([x for sublist in lst for x in sublist]).most_common(1)[0][0] # Curious_Mind/Kevin Fang #1
75.2 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit Counter(item for sublist in lst for item in sublist).most_common(1)[0][0] # Kevin Fang #2
95.2 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit flat = list(chain.from_iterable(lst)); max(set(flat), key=flat.count) # Mayank Porwal
98.4 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(Note that Kevin Fang's second solution is a bit slower than the first one, but more memory efficient.)

Yeah looks like `chain` is the fastest flatten method from [this question](https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists) — Kevin Fang, Nov 23 '18 at 05:58
@timegb Thank you. This works. Can you also add the time taken by your method? It'd be easier to check the fastest method. — Mohit Motwani, Nov 23 '18 at 06:08
@MohitMotwani all the timings need to be done on the same computer because only then they are comparable. I will do some timings in a minute. — timgeb, Nov 23 '18 at 06:09

Mayank Porwal · Answer 3 · 2018-11-23T06:40:38.533

2

Something like this:

In [920]: from itertools import chain
In [923]: arr = list(chain.from_iterable(arr)) ## flatten into 1-D array
In [922]: def most_common(lst):
     ...:     return max(set(lst), key=lst.count)

In [924]: most_common(arr)
Out[924]: 'Mohit'

Timings:

from itertools import chain
import time
start_time = time.time()

arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]


arr = list(chain.from_iterable(arr))
arr = arr*100

def most_common(lst):
    return max(set(lst), key=lst.count)

print(most_common(arr))
print("--- %s seconds ---" % (time.time() - start_time))

mayankp@mayank:~$ python t1.py 
Mohit
--- 0.000154972076416 seconds ---

edited Nov 23 '18 at 06:40

answered Nov 23 '18 at 05:54

Mayank Porwal

33,470
8
37
58

Thank you. This works. Can you also add the time taken by your method? It'd be easier to check the fastest method. – Mohit Motwani Nov 23 '18 at 06:09
@MohitMotwani Edited the timings too. Also note `arr= arr*100` in my code. – Mayank Porwal Nov 23 '18 at 06:40

A l w a y s S u n n y · Answer 4 · 2018-11-23T06:22:44.687

2

One way to do it this way,

import collections
import time
start_time = time.time()
arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

c = collections.Counter([x for sublist in arr for x in sublist])
print(c.most_common(1) )
print("--- %s seconds ---" % (time.time() - start_time))

Time taken: 0.00016713142395 seconds

DEMO: http://tpcg.io/NH3zjm

edited Nov 23 '18 at 06:22

answered Nov 23 '18 at 05:57

A l w a y s S u n n y

36,497
8
60
103

1

@Curios_Mind Thank you. This works. Can you also add the time taken by your method? It'd be easier to check the fastest method. – Mohit Motwani Nov 23 '18 at 06:09
1

@MohitMotwani **EDITED** – A l w a y s S u n n y Nov 23 '18 at 06:23

score 1 · Answer 5 · answered Nov 23 '18 at 05:59

Or why not:

l=[x for i in arr for x in i]
max(l,key=l.count)

Code example:

>>> arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]
>>> l=[x for i in arr for x in i]
>>> max(l,key=l.count)
'Mohit'
>>>

Find most common string in a 2D list

5 Answers5