3

Given the following list of lists

arrayNumbers = [[32,3154,53,13],[44,34,25,67], [687,346,75], [57,154]]

how can I efficiently get the number of lists having only 4 items?

In this case, that would be arrayNumbers_len = 2. I can do this using a loop, but that is not efficient at all. Since the lengths of my real arrays are in the millions, I need a way to do this extremely fast.

Here is my current solution:

batchSize = 4
counter = 0
for i in range(len(arrayNumbers)):
    if (len(arrayNumbers[i]) == batchSize):
        counter += 1

Any suggestions?

RaduS
  • That's a perfectly reasonable Python solution, and you aren't going to get around a loop (either explicitly or implicitly). – miradulo Feb 28 '18 at 18:52
  • Why is this tagged with `numpy`? – juanpa.arrivillaga Feb 28 '18 at 18:52
  • because I thought that maybe there is a way to do it with numpy – RaduS Feb 28 '18 at 18:53
  • But you are working with `list` objects, not `numpy.ndarray` objects, and anyway, you probably couldn't work easily with `numpy.ndarray` objects with "jagged" arrays... – juanpa.arrivillaga Feb 28 '18 at 18:54
  • I see; well, if there is no way around it, then I will just use my current solution – RaduS Feb 28 '18 at 18:56
  • @RaduS Sounds like an XY problem. Why do you have an array of this strange shape? If the source of the problem can be handled by numpy, it's possible that you don't need this strange array at all. – llllllllll Feb 28 '18 at 18:58
  • @RaduS see also https://stackoverflow.com/questions/10713004/find-length-of-2d-array-python. I think this is what you are looking for. Sample `print "Number of rows: {}".format(len(arrayNumbers))`. – Thanos Feb 28 '18 at 19:03
  • @RaduS Which version of Python are you using? If it's Python 2, use a list comprehension. If it's 3, use Prune's solution. See my timing results below. – pault Feb 28 '18 at 19:58
  • I use Python 3.5.4 – RaduS Feb 28 '18 at 20:00

6 Answers

3

Would this be of acceptable performance?

len([l for l in arrayNumbers if len(l) == 4])

If this is still too slow, you can write the algorithm in C or C++ and call it from your Python code. See here for details: https://docs.python.org/3.6/extending/extending.html
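
For reference, run against the sample list from the question this gives:

>>> arrayNumbers = [[32, 3154, 53, 13], [44, 34, 25, 67], [687, 346, 75], [57, 154]]
>>> len([l for l in arrayNumbers if len(l) == 4])
2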

DBedrenko
  • @miradulo My thoughts are that `filter`, being part of the core Python library, would be well optimized. At least more optimized than manually writing the loop at a "low level" in Python (or reinventing `filter()` yourself in Python). Ideally `filter()` is one of the functions that's implemented in C in CPython. – DBedrenko Feb 28 '18 at 18:57
  • `filter` + `lambda` will tend to be slow. `filter` + built-in function will perform nicely. List comprehensions would be better in this case, since the predicate isn't a built-in, and you can avoid the function-call overhead of calling the lambda a million times (which is high in Python). – juanpa.arrivillaga Feb 28 '18 at 18:58
  • @juanpa.arrivillaga Thanks for the tip, I learnt something today :) – DBedrenko Feb 28 '18 at 19:46
3

Sorry, but just in raw Information Science terms, you're stuck with an O(N) problem, where N is the number of elements in your list. You have to access each length to test it against batchSize. With that, however, we can stuff it into a one-liner that gives Python a chance to optimize as best it can:

map(len, arrayNumbers).count(4)
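
Applied to the question's sample list (wrapping the map in list() for Python 3, as the comments below note), this gives:

>>> list(map(len, arrayNumbers)).count(4)
2
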
Prune
  • In Python 3.x, I believe you need `list(map(...))`. But I don't think creating a list is necessary (see my solution). – jpp Feb 28 '18 at 19:01
  • @jpp yeah but this will almost certainly be faster than anything involving a generator expression. Generators tend to be slow (there's a lot of overhead to generators). Check out my benchmarks – juanpa.arrivillaga Feb 28 '18 at 19:25
  • Yup, you are probably right. Both generators and lists have overheads; I will need to do more research to see which has more and why. – jpp Feb 28 '18 at 19:26
  • The OP's hope was that by using `numpy` the **O(N)** problem could be performed with compiled code, which runs orders of magnitude faster than interpreted Python. But `numpy` doesn't provide this kind of building block. – hpaulj Feb 28 '18 at 19:37
3

I went ahead and did some timings to show how these different approaches vary.

Note: these timings are in Python 3, and var_arr has a million randomly sized sublists.
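
The construction of var_arr isn't shown in the answer; a minimal sketch that matches the description (the 0-10 length range is an assumption) would be:

# one million sublists with lengths drawn uniformly from 0 to 10 (assumed range)
import random
var_arr = [['x'] * random.randint(0, 10) for _ in range(10**6)]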

In [31]: def for_loop(var_arr, batchsize):
    ...:     count = 0
    ...:     for x in var_arr:
    ...:         if len(x) == batchsize:
    ...:             count += 1
    ...:     return count
    ...:

In [32]: def with_map_count(var_arr, batchsize):
    ...:     return list(map(len, var_arr)).count(batchsize)
    ...:

In [33]: def lambda_filter(var_arr, batchsize):
    ...:     return len(list(filter(lambda l: len(l) == batchsize, var_arr)))
    ...:

In [34]: def sum_gen(var_arr, batchsize):
    ...:     return sum(len(x) == batchsize for x in var_arr)
    ...:

In [35]: from collections import Counter
    ...: def with_counter(var_arr, batchsize):
    ...:     return Counter(map(len, var_arr)).get(batchsize, 0)
    ...:

In [36]: %timeit for_loop(var_arr, 4)
82.9 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [37]: %timeit with_map_count(var_arr, 4)
48 ms ± 873 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [38]: %timeit lambda_filter(var_arr, 4)
172 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [39]: %timeit sum_gen(var_arr, 4)
150 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: %timeit with_counter(var_arr, 4)
75.6 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Some more timings:

In [50]: def with_list_comp_filter(var_arr, batchsize):
    ...:     return len([x for x in var_arr if len(x) == batchsize])
    ...:
    ...: def with_list_comp_filter_map(var_arr, batchsize):
    ...:     return len([x for x in map(len, var_arr) if x == batchsize])
    ...:
    ...: def loop_with_map(var_arr, batchsize):
    ...:     count = 0
    ...:     for x in map(len, var_arr):
    ...:         count += x == batchsize
    ...:     return count
    ...:

In [51]: %timeit with_list_comp_filter(var_arr, 4)
87.8 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %timeit with_list_comp_filter_map(var_arr, 4)
62.7 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [53]: %timeit loop_with_map(var_arr, 4)
91.9 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
juanpa.arrivillaga
  • What about the list comprehension `len([l for l in arrayNumbers if len(l) == 4])` (DBedrenko's 2nd solution)? I ran some tests and that came out to be the fastest. Would be good if you could independently confirm. – pault Feb 28 '18 at 19:14
  • @pault it's definitely not, the `list(map(len, arr)).count(4)` is the fastest so far – juanpa.arrivillaga Feb 28 '18 at 19:18
  • Interesting. I reran my tests with a million elements and I get the list comp to be fastest with the `list(map(len, arr)).count(4)` coming in second. Here is how I made my test array: `arrayNumbers = [[np.random.randint(0, 10)]*np.random.randint(0, 10) for _ in range(1000000)]` – pault Feb 28 '18 at 19:22
  • @pault are you using Python 2? – juanpa.arrivillaga Feb 28 '18 at 19:23
  • Yes python2. I was just about to add that to my comment. – pault Feb 28 '18 at 19:23
  • @pault yes, then use `map(len, arr).count(4)`; using `list` makes you iterate over the million elements again unnecessarily (a testament to its speed, given that it still comes in second) – juanpa.arrivillaga Feb 28 '18 at 19:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/165992/discussion-between-pault-and-juanpa-arrivillaga). – pault Feb 28 '18 at 19:25
2

I ran my own tests in Python 2, and it appears that the list comprehension (@DBedrenko's updated solution) is the fastest, with @Prune's `map(len, arrayNumbers).count(4)` coming in second:

import numpy as np

nLists = 1000000
arrayNumbers = [[np.random.randint(0, 10)]*np.random.randint(0, 10) for _ in range(nLists)]
batchSize = 4
In [67]:

%%timeit
counter = 0
for i in range(len(arrayNumbers)):
    if (len(arrayNumbers[i]) == batchSize):
        counter += 1
10 loops, best of 3: 108 ms per loop
In [68]:

%%timeit
map(len, arrayNumbers).count(4)
10 loops, best of 3: 65.7 ms per loop
In [69]:

%%timeit
len(list(filter(lambda l: len(l) == 4, arrayNumbers)))
10 loops, best of 3: 121 ms per loop
In [70]:

%%timeit
len([l for l in arrayNumbers if len(l) == 4])
10 loops, best of 3: 58.6 ms per loop
In [71]:

%%timeit
sum(len(i)==4 for i in arrayNumbers)
10 loops, best of 3: 97.8 ms per loop
pault
  • So, Python 2's list-comprehension implementation relies on a hack which essentially shares the local scope wherever it is (which is why it has "leaky" scope). In Python 3, it gets its own scope. This may be the cause of the difference, but I'm not deeply enough aware of the implementation details to even speculate, really. – juanpa.arrivillaga Feb 28 '18 at 19:47
2

Here is a numpy solution. It is only marginally slower than the best of the non-numpy answers. One advantage is that one could get the counts for all lengths at minimal additional cost, unless there are ridiculously large sublists:

>>> import numpy as np
>>> from timeit import timeit
>>> from collections import Counter
>>> 
>>> lengths = np.random.randint(0, 100, (100_000))
>>> lists = [l * ['x'] for l in lengths]
>>> 
>>> 
# count one
# best Python
>>> list(map(len, lists)).count(16)
974
# numpy
>>> np.count_nonzero(16==np.fromiter(map(len, lists), int, len(lists)))
974
>>> 
# count all
# best Python
>>> [cc for c, cc in sorted(Counter(map(len, lists)).items())]
[973, 1007, 951, 962, 1039, 962, 1028, 999, 970, 997,
 ... 
 1039, 997, 976, 1028, 1026, 969, 1106, 994, 1002, 1022]
>>> 
# numpy
>>> np.bincount(np.fromiter(map(len, lists), int, len(lists)))
array([ 973, 1007,  951,  962, 1039,  962, 1028,  999,  970,  997,
       ...
       1039,  997,  976, 1028, 1026,  969, 1106,  994, 1002, 1022])

Timings:

>>> kwds = dict(globals=globals(), number=100)
>>> 
>>> timeit('list(map(len, lists)).count(16)', **kwds)
0.38265155197586864
>>> timeit('np.count_nonzero(16==np.fromiter(map(len, lists), int, len(lists)))', **kwds)
0.4332483590114862
>>> 
>>> timeit('Counter(map(len, lists))', **kwds)
0.673214758047834
>>> timeit('np.bincount(np.fromiter(map(len, lists), int, len(lists)))', **kwds)
0.43800772598478943
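
Since np.bincount indexes counts by length, a single length can also be read straight off the all-lengths result, e.g.:

>>> counts = np.bincount(np.fromiter(map(len, lists), int, len(lists)))
>>> counts[16]
974
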
Paul Panzer
1

Using filter with an anonymous function:

>>> Numbers = [[32,3154,53,13],[44,34,25,67],[687,346,75],[57,154]]
>>> filter(lambda x: len(x) == 4, Numbers)
[[32, 3154, 53, 13], [44, 34, 25, 67]]
>>> len(filter(lambda x: len(x) == 4, Numbers))
2
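
Note that this transcript shows Python 2 behaviour; in Python 3 (the asker's 3.5.4), filter returns a lazy iterator, so the equivalent would be along the lines of:

>>> len(list(filter(lambda x: len(x) == 4, Numbers)))
2
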
Rodrigo de Azevedo