3

Given the following list of lists

arrayNumbers = [[32,3154,53,13],[44,34,25,67], [687,346,75], [57,154]]

how can I efficiently get the number of lists having only 4 items?

In this case, that would be arrayNumbers_len = 2. I can do this using a loop, but that is not efficient at all. Since the lengths of my real arrays are in the millions, I need a way to do this extremely fast.

Here is my current solution:

batchSize = 4
counter = 0
for i in range(len(arrayNumbers)):
    if (len(arrayNumbers[i]) == batchSize):
        counter += 1

Any suggestions?

RaduS
  • That's a perfectly reasonable Python solution, and you aren't going to get around a loop (either explicitly or implicitly). – miradulo Feb 28 '18 at 18:52
  • Why is this tagged with `numpy`? – juanpa.arrivillaga Feb 28 '18 at 18:52
  • because I thought that maybe there is a way to do it with numpy – RaduS Feb 28 '18 at 18:53
  • But you are working with `list` objects, not `numpy.ndarray` objects, and anyway, you probably couldn't work easily with `numpy.ndarray` objects with "jagged" arrays... – juanpa.arrivillaga Feb 28 '18 at 18:54
  • I see; well, if there is no way around it, then I will just use my current solution – RaduS Feb 28 '18 at 18:56
  • @RaduS Sounds like an XY problem. Why do you have an array of this strange shape? If the source of the problem can be handled by numpy, it's possible that you don't need this strange array at all. – llllllllll Feb 28 '18 at 18:58
  • @RaduS see also https://stackoverflow.com/questions/10713004/find-length-of-2d-array-python. I think this is what you are looking for. Sample `print "Number of rows: {}".format(len(arrayNumbers))`. – Thanos Feb 28 '18 at 19:03
  • @RaduS Which version of Python are you using? If it's Python 2, use a list comprehension. If it's 3, use Prune's solution. See my timing results below. – pault Feb 28 '18 at 19:58
  • I use Python 3.5.4 – RaduS Feb 28 '18 at 20:00

6 Answers

3

Would this be of acceptable performance?

len([l for l in arrayNumbers if len(l) == 4])

If this is still too slow, you can write the algorithm in C or C++ and call it from your Python code. See here for details: https://docs.python.org/3.6/extending/extending.html
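
For reference, run against the sample list from the question this gives:

>>> arrayNumbers = [[32, 3154, 53, 13], [44, 34, 25, 67], [687, 346, 75], [57, 154]]
>>> len([l for l in arrayNumbers if len(l) == 4])
2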

DBedrenko
  • @miradulo My thoughts are that `filter`, being part of the core Python library, would be well optimized. At least more optimized than manually writing the loop at a "low level" in Python (or reinventing `filter()` yourself in Python). Ideally `filter()` is one of the functions that's implemented in C in CPython. – DBedrenko Feb 28 '18 at 18:57
  • `filter` + `lambda` will tend to be slow. `filter` + built-in function will perform nicely. List comprehensions would be better in this case, since the predicate isn't a built-in, and you can avoid the function-call overhead of calling the lambda a million times (which is high in Python). – juanpa.arrivillaga Feb 28 '18 at 18:58
  • @juanpa.arrivillaga Thanks for the tip, I learnt something today :) – DBedrenko Feb 28 '18 at 19:46
3

Sorry, but just in raw Information Science terms, you're stuck with an O(N) problem, where N is the number of elements in your list. You have to access each length to test it against batchSize. With that, however, we can stuff it into a one-liner that gives Python a chance to optimize as best it can:

map(len, arrayNumbers).count(4)
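
Applied to the question's sample list (wrapping the map in list() for Python 3, as the comments below note), this gives:

>>> list(map(len, arrayNumbers)).count(4)
2
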
Prune
  • In Python 3.x, I believe you need `list(map(...))`. But I don't think creating a list is necessary (see my solution). – jpp Feb 28 '18 at 19:01
  • @jpp yeah but this will almost certainly be faster than anything involving a generator expression. Generators tend to be slow (there's a lot of overhead to generators). Check out my benchmarks – juanpa.arrivillaga Feb 28 '18 at 19:25
  • Yup, you are probably right. Both generators and lists have overheads; I will need to do more research to see which has more and why. – jpp Feb 28 '18 at 19:26
  • The OP's hope was that by using `numpy` the **O(N)** problem could be performed with compiled code, which runs orders of magnitude faster than interpreted Python. But `numpy` doesn't provide this kind of building block. – hpaulj Feb 28 '18 at 19:37
3

I went ahead and did some timings to show how these different approaches vary.

Note: these timings are in Python 3, and var_arr has a million randomly sized sublists.
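
The construction of var_arr isn't shown in the answer; a minimal sketch that matches the description (the 0-10 length range is an assumption) would be:

# one million sublists with lengths drawn uniformly from 0 to 10 (assumed range)
import random
var_arr = [['x'] * random.randint(0, 10) for _ in range(10**6)]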

In [31]: def for_loop(var_arr, batchsize):
    ...:     count = 0
    ...:     for x in var_arr:
    ...:         if len(x) == batchsize:
    ...:             count += 1
    ...:     return count
    ...:

In [32]: def with_map_count(var_arr, batchsize):
    ...:     return list(map(len, var_arr)).count(batchsize)
    ...:

In [33]: def lambda_filter(var_arr, batchsize):
    ...:     return len(list(filter(lambda l: len(l) == batchsize, var_arr)))
    ...:

In [34]: def sum_gen(var_arr, batchsize):
    ...:     return sum(len(x) == batchsize for x in var_arr)
    ...:

In [35]: from collections import Counter
    ...: def with_counter(var_arr, batchsize):
    ...:     return Counter(map(len, var_arr)).get(batchsize, 0)
    ...:

In [36]: %timeit for_loop(var_arr, 4)
82.9 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [37]: %timeit with_map_count(var_arr, 4)
48 ms ± 873 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [38]: %timeit lambda_filter(var_arr, 4)
172 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [39]: %timeit sum_gen(var_arr, 4)
150 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: %timeit with_counter(var_arr, 4)
75.6 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Some more timings:

In [50]: def with_list_comp_filter(var_arr, batchsize):
    ...:     return len([x for x in var_arr if len(x) == batchsize])
    ...:
    ...: def with_list_comp_filter_map(var_arr, batchsize):
    ...:     return len([x for x in map(len, var_arr) if x == batchsize])
    ...:
    ...: def loop_with_map(var_arr, batchsize):
    ...:     count = 0
    ...:     for x in map(len, var_arr):
    ...:         count += x == batchsize
    ...:     return count
    ...:

In [51]: %timeit with_list_comp_filter(var_arr, 4)
87.8 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %timeit with_list_comp_filter_map(var_arr, 4)
62.7 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [53]: %timeit loop_with_map(var_arr, 4)
91.9 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
juanpa.arrivillaga
  • What about the list comprehension `len([l for l in arrayNumbers if len(l) == 4])` (DBedrenko's 2nd solution)? I ran some tests and that came out to be the fastest. Would be good if you could independently confirm. – pault Feb 28 '18 at 19:14
  • @pault it's definitely not, the `list(map(len, arr)).count(4)` is the fastest so far – juanpa.arrivillaga Feb 28 '18 at 19:18
  • Interesting. I reran my tests with a million elements and I get the list comp to be fastest with the `list(map(len, arr)).count(4)` coming in second. Here is how I made my test array: `arrayNumbers = [[np.random.randint(0, 10)]*np.random.randint(0, 10) for _ in range(1000000)]` – pault Feb 28 '18 at 19:22
  • @pault are you using Python 2? – juanpa.arrivillaga Feb 28 '18 at 19:23
  • Yes python2. I was just about to add that to my comment. – pault Feb 28 '18 at 19:23
  • @pault yes, then use `map(len, arr).count(4)`; using `list` makes you iterate over the million elements again unnecessarily (a testament to its speed, given that it still comes in second) – juanpa.arrivillaga Feb 28 '18 at 19:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/165992/discussion-between-pault-and-juanpa-arrivillaga). – pault Feb 28 '18 at 19:25
2

I ran my own tests in Python 2, and it appears that the list comprehension (@DBedrenko's updated solution) is the fastest, with @Prune's `map(len, arrayNumbers).count(4)` coming in second:

import numpy as np

nLists = 1000000
arrayNumbers = [[np.random.randint(0, 10)]*np.random.randint(0, 10) for _ in range(nLists)]
batchSize = 4
In [67]:

%%timeit
counter = 0
for i in range(len(arrayNumbers)):
    if (len(arrayNumbers[i]) == batchSize):
        counter += 1
10 loops, best of 3: 108 ms per loop
In [68]:

%%timeit
map(len, arrayNumbers).count(4)
10 loops, best of 3: 65.7 ms per loop
In [69]:

%%timeit
len(list(filter(lambda l: len(l) == 4, arrayNumbers)))
10 loops, best of 3: 121 ms per loop
In [70]:

%%timeit
len([l for l in arrayNumbers if len(l) == 4])
10 loops, best of 3: 58.6 ms per loop
In [71]:

%%timeit
sum(len(i)==4 for i in arrayNumbers)
10 loops, best of 3: 97.8 ms per loop
pault
  • So, Python 2's list-comprehension implementation relies on a hack which essentially shares the local scope wherever it is (which is why it has "leaky" scope). In Python 3, it gets its own scope. This may be the cause of the difference, but I'm not deeply enough aware of the implementation details to even speculate, really. – juanpa.arrivillaga Feb 28 '18 at 19:47
2

Here is a numpy solution. It is only marginally slower than the best of the non-numpy answers. One advantage is that one could get the counts for all lengths at minimal additional cost, unless there are ridiculously large sublists:

>>> import numpy as np
>>> from timeit import timeit
>>> from collections import Counter
>>> 
>>> lengths = np.random.randint(0, 100, (100_000))
>>> lists = [l * ['x'] for l in lengths]
>>> 
>>> 
# count one
# best Python
>>> list(map(len, lists)).count(16)
974
# numpy
>>> np.count_nonzero(16==np.fromiter(map(len, lists), int, len(lists)))
974
>>> 
# count all
# best Python
>>> [cc for c, cc in sorted(Counter(map(len, lists)).items())]
[973, 1007, 951, 962, 1039, 962, 1028, 999, 970, 997,
 ... 
 1039, 997, 976, 1028, 1026, 969, 1106, 994, 1002, 1022]
>>> 
# numpy
>>> np.bincount(np.fromiter(map(len, lists), int, len(lists)))
array([ 973, 1007,  951,  962, 1039,  962, 1028,  999,  970,  997,
       ...
       1039,  997,  976, 1028, 1026,  969, 1106,  994, 1002, 1022])

Timings:

>>> kwds = dict(globals=globals(), number=100)
>>> 
>>> timeit('list(map(len, lists)).count(16)', **kwds)
0.38265155197586864
>>> timeit('np.count_nonzero(16==np.fromiter(map(len, lists), int, len(lists)))', **kwds)
0.4332483590114862
>>> 
>>> timeit('Counter(map(len, lists))', **kwds)
0.673214758047834
>>> timeit('np.bincount(np.fromiter(map(len, lists), int, len(lists)))', **kwds)
0.43800772598478943
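
Since np.bincount indexes counts by length, a single length can also be read straight off the all-lengths result, e.g.:

>>> counts = np.bincount(np.fromiter(map(len, lists), int, len(lists)))
>>> counts[16]
974
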
Paul Panzer
1

Using filter with an anonymous function:

>>> Numbers = [[32,3154,53,13],[44,34,25,67],[687,346,75],[57,154]]
>>> filter(lambda x: len(x) == 4, Numbers)
[[32, 3154, 53, 13], [44, 34, 25, 67]]
>>> len(filter(lambda x: len(x) == 4, Numbers))
2
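
Note that this transcript shows Python 2 behaviour; in Python 3 (the asker's 3.5.4), filter returns a lazy iterator, so the equivalent would be along the lines of:

>>> len(list(filter(lambda x: len(x) == 4, Numbers)))
2
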
Rodrigo de Azevedo