3

First of all, sorry for my imperfect English.

My problem is simple to explain, I think.

result = {}
list_tuple = [(float, float, float), (float, float, float), (float, float, float)...]  # 200k tuples
threshold = [float, float, float...]  # max 1k values

for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)

list_tuple contains around 200k tuples, and I have to do this operation very fast (2-3 seconds max on a normal PC).

My first attempt was to do this in Cython with prange() (so I could benefit from both the Cython optimization and the parallel execution), but the problem is, as always, the GIL: inside prange() I can handle lists and tuples using Cython memoryviews, but I can't insert my results into a dict.

In Cython I also tried using the C++ std unordered_map, but then the problem is that I can't make a vector of arrays in C++ (which would be the value type of my dict).

The second problem is similar:

list_tuple = [((float, float), (float, float)), ((float, float), (float, float))...]  # 200k tuples of tuples

result = {}

for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        result[tuple[0]] = [tuple]

Here I also have another problem: if I want to use prange(), I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.

As you can see, my snippets are very simple to run in parallel.

I thought about trying Numba, but it will probably be the same because of the GIL, and I prefer Cython because I need compiled binary code (this library could be part of a commercial software product, so only binary libraries are allowed).

In general I would like to avoid C/C++ functions; what I hope to find is a way to handle something like dicts/lists in parallel, with Cython performance, remaining as much as possible in the Python domain; but I'm open to any advice.

Thanks

  • You can take a look into `numpy` since your problem can be vectorized. First compute the indices of thresholds per tuple and then you can create the dict from the indices. This will yield a significant speedup. – a_guest Jul 27 '18 at 13:13
  • `multiprocessing` can parallelise this task - https://docs.python.org/2/library/multiprocessing.html. – Shiva Jul 27 '18 at 13:18
  • Instead of `result={}, .../lines/of/code..result[value]=[]; result[value].append(tuple)`, you can do `result=defaultdict(list)..../lines/of/code....result[value].append(tuple)`. `defaultdict` is available in the `collections` module - `from collections import defaultdict` (see the short sketch after these comments). – Shiva Jul 27 '18 at 13:21
  • @a_guest I did not know what vectorization was; I think you are speaking about this: https://en.wikipedia.org/wiki/Array_programming. Can you explain better how numpy can help me with a vectorized problem? Can numpy solve my problem on multiple cores? – Nicolò Monica Jul 27 '18 at 13:25
  • @Shiva Yes, I considered this option, but I think I cannot use the native multiprocessing library with Cython, and the speedup from single-core Cython (so with the GIL, dicts, etc.) is greater than from multithreaded plain Python. About the collections defaultdict you are right, I will use that. – Nicolò Monica Jul 27 '18 at 13:28
  • Since a dictionary is such an integral part of Python, it might be hard to parallelize it (`nogil`) or to speed up the operations with `c++`. At best you can use the C-API dictionary calls. – hpaulj Jul 27 '18 at 15:35
  • I don't think even using the C++ `unordered_map` can help - [it isn't designed to have multiple threads writing in parallel](https://stackoverflow.com/a/9685609/4657412) – DavidW Jul 27 '18 at 16:23
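A minimal sketch of the defaultdict suggestion from the comment above, applied to the question's first loop (the inputs here are hypothetical and only illustrate the shape of the data):

from collections import defaultdict

# Hypothetical small inputs in the question's format.
list_tuple = [(0.1, 0.5, 0.9), (0.2, 0.3, 0.4)]
threshold = [0.25, 0.6]

result = defaultdict(list)
for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            # No membership test needed: a missing key starts out as an empty list.
            result[value].append(tuple)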

3 Answers

1

Several performance improvements can be achieved, in particular by using numpy's vectorization features:

  1. The min and max values are currently computed anew for each threshold. Instead they can be precomputed and then reused for each threshold.
  2. The loop over data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.

In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,) as indicated in the OP. I also omitted modifications to the result dict since depending on the data this can quickly overflow memory.

Applying 1.

# Precompute the min/max once per sample instead of recomputing them for every threshold.
v_min = [min(t) for t in data]
v_max = [max(t) for t in data]
for mi, ma in zip(v_min, v_max):
    for value in thresh:
        if ma > value and mi < value:
            pass  # result handling omitted (see note above)

This yields a performance increase of a factor of ~ 5 compared to the OP's code.

Applying 1. & 2.

import numpy as np

# Vectorized min/max over the samples; the explicit loop now only runs over the ~1k thresholds.
v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass  # result handling omitted (see note above)

This yields a performance increase of a factor of ~ 30 compared to the OP's code. This approach has the additional benefit that it doesn't contain incremental appends to the lists, which can slow down the program since memory reallocation might be required. Instead it creates each list (per threshold) in a single step.
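If the per-threshold lists are actually needed (and fit into memory), a minimal sketch of how the result dict could be built with this masking approach, under the same assumed data shapes, might look like this:

import numpy as np

# Hypothetical data in the shapes stated above.
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)

v_min = data.min(axis=1)
v_max = data.max(axis=1)

result = {}
for t in thresh:
    # Select all samples whose value range straddles this threshold.
    samples = data[(v_min < t) & (v_max > t)]
    if samples.size > 0:
        # One list per threshold, created in a single step.
        result[t] = samples.tolist()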

a_guest
0

Edit

Since this approach basically performs an outer product between data samples and threshold values, it increases the required memory significantly, which might be undesired. An improved approach can be found here (the first answer above). I keep this answer nevertheless for future reference, since it is referred to in the answer below.

I found the performance increase as compared to the OP's code to be a factor of ~ 20.


This is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, as opposed to the OP's example, and hence might require an additional cleaning step, if appropriate.

import numpy as np

# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)

# condition[i, j] is True if thresh[j] lies strictly between the min and max of sample i.
condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
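As noted above, this dict also contains empty lists for thresholds that no sample straddles; if those entries are unwanted, one possible cleanup step (a sketch, not part of the original code) is:

# Drop thresholds for which no sample matched.
result = {v: samples for v, samples in result.items() if samples}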
a_guest
  • Your code might speed up the calculation of the `tuples` `condition`, but it doesn't vectorize the dictionary creation. And typically iteration over an array is slower than iteration over a list (that would apply to the `zip` iteration as well). – hpaulj Jul 27 '18 at 15:32
  • @hpaulj Do you think dict creation is the bottleneck here? The OP's code contains *two* *explicit* Python `for` loops with ~ 200,000,000 iterations which are both vectorized and then reduced to *one* `for` loop consisting of ~ 1,000 iterations (note the factor is the number of samples which is `>>` number of thresholds). Even if array iteration is slower than list iteration this will hardly play a role for the few iterations performed here. Also please feel free to compare the performance of the two solutions in order to convince yourself. – a_guest Jul 27 '18 at 17:11
  • I added some comparisons. For your small example the OP is faster, but for a much larger case yours is faster. – hpaulj Jul 27 '18 at 22:57
0

@a_guest's code:

import numpy as np

def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result

This code creates a dictionary entry once for each item in thresh.

The OP's code, simplified a bit with defaultdict (from collections):

from collections import defaultdict

def foo3(list_tuple, threshold):
    result = defaultdict(list)
    for tuple in list_tuple:
        for value in threshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result

This one updates a dictionary entry once for each item that meets the criteria.

And with his sample data:
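The small sample data is not shown in the answer; a setup that is consistent with the outputs below would be, for instance (an assumption, not the original test data):

import numpy as np

# Assumed small example: two samples, five thresholds.
data = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
thresh = np.arange(5)              # [0, 1, 2, 3, 4]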

In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})

time tests:

In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Iteration over arrays is slower than over lists. The time for tolist() is minimal; np.asarray on lists takes longer.

With a larger data sample, the array version is faster:

In [42]: data = np.random.randint(0,50,(3000,3))
    ...: thresh = np.arange(50)
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist() 
    ...: foo3(x,y)
    ...: 
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
hpaulj
  • The performance gain will increase with the number of data samples, as this is the loop which is completely vectorized. The loop over `thresh` is maintained and a vectorized version of this loop is added. According to the OP `thresh.size == 1000` so the added overhead is small. Of course if the number of samples is small too then the overhead (including conversions) dominates. But since `data.shape == (200000, 3)` (according to OP) the performance gain from this loop dominates. Compared to your "big data" example I'd expect the performance gain to increase even more drastically. – a_guest Jul 28 '18 at 09:05