Some days I just hate using middleware. Take this for example: I'd like a lookup table that maps each value from a set of input (domain) values to an output (range) value. The mapping is unique. A Python dict can do this, but since the map is quite big I figured: why not use a pd.Series and its index, which has the added benefit that I can:

  • pass in multiple values to be mapped as a series (hopefully faster than dictionary lookup)
  • rely on the original series' index being maintained in the result

like so:

# alldomainvals, allrangevals, domainvals, someindex stand in for my real data
domain2range = pd.Series(allrangevals, index=alldomainvals)
# Apply the map
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)
assert result.index is someindex                    # Nice
assert np.isin(result.values, allrangevals).all()   # Nice
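
For concreteness, here is a minimal runnable version of that sketch; the domain/range values below are made up for illustration, not my real data:

import numpy as np
import pandas as pd

alldomainvals = ['a', 'b', 'c']
allrangevals = [1, 2, 3]
domain2range = pd.Series(allrangevals, index=alldomainvals)

someindex = pd.Index([10, 20])
query_vals = pd.Series(['c', 'a'], index=someindex)
result = query_vals.map(domain2range)

assert list(result.index) == [10, 20]               # original index preserved
assert np.isin(result.values, allrangevals).all()   # all outputs are range values
print(result)   # 10 -> 3, 20 -> 1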

Works as expected. Except that it doesn't: the time cost of the .map above grows with len(domain2range), not (more sensibly) with len(query_vals), as can be shown:

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    rangevals = domain + 10   # renamed to avoid shadowing the builtin range
    maptable = pd.Series(rangevals, index=domain).sort_index()

    query_vals = pd.Series([1, 2, 3])
    def f():
        query_vals.map(maptable)
    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)


10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453

Facepalm. At n=10000000 it's taking roughly (0.016/3) seconds per mapped value.

So, questions:

  • is Series.map expected to behave like this? Why is it so utterly, ridiculously slow? I think I'm using it as shown in the docs.
  • is there a fast way to use pandas to do table lookups? It seems like the above is not it.
  • @TemporalWolf I'm gonna say that if it takes (1.44/3) seconds per query value, it's not optimized for anything. – user48956 May 31 '18 at 23:08
  • The above code is runnable and provides the data. The maptable maps values from 0..n ==> 10..n+10, and the query is always [1,2,3]. – user48956 May 31 '18 at 23:21
  • You do realize that passing `number=10` to `timeit` is measuring the time to execute 10 times, right? – miradulo May 31 '18 at 23:24
  • OK. Divide by 10. Still unreasonable – user48956 May 31 '18 at 23:26
  • I've updated the question with correct timings. Still very very slow. – user48956 May 31 '18 at 23:31
  • Your comparison is a bit flawed. You can't simply divide by 3 and call that the time per value. You're paying a massive overhead only mapping 3 values against an enormous dictionary. – ALollz May 31 '18 at 23:33
  • For instance, change `query_vals = pd.Series(np.tile([1,2,3],n))` and you'll see that your calculated time per value doesn't scale as you would expect – ALollz May 31 '18 at 23:34
  • @ALollz Sure -- but I pay the same overhead when n=10. Costs are increasing significantly as the size of the map table increases. This is not right. – user48956 May 31 '18 at 23:34
  • Speedups: In this particular case, use `maptable.iloc[query_vals]`. Though I imagine you're more interested in the general case - use `query_vals.map(maptable.get)`, or replace `maptable` with a dict and use `.get` to speed it up further (see the sketch after these comments). – miradulo May 31 '18 at 23:52
  • Hashing NumPy dtypes is much slower than hashing Python integers. Compare `hash(0)` to `hash(np.int32(0))` for instance. – miradulo Jun 01 '18 at 00:00
  • Perhaps -- but that doesn't explain the behavior. As n grows, the cost approaches O(mapsize). It should be MUCH less: O(len_query) for a hashed lookup, or O(len_query*log(mapsize)) for a binary search on a sorted index. In the tested function, the hash would only need to be applied to the 3 query values, not the whole map. – user48956 Jun 01 '18 at 00:04
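
To make the suggested alternatives concrete, here is a sketch of my own using the same synthetic maptable/query_vals setup as the benchmark above. It assumes the domain is 0..n-1, so label and position coincide for the .iloc variant:

import numpy as np
import pandas as pd

n = 1000000
maptable = pd.Series(np.arange(n) + 10, index=np.arange(n))
query_vals = pd.Series([1, 2, 3])

# 1. Positional lookup: valid here only because label == position.
r1 = maptable.iloc[query_vals].reset_index(drop=True)

# 2. Callable lookup, as suggested: maptable.get is applied per query value.
r2 = query_vals.map(maptable.get)

# 3. Plain dict lookup via dict.get, as suggested.
r3 = query_vals.map(maptable.to_dict().get)

assert list(r1) == list(r2) == list(r3) == [11, 12, 13]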

1 Answer


https://github.com/pandas-dev/pandas/issues/21278

Warmup was the issue (double facepalm). Pandas silently builds and caches a hash index over the map at first use (O(maplen)). Calling the tested function once beforehand, so the index is prebuilt, gets much better performance:

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 100000, 1000000, 10000000]:
    domain = np.arange(0, n)
    rangevals = domain + 10
    maptable = pd.Series(rangevals, index=domain)  # .sort_index()

    query_vals = pd.Series([1, 2, 3])

    def f1():
        query_vals.map(maptable)
    f1()   # warmup: triggers the one-off hash-index build
    print("Pandas1 ", n, timeit.timeit(stmt=f1, number=numiter) / numiter)

    def f2():
        query_vals.map(maptable.get)
    f2()   # warmup
    print("Pandas2 ", n, timeit.timeit(stmt=f2, number=numiter) / numiter)

    maptabledict = maptable.to_dict()
    query_vals_list = pd.Series([1, 2, 3]).tolist()

    def f3():
        {k: maptabledict[k] for k in query_vals_list}
    f3()   # warmup
    print("Py dict ", n, timeit.timeit(stmt=f3, number=numiter) / numiter)
    print()

pd.show_versions()
Pandas1  10 0.000621199607849
Pandas2  10 0.000686831474304
Py dict  10 2.0170211792e-05

Pandas1  100000 0.00149286031723
Pandas2  100000 0.00118808984756
Py dict  100000 8.47816467285e-06

Pandas1  1000000 0.000708899497986
Pandas2  1000000 0.000479419231415
Py dict  1000000 1.64794921875e-05

Pandas1  10000000 0.000798969268799
Pandas2  10000000 0.000410139560699
Py dict  10000000 1.47914886475e-05

... although it's a little depressing that, per the numbers above, plain Python dictionaries beat both pandas approaches by well over an order of magnitude.
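
If the dict route wins anyway, one way to keep the question's index-preservation requirement is to look the raw values up in the dict and rewrap the result with the original index. A minimal sketch of my own (the small map below is made up for illustration):

import pandas as pd

maptabledict = {d: d + 10 for d in range(10)}   # hypothetical small map
query_vals = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Look up raw values in the dict, then restore the original index.
result = pd.Series([maptabledict[v] for v in query_vals.tolist()],
                   index=query_vals.index)
print(result)   # a -> 11, b -> 12, c -> 13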
