Using itertools.combinations for 32k rows will certainly make your code slow. Here is an approach using numpy instead of pandas, demonstrated on a smaller dataset, to address the following objectives:
- Implement a function for grouping company names by some condition (a predicate)
- Make the function(s) faster than the posted implementation
Use this post as a means to attack your problem from a different angle.
Given
Here we build a small list of company names A, B, C and Aa:
import itertools as it
import collections as ct
import numpy as np
companies = "A B C Aa".split()
Code
Step 1
First we will create a 2D array whose horizontal and vertical indices are the same company names. The cells of the matrix hold the merged company names:
# 1. Build a 2D array of joined strings
def get_2darray(seq):
    """Return a 2D array of joined strings with identical axes."""
    x = np.array(seq)
    y = np.array(seq)
    xx = x[:, np.newaxis]                       # column vector, shape (n, 1)
    yy = y[np.newaxis, :]                       # row vector, shape (1, n)
    return np.core.defchararray.add(xx, yy)     # broadcasted concatenation; ref 001
Demo
arr = get_2darray(companies)
arr
# array([['AA', 'AB', 'AC', 'AAa'],
# ['BA', 'BB', 'BC', 'BAa'],
# ['CA', 'CB', 'CC', 'CAa'],
# ['AaA', 'AaB', 'AaC', 'AaAa']],
# dtype='<U4')
Step 2
Second we implement a group function for enumerating similar companies. Given a 2D array, a helper function (func) will be used to "transform" each element to a group number:
# 2. Group companies by "similarity", e.g. "AB" == "BA"
def group(func, arr, pred=None, verbose=False):
    """Return an array of items enumerated by similarity."""
    if pred is None:
        # Default: set the diagonal (merged identical names) to zero
        pred = lambda x: len(set(x)) != len(x)
    dd = ct.defaultdict(it.count().__next__)
    dd[""] = np.nan
    # opt_func = np.vectorize(func)             # optional, slower
    opt_func = np.frompyfunc(func, 3, 1)        # ref 002
    m = opt_func(arr, dd, pred)
    if verbose:
        print(dd)
    return m
def transform(item, lookup, pred):
    """Return a unique group number element-wise."""
    unique_idx = "".join(sorted(item.lower()))
    name_group = lookup[unique_idx]
    if pred(item):
        return 0
    else:
        return name_group
Demo
groups = group(transform, arr, verbose=True)
groups
# defaultdict(<method-wrapper '__next__' of itertools.count object at 0x00000000062BE408>,
# {'': nan, 'aaa': 3, 'aac': 8, 'ab': 1,
# 'cc': 7, 'aa': 0, 'bc': 5, 'aaaa': 9,
# 'ac': 2, 'bb': 4, 'aab': 6})
# array([[0, 1, 2, 0],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [0, 6, 8, 0]], dtype=object)
Each company name is grouped with a unique number.
Step 3
You can now access the group number for two companies by slicing the groups
array:
# 3. Lookup the group number shared by companies
reversed_lookup = {v:k for k, v in enumerate(companies)}
def group_number(arr, a, b):
    """Return the name_group of companies `a` and `b` in 2D array `arr`."""
    i, j = reversed_lookup[a], reversed_lookup[b]
    return arr[i, j]
for pair in [["B", "C"], ["A", "B"], ["C", "C"]]:   # avoid shadowing `companies`
    msg = "Companies {}: group {}"
    print(msg.format(" & ".join(pair), group_number(groups, *pair)))
# Companies B & C: group 5
# Companies A & B: group 1
# Companies C & C: group 0
Details
Step 1
Why use arrays? A numpy array allows for fast lookups just like pandas. In addition, we can later optimize performance using operations implemented at the C level (these are fast).
Why merge company names in the array? The 2D array of merged strings is used for comparing company names. This way of comparison resembles a statistical correlation matrix.
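As a minimal sketch of the broadcasting at work inside get_2darray: pairing a column vector against a row vector concatenates every name with every other name in one C-level operation (np.char.add is the modern alias for np.core.defchararray.add):

```python
import numpy as np

# Broadcasting a (n, 1) column against a (1, n) row yields an (n, n)
# matrix where each cell is the concatenation of its two axis labels.
x = np.array(["A", "B"])
col = x[:, np.newaxis]          # shape (2, 1)
row = x[np.newaxis, :]          # shape (1, 2)
pairs = np.char.add(col, row)   # elementwise string concatenation
print(pairs)
# [['AA' 'AB']
#  ['BA' 'BB']]
```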
Step 2
How are groups determined? The company names are passed to a special dictionary (dd) that only assigns an incremented integer when a new key is found. This dictionary is used to track groups as the transform helper function is applied to each element.
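The auto-numbering behavior of this tracking dictionary can be seen in isolation:

```python
import itertools as it
import collections as ct

# defaultdict calls it.count().__next__ only for missing keys, so each
# new key receives the next integer while repeated keys reuse theirs.
dd = ct.defaultdict(it.count().__next__)
print(dd["ab"], dd["ac"], dd["ab"])  # 0 1 0
```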
Why use a helper function? The transform function converts each item in the array to a group number. Notice the tracking dictionary (lookup) is passed in along with a predicate. Here are some notes about these group parameters:
- The keys of the tracking dictionary are made by lowering and sorting a given string. This technique is used internally for equating strings with swapped company names. For example, merged companies "AB" and "BA" should belong to the same group.
- The predicate is determined by the user. If no predicate is given (pred=None), a default predicate is applied, which naively flags merged strings built from identical names (particularly along the diagonal).
You may wish to use another predicate. For example, under the default predicate, any lowered strings with the same character set are equivalent, such that A == Aa == AaAa (see how the corners of the array are assigned to group 0). Here is another sample predicate that distinguishes A from Aa (groups 0 and 3 respectively):
pred = lambda x: all(not(v%2) for k, v in ct.Counter(x).items())
group(transform, arr, pred)
# array([[0, 1, 2, 3],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [3, 6, 8, 0]], dtype=object)
How is performance optimized? Some operations are vectorized to help speed up the code using C implementations. In the group function, numpy.frompyfunc wraps the helper function. It was determined that this particular "universal function" is faster than the vectorizing function numpy.vectorize. See also the Scipy Lecture Notes for more ways to optimize numpy code.
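To illustrate how frompyfunc wraps a three-argument helper (the function `shout` below is a hypothetical stand-in for transform): the resulting ufunc maps over the array element-wise while broadcasting the scalar extra arguments, just as group does with the lookup dictionary and predicate.

```python
import numpy as np

# frompyfunc(func, nin, nout) turns a Python function into a ufunc
# that loops over array elements; nin=3 matches func(item, extra1, extra2).
def shout(item, suffix, flag):
    return item.upper() + suffix if flag else item

ufunc = np.frompyfunc(shout, 3, 1)
arr = np.array([["ab", "cd"], ["ef", "gh"]])
result = ufunc(arr, "!", True)  # "!" and True broadcast to every element
print(result)
```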
Step 3
How to find a group number for two companies? This is done simply by slicing the array returned from the group function. group_number is a slicing function for querying the array. Since the indices from Step 2 are numeric, we build a reversed dictionary from our starting, ordered sequence companies to find the numeric index corresponding to a company name. Note, the reversed dictionary is built outside of the slicing function to avoid rebuilding the dictionary on every query.
Performance
How fast is it? For our simple sequence of < 10 rows, the speed is sub-milliseconds:
%timeit group(transform, arr)
# 10000 loops, best of 3: 110 µs per loop
For demonstration, let's scale the data up to around 1000 rows (beyond that, even creating the dataset takes a long time and consumes significant memory).
test = tuple(map(str, range(1000)))
full_arr = get_2darray(test)
print(full_arr.shape)
full_arr
# (1000, 1000)
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['10', '11', '12', ..., '1997', '1998', '1999'],
# ['20', '21', '22', ..., '2997', '2998', '2999'],
# ...,
# ['9970', '9971', '9972', ..., '997997', '997998', '997999'],
# ['9980', '9981', '9982', ..., '998997', '998998', '998999'],
# ['9990', '9991', '9992', ..., '999997', '999998', '999999']],
# dtype='<U6')
%timeit group(transform, full_arr)
# 1 loop, best of 3: 5.3 s per loop
Save some computation time by evaluating only half of the matrix:
half_arr = np.triu(full_arr)
half_arr
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['', '11', '12', ..., '1997', '1998', '1999'],
# ['', '', '22', ..., '2997', '2998', '2999'],
# ...,
# ['', '', '', ..., '997997', '997998', '997999'],
# ['', '', '', ..., '', '998998', '998999'],
# ['', '', '', ..., '', '', '999999']],
# dtype='<U6')
%timeit group(transform, half_arr)
# 1 loop, best of 3: 3.61 s per loop
Note: profiling was not performed on a dataset of 32k rows.
Conclusions
In this approach, the aforementioned objectives were achieved by:
- separating data munging and evaluation of a small dataset into the Steps 1 and 2.
- analysis by slicing the final numpy array of grouped companies in Step 3.
Consider numpy for optimizing comparison functions at the C level. While the performance tests in this post may still take time, numpy offers headroom for further optimizations. Moreover, it is probable that this code, as is, will take less than 8 hours on the OP's dataset. Further profiling is required to assess the complexity of this approach. If the complexity is reasonable, the user may decide how to proceed, e.g. parallel processing on multiple threads. Such tasks are left to those who may be interested.
References