Using itertools.combinations for 32k rows will certainly make your code slow. Here is an approach using numpy instead of pandas, demonstrated on a smaller dataset, to address the following objectives:
- Implement a function for grouping company names by some condition (a predicate)
- Make the function(s) faster than the posted implementation
Use this post as a means to attack your problem from a different angle.
Given
Here we build a small list of company names A, B, C and Aa:
import itertools as it
import collections as ct
import numpy as np
companies = "A B C Aa".split()
Code
Step 1
First we will create a 2D array whose horizontal and vertical indices are the same company names. The cells of the matrix hold the merged company names:
# 1. Build a 2D array of joined strings
def get_2darray(seq):
    """Return a 2D array of joined strings with identical axes."""
    x = np.array(seq)
    y = np.array(seq)
    xx = x[:, np.newaxis]                       # column vector, shape (n, 1)
    yy = y[np.newaxis, :]                       # row vector, shape (1, n)
    return np.core.defchararray.add(xx, yy)     # broadcasted concatenation; ref 001
Demo
arr = get_2darray(companies)
arr
# array([['AA', 'AB', 'AC', 'AAa'],
# ['BA', 'BB', 'BC', 'BAa'],
# ['CA', 'CB', 'CC', 'CAa'],
# ['AaA', 'AaB', 'AaC', 'AaAa']],
# dtype='<U4')
Step 2
Second we implement a group function for enumerating similar companies. Given a 2D array, a helper function (func) will be used to "transform" each element to a group number:
# 2. Group companies by "similarity", e.g. "AB" == "BA"
def group(func, arr, pred=None, verbose=False):
    """Return an array of items enumerated by similarity."""
    if pred is None:
        # Default: set the diagonal (merged identical names) to zero
        pred = lambda x: len(set(x)) != len(x)
    dd = ct.defaultdict(it.count().__next__)
    dd[""] = np.nan
    # opt_func = np.vectorize(func)             # optional, slower
    opt_func = np.frompyfunc(func, 3, 1)        # ref 002
    m = opt_func(arr, dd, pred)
    if verbose:
        print(dd)
    return m
def transform(item, lookup, pred):
    """Return a unique group number element-wise."""
    unique_idx = "".join(sorted(item.lower()))
    name_group = lookup[unique_idx]
    if pred(item):
        return 0
    else:
        return name_group
Demo
groups = group(transform, arr, verbose=True)
groups
# defaultdict(<method-wrapper '__next__' of itertools.count object at 0x00000000062BE408>,
# {'': nan, 'aaa': 3, 'aac': 8, 'ab': 1,
# 'cc': 7, 'aa': 0, 'bc': 5, 'aaaa': 9,
# 'ac': 2, 'bb': 4, 'aab': 6})
# array([[0, 1, 2, 0],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [0, 6, 8, 0]], dtype=object)
Each company name is grouped with a unique number.
Step 3
You can now access the group number for two companies by slicing the groups
array:
# 3. Lookup the group number shared by companies
reversed_lookup = {v:k for k, v in enumerate(companies)}
def group_number(arr, a, b):
    """Return the name_group of companies `a` and `b` in 2D array `arr`."""
    i, j = reversed_lookup[a], reversed_lookup[b]
    return arr[i, j]
for pair in [["B", "C"], ["A", "B"], ["C", "C"]]:   # avoid shadowing `companies`
    msg = "Companies {}: group {}"
    print(msg.format(" & ".join(pair), group_number(groups, *pair)))
# Companies B & C: group 5
# Companies A & B: group 1
# Companies C & C: group 0
Details
Step 1
Why use arrays? A numpy array allows for fast lookups just like pandas. In addition, we can later optimize performance using operations implemented at the C level (these are fast).
Why merge company names in the array? The 2D array of merged strings is used for comparing company names. This way of comparison resembles a statistical correlation matrix.
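As a minimal sketch of the broadcasting at work inside get_2darray: pairing a column vector against a row vector concatenates every name with every other name in one C-level operation (np.char.add is the modern alias for np.core.defchararray.add):

```python
import numpy as np

# Broadcasting a (n, 1) column against a (1, n) row yields an (n, n)
# matrix where each cell is the concatenation of its two axis labels.
x = np.array(["A", "B"])
col = x[:, np.newaxis]          # shape (2, 1)
row = x[np.newaxis, :]          # shape (1, 2)
pairs = np.char.add(col, row)   # elementwise string concatenation
print(pairs)
# [['AA' 'AB']
#  ['BA' 'BB']]
```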
Step 2
How are groups determined? The company names are passed to a special dictionary (dd) that only assigns an incremented integer when a new key is found. This dictionary is used to track groups as the transform helper function is applied to each element.
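The auto-numbering behavior of this tracking dictionary can be seen in isolation:

```python
import itertools as it
import collections as ct

# defaultdict calls it.count().__next__ only for missing keys, so each
# new key receives the next integer while repeated keys reuse theirs.
dd = ct.defaultdict(it.count().__next__)
print(dd["ab"], dd["ac"], dd["ab"])  # 0 1 0
```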
Why use a helper function? The transform function converts each item in the array to a group number. Notice the tracking dictionary (lookup) is passed in along with a predicate. Here are some notes about these group parameters:
- The keys of the tracking dictionary are made by lowering and sorting a given string. This technique is used internally for equating strings with swapped company names. For example, merged companies "AB" and "BA" should belong to the same group.
- The predicate is determined by the user. If no predicate is given (pred=None), a default predicate is applied, which naively flags merged strings built from identical names (particularly along the diagonal).
You may wish to use another predicate. For example, under the default predicate, any lowered strings with the same character set are equivalent, such that A == Aa == AaAa (see how the corners of the array are assigned to group 0). Here is another sample predicate that distinguishes A from Aa (groups 0 and 3 respectively):
pred = lambda x: all(not(v%2) for k, v in ct.Counter(x).items())
group(transform, arr, pred)
# array([[0, 1, 2, 3],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [3, 6, 8, 0]], dtype=object)
How is performance optimized? Some operations are vectorized to help speed up the code using C implementations. In the group function, numpy.frompyfunc wraps the helper function. It was determined that this particular "universal function" is faster than the vectorizing function numpy.vectorize. See also the Scipy Lecture Notes for more ways to optimize numpy code.
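To illustrate how frompyfunc wraps a three-argument helper (the function `shout` below is a hypothetical stand-in for transform): the resulting ufunc maps over the array element-wise while broadcasting the scalar extra arguments, just as group does with the lookup dictionary and predicate.

```python
import numpy as np

# frompyfunc(func, nin, nout) turns a Python function into a ufunc
# that loops over array elements; nin=3 matches func(item, extra1, extra2).
def shout(item, suffix, flag):
    return item.upper() + suffix if flag else item

ufunc = np.frompyfunc(shout, 3, 1)
arr = np.array([["ab", "cd"], ["ef", "gh"]])
result = ufunc(arr, "!", True)  # "!" and True broadcast to every element
print(result)
```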
Step 3
How to find a group number for two companies? This is done simply by slicing the array returned from the group function. group_number is a slicing function for querying the array. Since the indices from Step 2 are numeric, we build a reversed dictionary from our starting, ordered sequence companies to find the numeric index corresponding to a company name. Note, the reversed dictionary is built outside of the slicing function to avoid rebuilding the dictionary on every query.
Performance
How fast is it? For our simple sequence of < 10 rows, the speed is sub-milliseconds:
%timeit group(transform, arr)
# 10000 loops, best of 3: 110 µs per loop
For demonstration, let's scale the data up to around 1000 rows (beyond that, even creating the dataset takes a long time and consumes significant memory).
test = tuple(map(str, range(1000)))
full_arr = get_2darray(test)
print(full_arr.shape)
full_arr
# (1000, 1000)
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['10', '11', '12', ..., '1997', '1998', '1999'],
# ['20', '21', '22', ..., '2997', '2998', '2999'],
# ...,
# ['9970', '9971', '9972', ..., '997997', '997998', '997999'],
# ['9980', '9981', '9982', ..., '998997', '998998', '998999'],
# ['9990', '9991', '9992', ..., '999997', '999998', '999999']],
# dtype='<U6')
%timeit group(transform, full_arr)
# 1 loop, best of 3: 5.3 s per loop
Save some computation time by evaluating only half of the matrix:
half_arr = np.triu(full_arr)
half_arr
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['', '11', '12', ..., '1997', '1998', '1999'],
# ['', '', '22', ..., '2997', '2998', '2999'],
# ...,
# ['', '', '', ..., '997997', '997998', '997999'],
# ['', '', '', ..., '', '998998', '998999'],
# ['', '', '', ..., '', '', '999999']],
# dtype='<U6')
%timeit group(transform, half_arr)
# 1 loop, best of 3: 3.61 s per loop
Note: profiling was not performed on a dataset of 32k rows.
Conclusions
In this approach, the aforementioned objectives were achieved by:
- separating data munging and evaluation of a small dataset into the Steps 1 and 2.
- analysis by slicing the final numpy array of grouped companies in Step 3.
Consider numpy for optimizing comparison functions at the C level. While the performance tests in this post may still take time, numpy offers headroom for further optimizations. Moreover, it is probable that this code, as is, will take less than 8 hours on the OP's dataset. Further profiling is required to assess the complexity of this approach. If the complexity is reasonable, the user may decide how to proceed, e.g. parallel processing on multiple threads. Such tasks are left to those who may be interested.
References