
I have a data frame with a shape of (789174, 9). There is a column called resolution that contains a sentence that is less than 139 characters in length. I built a function that uses difflib to find sentences with a similarity score above 0.9. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for more than 2 hours now and it still has not reached i = 1000. I am concerned that this will take far too long to process, and I am wondering if there is a way to speed it up.

import difflib
import time

def replace_similars(input_list):
    # Replace strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Clearly, since this is a nested iteration over the column, it is O(n^2). I am not sure if there is a way to make this faster. Any suggestions would be greatly appreciated.

EDIT:

I have attempted a speed-up using difflib and fuzzywuzzy. The function only goes through the column once, but I do iterate through the dictionary keys.

import difflib
from fuzzywuzzy import fuzz

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters
mappings = cluster_resolution(df_sample)

Is it possible to speed up the latter function?

Here is an example of some data in a dataframe:

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}

df = pd.DataFrame(data=d)

How I define similarity:

Similarity is really defined by the overall action taken, such as 'replaced scanner' and 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'. The longer string's overall action was replacing the scanner, so those two are very similar, which is why I chose to use the partial_ratio function, since that pair scores 100 with it.
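
To illustrate, here is a minimal sketch of that check (assuming fuzzywuzzy is installed; the variable names are only for this example):

from fuzzywuzzy import fuzz

short_res = 'replaced scanner'
long_res = ('replaced the scanner for the user with a properly working one from the cage '
            'replaced the wire on the damaged one and stored it for later use')

# partial_ratio aligns the shorter string against best-matching substrings of the
# longer one, which is why the shared "replaced ... scanner" action dominates the score
print(fuzz.partial_ratio(short_res, long_res))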

Attention:

Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function is no longer useful.

justanewb
  • You're planning to do 600 billion sentence comparisons. Of course it's not going to work. You need to rethink your entire strategy. Perhaps you can find a way to assign a score to each sentence, then sort and compare the numbers. – Tim Roberts Mar 02 '21 at 03:47
  • And, by the way, no matter how many CPUs you have, this code is only using one. If each sentence comparison took 10 milliseconds, this would take 200 years to run. – Tim Roberts Mar 02 '21 at 03:48
  • @TimRoberts I have tried various clustering approaches but that does not get the job done. I thought this would be a good approach; I know I'll still end up doing some manual work. Would parallelization help? – justanewb Mar 02 '21 at 04:00
  • The problem is not the code, it's the concept. You're still comparing each sentence to all the sentences that came before. It's still O(n^2). If you've sped up the sentence comparison from 10ms to 1ms, now it will only take 20 years. – Tim Roberts Mar 04 '21 at 04:43
  • 3
    can you provide a small (~25 examples) set of the data? Also, what else can you say about how you define similarity? – anon01 Mar 04 '21 at 06:32
  • How about trying some pre-processing (like a bag-of-words algorithm) to select potential matches before using fuzzywuzzy? As it is vectorized, it could be faster. But as @anon01 said, it would be useful to have a sample of the dataset. – tgrandje Mar 04 '21 at 08:02
  • 1
    @anon01 I added an example set of data. Let me know if you want to see more examples. I also added how I define similarity. – justanewb Mar 04 '21 at 17:02
  • cool will come back to take a look – anon01 Mar 04 '21 at 22:59
  • 1
    it will be very slow to calculate an affinity matrix (or any full pairwise comparison metric) - it is probably not the best approach. Is it is accurate to say that want you are trying to optimize is `f(test_sentences) -> < set of similar sentences>`? – anon01 Mar 06 '21 at 07:05

2 Answers

def replace_similars(input_list):
    # Replace strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1, len(input_list)):
            # length pre-filter: strings whose lengths differ by 15 or more
            # characters are unlikely to reach a 0.9 ratio, so skip them
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Even though this might not be a practical solution either, since it is still going to take about 90 years if every iteration takes 0.1 s, it is a much more optimised solution.
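
For reference, a minimal usage sketch on the question's dataframe (the new column name here is just illustrative); mapping the unique values first keeps the number of comparisons down:

unique_resolutions = df['resolution'].unique().tolist()
mapping = generate_mapping(unique_resolutions)
df['resolution_canonical'] = df['resolution'].map(mapping)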

Divyessh
  • @justanewb To be honest, I do not recommend this in general, but for iterating through data this large and getting a result in less than a year, I strongly recommend using `C` or `C++` or other languages faster than `python3`. – Divyessh Mar 04 '21 at 10:19
  • Thanks for the answer, but could you look at the new code? I was able to speed it up using that approach. – justanewb Mar 04 '21 at 17:12

Regarding your last edit, I'd make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz):

import difflib
from fuzzywuzzy import fuzz, process

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():        
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            # keep at most one existing cluster key whose partial_ratio is at least 80
            bests = process.extractBests(
                    string, 
                    set(clusters.keys())-{string},
                    scorer=fuzz.partial_ratio,
                    score_cutoff=80,
                    limit=1
                    )
            
            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [ string ]

But I think you could look into other solutions, like CountVectorizer and whatever metric suits it. It is a way to gain speed (as it is vectorized), though the results may be imperfect. Note that CountVectorizer could be a good fit for you, given that you have already chosen partial_ratio.

For example, something like this:

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan

df = pd.DataFrame(d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=cv.get_feature_names(),
        index=df['resolution'])

# keep only columns whose total count is greater than 2
transformed = transformed[transformed.columns[transformed.sum()>2]]

# compute the distance matrix (hamming proportion * number of columns = count of differing tokens)
dist = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(dist)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))

I think this can still be improved (this is my first shot at text clustering...). You could also add your own list of stop words for CountVectorizer, which would be a way to help the algorithm (see the stop-words sketch below). At the very least, it could help you pre-cluster your dataset before using your previous function, for instance this way:

df.groupby('labels')['resolution'].apply(cluster_resolution)

(That way, if your first clustering is roughly OK, you would only check each value against the other values in its cluster, not against every value.)
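
For the custom stop words idea, here is a minimal sketch; the extra words are purely illustrative and you would pick them by inspecting your own logs:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# hypothetical domain stop words added on top of the built-in English list
domain_stop_words = list(ENGLISH_STOP_WORDS) + ['user', 'station', 'properly', 'working']

cv = CountVectorizer(stop_words=domain_stop_words)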

Credits to @anon01 for the computation of the distance matrix in this answer, which seems to give slightly better results than the default of hdbscan.

Edit:

Another try, including:

  • a change of metric,
  • adding a step with a TF-IDF model,
  • and adding a step to lemmatize the words using the nltk package.

So this would be:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# The nltk resources may need to be downloaded once, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

d = {...}  # the sample data from the question
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    
    tag_dict = {
                "J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,
                }

    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence) 
    
    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)   
    
    # Convert the list of (token, tag) to lemmatized tokens
    lems = [
            lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN) )
            for token, tag
            in tagged
            ]

    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)

corpus = df['lemmatized']
pipe = Pipeline(
        [
                ('cv', CountVectorizer(stop_words="english")),
                ('tfid', TfidfTransformer())
         ]).fit(corpus)

transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=pipe.named_steps['cv'].get_feature_names(),
        index=df['resolution'])

dist = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(dist)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))

You could also add some specific code, as your sample seems to be about very specific maintenance logs.

For instance, you could add new features to the transformed dataframe based on a small list of hardware/software terms:

import numpy as np

# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

This step could help produce better results but may not be necessary. I'm not sure how it will compare to FuzzyWuzzy's performance for matching strings, but I would be interested in your feedback!

tgrandje