1

I have a list of around 500,000 sentences and a list of around 13,000,000 concepts. For each sentence, I want to extract the concepts that occur in it, in the order they appear in the sentence, and write them to an output.

For example, my Python program looks as follows.

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

output = []
counting = 0

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)

The output is: [['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]

However, the order of the output is not important to me, i.e. my output could also look as follows (in other words, the lists inside the output can be shuffled):

[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

However, due to the number of sentences and concepts, this program is still quite slow.

Is it possible to further improve the performance (in terms of time) using multithreading in Python?

EmJ
  • Multithreading in Python is great when your threads are mainly waiting - for network resources, web requests, DB queries, disk I/O, etc. In this case it looks like you are CPU bound, in which case multithreading won't help at all. – Danielle M. Jan 07 '19 at 00:36
  • There are lots of other things I'd try before multithreading: PyPy, Cython, a smarter algorithm, etc. – Jared Smith Jan 07 '19 at 01:23
  • How many sentences in the data? How many concepts? – wwii Jan 07 '19 at 03:29
  • @wwii I have mentioned it in the question (500,000 sentences and 13,000,000 concepts) – EmJ Jan 07 '19 at 03:32
  • How long does it take to create `find_all_concepts`? – wwii Jan 07 '19 at 03:53
  • @wwii that is the problem: processing one sentence with `find_all_concepts` takes about 3-5 minutes. However, my concepts list is alphabetically ordered. So, is it possible to reduce this time by taking advantage of the alphabetical order? – EmJ Jan 07 '19 at 03:55
  • Probably not - you have single-word and multiple-word concepts - no way to sort the sentence words to match that. – wwii Jan 07 '19 at 04:15
  • What is the maximum concept word length - 2, 3, 4? – wwii Jan 07 '19 at 04:19
  • @wwii there are some concepts that are 8-10 words long, but most are in the range of 2-4 words. – EmJ Jan 07 '19 at 04:21
  • Python [Multiprocessing vs Threading - pros/cons](https://stackoverflow.com/a/3046201/2823755) SO answer. There are other good answers on that page. – wwii Jan 07 '19 at 16:35
  • Is the data available online? – wwii Jan 07 '19 at 17:09
  • @wwii no, the data is not available online. But I can provide more data than mentioned in the question if required :) – EmJ Jan 07 '19 at 22:25

3 Answers

2

Whether multi-threading will yield an actual performance increase does not just depend on the implementation in Python and the amount of data; it also depends on the hardware executing the program. In some cases, where the hardware offers no advantage, multi-threading may end up slowing things down due to increased overhead.

However, assuming you're running on a modern standard PC or better, you may see some improvement with multi-threading. The problem then is to set up a number of workers, pass the work to them and collect the results.

Staying close to your example structure, implementation and naming:

import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()


# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)

User @wwii asked about multiple threads not really affecting performance for CPU-bound problems. Instead of using multiple threads that access the same output variable, you could also use multiple processes that access a shared output queue, like this:

import re
import queue
import multiprocessing

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, q_out):
    try:
        while True:
            sentence = q_in.get(False)
            q_out.put(find_all_concepts(sentence))
    except queue.Empty:
        pass


if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # put all the input on the queue before starting the workers, so a worker
    # cannot see an empty queue and exit before the work has arrived
    for s in sentences:
        sentences_q.put(s)

    # any reasonable number of workers
    num_processes = 2
    pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))

    # wait for the entire queue to be processed
    pool.close()
    pool.join()
    while not output_q.empty():
        print(output_q.get())

This has more overhead still, but it also uses the CPU resources available on other cores.

Grismar
  • Thanks a lot. Can we increase the number of threads to, say, 10, assuming that I have better hardware? – EmJ Jan 07 '19 at 01:13
  • Yes, just change the `num_threads = 2` line to whatever you think is optimal. You could test a bit with a smaller set (but larger than the one given in the example) and use some timing functions to decide what number of threads is optimal for your system - see the sketch after these comments. Bigger is not always better, as your hardware and OS can only do so much in parallel. – Grismar Jan 07 '19 at 01:28
  • It depends @wwii, but since most modern CPUs have multiple cores and feature hyper-threading, having multiple threads could allow the program to make more efficient use of the available compute resources. Of course, resources like memory I/O, network I/O and disk I/O may be the actual bottlenecks, depending on the problem you're trying to solve. But since you ask about 'cpu bound' tasks, yes, as long as there are unused CPU resources that would be accessed by multi-threading. – Grismar Jan 08 '19 at 01:34
  • I'm sorry @wwii, I misinterpreted - disregard that. If they are truly CPU bound, i.e. bottlenecked by the CPU, already fully utilising the capacity of the CPU they are running on - then probably not. You'd have to look for solutions using multiprocessing instead of multithreading. – Grismar Jan 08 '19 at 02:36
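Expanding on the comment above about timing different thread counts: below is a minimal sketch (not part of the original answer) that reuses `sentences` and the threaded `do_find_all_concepts` worker from the listing above; the `run_with_workers` helper and the sample size are illustrative assumptions.

import time
import queue
import threading

def run_with_workers(num_threads, sample):
    """Time how long `num_threads` threads take to drain a queue of sentences."""
    q = queue.Queue()
    results = []  # only needed as the worker's output argument
    # start the workers first; they block on q.get() until work arrives
    for _ in range(num_threads):
        threading.Thread(target=do_find_all_concepts, args=(q, results), daemon=True).start()
    start = time.perf_counter()
    for s in sample:
        q.put(s)
    q.join()  # blocks until every sentence has been processed
    return time.perf_counter() - start

# try a few worker counts on a small sample before committing to one;
# the leftover daemon threads simply stay blocked on their empty queues
sample = sentences * 1000  # stand-in for "a smaller set of real data"
for n in (1, 2, 4, 8):
    print(n, 'threads:', round(run_with_workers(n, sample), 2), 'seconds')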
1

Here are two solutions using concurrent.futures.ProcessPoolExecutor, which will distribute the tasks to different processes. Your task appears to be CPU bound, not I/O bound, so threads probably won't help.

import re
import concurrent.futures

# using the lists in your example

re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':

    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)   

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

Executor.map() has a chunksize parameter: the docs say that for very long iterables, sending chunks of more than one item can significantly improve performance. I tested this with a function that just returns what it was sent, and regardless of the chunksize I specified the test function only ever received single items - which turns out to be the documented behaviour: chunksize only controls how many items are shipped to a worker process per task, not how the function is called, so the function does not need to be refactored.

def h(sequence):
    return sequence
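
For reference, here is a hedged sketch of passing chunksize to Executor.map, reusing f and sentences from the code above; the value 1000 is an arbitrary illustration, not a recommendation.

import concurrent.futures

# assumes f() and sentences as defined above
if __name__ == '__main__':
    out3 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # chunksize only batches how many sentences are shipped to a worker
        # per task; f() is still called with a single sentence at a time
        for result in executor.map(f, sentences, chunksize=1000):
            out3.append(result)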

One drawback with multiprocessing is that the data must be serialized (pickled) to be sent to the worker process, which takes time and might be significant for a compiled regular expression as large as this one - it might defeat the gains from multiple processes.

I made a set of 13e6 random strings with 20 characters each to approximate your compiled regex.

import random, string

data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Pickling to an io.BytesIO stream takes about 7.5 seconds and unpickling from an io.BytesIO stream takes about 9 seconds. If using a multiprocessing solution, it may be beneficial to pickle the concepts object (in whatever form) to the hard drive once, then have each process unpickle it from the hard drive, rather than pickling/unpickling on each side of the IPC each time a new process is created - definitely worth testing, YMMV. The pickled set is 380 MB on my hard drive.
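
A minimal sketch of that pickle-to-disk idea (assuming Python 3.7+ for ProcessPoolExecutor's initializer argument; the file name concepts.pickle, the _load_concepts initializer and the simplified single-word find_in_sentence are illustrative assumptions, reusing concepts and sentences from the question):

import pickle
import concurrent.futures

_concepts = None  # per-process global, filled in once by the initializer

def _load_concepts(path='concepts.pickle'):
    """Runs once in each worker process: unpickle the concepts from disk."""
    global _concepts
    with open(path, 'rb') as fh:
        _concepts = pickle.load(fh)

def find_in_sentence(sentence):
    # simplified to single-word membership tests, just to keep the sketch short
    return [word for word in sentence.split() if word in _concepts]

if __name__ == '__main__':
    # write the concepts out once, in the parent, before starting any workers
    with open('concepts.pickle', 'wb') as fh:
        pickle.dump(set(concepts), fh, protocol=pickle.HIGHEST_PROTOCOL)

    with concurrent.futures.ProcessPoolExecutor(initializer=_load_concepts) as executor:
        results = list(executor.map(find_in_sentence, sentences))

Note that each worker still holds its own in-memory copy of the set, so this only avoids re-pickling it through the IPC pipe; it does not reduce the overall RAM footprint mentioned below.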

When I tried some experiments with concurrent.futures.ProcessPoolExecutor I kept blowing up my computer, because each process needed its own copy of the set and my computer just doesn't have enough RAM.

I'm going to post another answer dealing with the method of testing for concepts in sentences.

wwii
1

This answer will address improving performance without using concurrency.


The way you structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes per sentence and that the concepts range from one to ten words in length.

I think you can improve the search time by making a set of the concepts (either initially when constructed, or from your list), then splitting each sentence into strings of one to ten (consecutive) words and testing each of those strings for membership in the set.

Example of a sentence split into 4 word strings:

'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
 ('mining', 'is', 'the', 'process'),
 ('is', 'the', 'process', 'of'),
 ('the', 'process', 'of', 'discovering'),
 ('process', 'of', 'discovering', 'patterns'),
 ('of', 'discovering', 'patterns', 'in'),
 ('discovering', 'patterns', 'in', 'large'),
 ('patterns', 'in', 'large', 'data'),
 ('in', 'large', 'data', 'sets'),
 ('large', 'data', 'sets', 'involving'),
 ('data', 'sets', 'involving', 'methods'),
 ('sets', 'involving', 'methods', 'at'),
 ('involving', 'methods', 'at', 'the'),
 ('methods', 'at', 'the', 'intersection'),
 ('at', 'the', 'intersection', 'of'),
 ('the', 'intersection', 'of', 'machine'),
 ('intersection', 'of', 'machine', 'learning'),
 ('of', 'machine', 'learning', 'statistics'),
 ('machine', 'learning', 'statistics', 'and'),
 ('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]

Process:

concepts = set(concepts)
sentence = sentence.split()
#one word
for meme in sentence:
    if meme in concepts:
        #keep it
#two words
for meme in zip(sentence,sentence[1:]):
    if ' '.join(meme) in concepts:
        #keep it
#three words
for meme in zip(sentence,sentence[1:],sentence[2:]):
    if ' '.join(meme) in concepts:
        #keep it

Adapting an itertools recipe (pairwise) you can automate that process of making n-word strings from a sentence:

from itertools import tee
def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

Testing each sentence looks like this:

sentence = sentence.strip().split()
for n in [1,2,3,4,5,6,7,8,9,10]:
    for meme in nwise(sentence,n):
        if ' '.join(meme) in concepts:
            #keep meme

I made a set of 13e6 random strings with 20 characters each to approximate concepts.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Testing a four or forty character string for membership in data consistently takes about 60 nanoseconds. A one hundred word sentence has 955 one to ten word strings so searching that sentence should take ~60 microseconds.
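
As a sketch of how such a timing could be taken (the smaller set size and variable names here are assumptions so the snippet runs quickly; absolute numbers will vary by machine):

import random
import string
import timeit

# a smaller stand-in for the 13e6-entry set above, so the sketch runs quickly
small_data = set(''.join(random.choice(string.printable) for _ in range(20))
                 for _ in range(1000000))

probe = next(iter(small_data))  # a string known to be in the set
missing = 'x' * 40              # a 40-character string that cannot be in the set

for label, s in (('hit', probe), ('miss', missing)):
    per_call = timeit.timeit(lambda: s in small_data, number=1000000) / 1000000
    print(label, 'takes about', round(per_call * 1e9), 'ns per membership test')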

The first sentence from your example 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems' has 195 possible concepts (one to ten word strings). Timing for the following two functions is about the same: about 140 microseconds for f and 150 microseconds for g:

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence,n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1,11) for meme in nwise(sentence,n))
    found = test_strings.intersection(data)
    return found

So these are just approximations, since I'm not using your actual data, but it should speed things up quite a bit.

After testing with your example data I found that g won't work if a concept appears twice in a sentence - the intermediate set collapses duplicates, so a repeated concept is only reported once.


So here it is all together, with the concepts listed in the order they are found in each sentence. The new version of f will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it takes than the original? (I'm curious.)

from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(words,n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme)+start
                while (start,end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...') 
                    start = sentence.find(meme, end)
                    end = len(meme)+start
                indices.add((start, end))
    return [sentence[start:end] for (start,end) in sorted(indices)]


###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]})')


In [20]: results
Out[20]: 
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]
wwii
  • Wow, this is impressive. So, if I understand you correctly, `data` is the `concepts` list and I need to append the `found` results to `output`? I will run this on my real dataset and let you know the results. Very excited to run it. – EmJ Jan 08 '19 at 23:48
  • Hi, I am still unable to run the program as I am not quite sure what is meant by `data`, `nwise`, `found`. Can you please post the full code? I look forward to hearing from you. – EmJ Jan 09 '19 at 00:06
  • `nwise` is the function I showed; it uses itertools.tee. `data` is a set I made for testing; it is supposed to approximate your `concepts`. See the edit. – wwii Jan 09 '19 at 03:39
  • I will try this and let you know the results. Thank you very much :) – EmJ Jan 09 '19 at 04:16
  • Hi, I ran your code on a part of my dataset. It is faster than the program I currently have. Very impressed :) However, the only issue I have is that the extracted `concepts` are not in the order of the sentence, i.e. given concepts like `["big example", "small example", "example", "small", "big"]`, running on the sentence `"this big example has a small solution for example"` should find `["big example", "small", "example"]`. Please let me know if my explanation is not clear. – EmJ Jan 09 '19 at 05:08
  • Sorry I didn't notice that part of the spec - see the edit. – wwii Jan 09 '19 at 19:50
  • Thanks a lot. This runs absolutely fast. I am so happy and I am going to use this for my work :) Just wondering if it is possible to only keep the longest concept as the result, i.e. given a sentence `persistence length of sodium` and concepts = `['persistence', 'length', 'persistence length', 'sodium']`, only extract `['persistence length', 'sodium']`, not `persistence` and `length` individually. Please let me know if my explanation is not clear. – EmJ Jan 10 '19 at 01:25
  • Hi, I fixed the above issue by adding the following code at the end of the program: `results = [[w1 for w1 in l if not any([w2 != w1 and w2.find(w1) >= 0 for w2 in l])] for l in results]`. – EmJ Jan 10 '19 at 01:41
  • sentence = `viscosity behaviour and persistence length of sodium xanthan in aqueous sodium chloride zero shear rate intrinsic viscosities eta of sodium xanthan in aqueous nacl at 25 degrees c were determined for five samples ranging in weight average molecular weight from 2 x 10 5 to 4 x 10 6 at salt concentrations cs between 0 005 and 1 m at which the polysaccharide maintains its double helical structure` – EmJ Jan 10 '19 at 01:54
  • results = `['viscosity', 'behaviour', 'persistence length', 'xanthan', 'xanthan', 'aqueous', 'aqueous', 'sodium chloride', 'zero', 'shear rate', 'intrinsic', 'viscosities', 'eta', 'nacl', 'degrees c', 'determined', 'samples', 'ranging', 'weight average molecular weight', 'salt', 'concentrations', 'cs', 'and 1', 'polysaccharide', 'double helical', 'structure']` – EmJ Jan 10 '19 at 01:54
  • However, I encountered another small issue. For the sentence mentioned above in the comments, I get the above-mentioned output. As you can see, "xanthan" and "aqueous" have appeared twice even though each is used once in the sentence before `zero shear rate`. I re-checked my concepts list and I can assure you that the two words "xanthan" and "aqueous" appear only once in the concepts list. Do you have any idea how to resolve this? – EmJ Jan 10 '19 at 01:56
  • Good one: see the edit - I refactored `f` to keep track of indices and look for the concept again if it had already been found. I don't think it will increase the time appreciably. – wwii Jan 10 '19 at 19:39
  • Thank you so much. It works perfectly now. This is impressive. My whole dataset will be processed in 17 hours according to my time calculation. Earlier it was like half a month. Thanks a lot for the help. I truly appreciate it. Thank you again :) – EmJ Jan 10 '19 at 23:14
  • So about 0.122 seconds per sentence? Seems a bit high. If you have enough RAM you could try some multiprocessing tests with concurrent.futures using this function. – wwii Jan 11 '19 at 00:26
  • That is a really great idea. I will try that today. Now I can let the dataset run over the weekend and get back to work on Monday. Such a relief of a weekend, thanks to you :) Thanks a lot once again. – EmJ Jan 11 '19 at 00:36
  • Hi, sorry for getting back to you again. I am wondering why it takes a long time (I feel like the program gets stuck) when a sentence like this comes along: `four 1 76 heterozygotes for the triple alpha gene arrangement alpha alpha alpha alpha alpha were found.` – EmJ Jan 11 '19 at 22:39
  • No way I can check without your data. Do you understand how the function works? You can test it with that sentence; it only has 117 checks. I put some *prints* in the function that you can use to troubleshoot; you may need to add more. Sequential repeated words that are concepts will cause duplication of effort, which is wasted. – wwii Jan 11 '19 at 23:33
  • Thank you so much. That is truly very helpful. I will check the prints to troubleshoot :) Thanks a lot again. – EmJ Jan 11 '19 at 23:38
  • Hip hip hooray! Cheers! I could successfully run my entire dataset. It ran faster than my time calculations predicted. I would highly recommend this solution to anyone who is reading my question. Thanks a lot @wwii. I am really impressed by your solution. Thank you for your constant support once again :) :) :) – EmJ Jan 13 '19 at 01:26
  • Hi, I am again facing a problem where I need to improve efficiency, and I thought you might have some really good ideas to improve it. My question is: https://stackoverflow.com/questions/56537560/how-to-efficiently-calculate-triad-census-in-undirected-graph-in-python Please let me know your thoughts. Thank you very much :) – EmJ Jun 14 '19 at 05:48