
I was recently getting into Natural Language Processing for a university project and, given a list of words, I wanted to delete all those words from a dataset of strings. My dataset looks like this, but much bigger:

data_set = ['Human machine interface for lab abc computer applications',
            'A survey of user opinion of computer system response time',
            'The EPS user interface management system',
            'System and human system engineering testing of EPS',
            'Relation of user perceived response time to error measurement',
            'The generation of random binary unordered trees',
            'The intersection graph of paths in trees',
            'Graph minors IV Widths of trees and well quasi ordering',
            'Graph minors A survey']

The list of words to delete looks like this, but again, much longer:

to_remove = ['abc', 'of', 'quasi', 'well']

Since I didn't find any Python function to delete words from strings directly, I used the replace() method. The program takes data_set and, for each word in to_remove, calls replace() on each string of data_set. I was hoping that threads would speed things up, but unfortunately it takes almost the same time as the program without threads. Am I implementing threads correctly? Or did I miss something?

The code with threads is the following:

from multiprocessing.dummy import Pool as ThreadPool

def remove_words(params):
    changed_data_set = params[0]
    for elem in params[1]:
        changed_data_set = changed_data_set.replace(' ' + elem, ' ')
    return changed_data_set

def parallel_task(params, threads=2):
    pool = ThreadPool(threads)
    results = pool.map(remove_words, params)
    pool.close()
    pool.join()
    return results

parameters = [(row, to_remove) for row in data_set]
new_data_set = parallel_task(parameters, 8)

The code without threads is the following:

def remove_words(data_set, to_replace):
    for i in range(len(data_set)):  # don't shadow the built-in len()
        for word in to_replace:     # 'word', not the undefined 'row'
            data_set[i] = data_set[i].replace(' ' + word, ' ')
    return data_set

changed_data_set = remove_words(data_set, to_remove)
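(A simpler single-threaded alternative, not from my original code: split each string into words and keep only those not found in a set. Set membership tests are O(1), whole words are matched, and it also catches a word at the start of a string, which the ' ' + word trick misses. It also avoids replace() mangling substrings, e.g. replacing ' of' inside ' often' leaves ' ten'.)

```python
def remove_words_split(strings, stop_words):
    # Build a set once for O(1) membership tests
    stop = set(stop_words)
    # Keep only the words that are not in the removal set
    return [' '.join(w for w in s.split() if w not in stop)
            for s in strings]

changed = remove_words_split(['Graph minors A survey', 'of the well'],
                             ['of', 'A', 'well'])
# changed == ['Graph minors survey', 'the']
```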
Leo
    https://stackoverflow.com/questions/26432411/multiprocessing-dummy-in-python-is-not-utilising-100-cpu – pask Jul 13 '18 at 10:14
  • Thanks, I now understand what I did wrong and was able to correct my code. – Leo Jul 13 '18 at 10:57

0 Answers