
Hello, I am trying to filter bad words out of a word list. The lists I run this script on usually have 5 to 10 million lines of words. I tried threading to make it fast, but after the first 20k words it gets slower and slower. Why is that, and would it be faster if I used multiprocessing instead? I run this script on Ubuntu with 48 CPU cores and 200 GB of RAM.

from tqdm import tqdm
import queue
import threading

a = input("The List: ") + ".txt"
thr = input("Threads: ")
c = input("clear old[y]: ")
inputQueue = queue.Queue()

if c == 'y' or c == 'Y':  # clear the old output file
    open("goodWord.txt", 'w').close()

s = ["bad_word"]  # bad words list

class myclass:
    def dem(self, my_word):
        # return 1 if the word contains any bad word, else 0
        for key in s:
            if key in my_word:
                return 1
        return 0

    def chk(self):
        while 1:
            # re-read the whole output file for every single word
            old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()
            my_word = inputQueue.get()
            if my_word not in old:
                rez = self.dem(my_word)
                if rez == 0:
                    sav = open("goodWord.txt", "a+")
                    sav.write(my_word + "\n")
                    sav.close()
                self.pbar.update(1)
            inputQueue.task_done()



    def run_thread(self):
        for y in tqdm(open(a, 'r', encoding='utf-8', errors='ignore').readlines()):
            inputQueue.put(y)

        tqdm.write("All in the Queue")
        self.pbar = tqdm(total=inputQueue.qsize(), unit_divisor=1000)
        for x in range(int(thr)):
            t = threading.Thread(target=self.chk)
            t.daemon = True
            t.start()
        inputQueue.join()

try:
    open("goodWord.txt", "a")
except OSError:
    open("goodWord.txt", "w")

old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()
myclass = myclass()
myclass.run_thread()



oscar0
  • Welcome to SO! If you're running CPython, it's effectively single-threaded thanks to the global interpreter lock, so adding more threads is like adding more kids fighting over a single piece of candy. Use multiprocessing and/or write a C extension. – ggorlen Apr 14 '20 at 23:02
  • Welcome to StackOverflow! If you're looking for general review and critique of a working piece of code, you may want to consider posting to [CodeReview.SE](https://codereview.stackexchange.com) instead. – Brian61354270 Apr 14 '20 at 23:04
  • I don't think this is really appropriate for CR unless OP also wants feedback on their code as a whole instead of just performance. OP should see [this](https://codereview.stackexchange.com/help/on-topic) before migrating in any case. – ggorlen Apr 14 '20 at 23:04
  • 1
    Performance is a complex topic, can you be more specific? Some notes on style and similar things: You’re seemingly mixing multiple naming conventions. Keep it simple and stick to the basics: `CamelCase` for classes, `lower_case_with_underscores` for functions and variables. Using a bare except in the way you are here is a bad idea, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. Finally, you should use a context manager to handle file objects, they’re great. – AMC Apr 14 '20 at 23:08
  • It seems you are doing a string search. From what I recall, the in operator (or not in) iterates over the whole list. Perhaps look into using a more efficient algorithm like KMP or Boyer-Moore. – Sri Apr 14 '20 at 23:10
  • Easiest would be to split your input into, say, 48 files, and run 48 processes (scripts) in parallel. From your description, it sounds like your process can be run easily in parallel, with no dependencies between the inputs. – 9769953 Apr 14 '20 at 23:11
  • Threading in Python is still very much single-core, since it doesn't release the GIL. Multiprocessing can use all cores in parallel, but it has the overhead that it will copy the individual data (I think that a recent change may allow for sharing data across processes). – 9769953 Apr 14 '20 at 23:13
  • More on style/design: Don't use 0 and 1 for boolean values unless you really need to. In the same vein, use `while True:` instead of `while 1:`. – AMC Apr 14 '20 at 23:17
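
For illustration, here is a minimal sketch of the split-and-run-in-parallel idea from the comments above, using multiprocessing.Pool. The input file name, the chunking scheme, and the filter_chunk helper are assumptions made for this sketch, not part of the original script:

import multiprocessing

BAD_WORDS = ("bad_word",)  # same bad-word list as in the question

def filter_chunk(chunk):
    # keep only the words that contain no bad word
    # (substring check, matching the question's dem() method)
    return [w for w in chunk if not any(bad in w for bad in BAD_WORDS)]

if __name__ == "__main__":
    with open("words.txt", encoding="utf-8", errors="ignore") as f:
        words = [line.rstrip("\n") for line in f]

    # one chunk per CPU core; each chunk is filtered in its own process
    n = multiprocessing.cpu_count()
    chunks = [words[i::n] for i in range(n)]

    with multiprocessing.Pool(processes=n) as pool:
        results = pool.map(filter_chunk, chunks)

    with open("goodWord.txt", "w", encoding="utf-8") as out:
        for chunk in results:
            for word in chunk:
                out.write(word + "\n")

Whether this beats a single fast process depends on how expensive the per-word check is; for a plain substring test, the cost of shipping millions of strings between processes can easily dominate.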

1 Answer


For the sake of curiosity and education, I wrote a virtually identical (in function) program:

import pathlib

from tqdm import tqdm

# check_words_file_path = pathlib.Path(input("Enter the path of the file which contains the words to check: "))
check_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/check_words.txt")

good_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/good_words.txt")

bad_words = {"abadword", "anotherbadword"}

# load the list of good words
with open(good_words_file_path) as good_words_file:
    stripped_lines = (line.rstrip() for line in good_words_file)
    good_words = {stripped_line for stripped_line in stripped_lines if stripped_line}

# check each word to see if is one of the bad words
# if it isn't, add it to the good words
with open(check_words_file_path) as check_words_file:
    for curr_word in tqdm(check_words_file):
        curr_word = curr_word.rstrip()
        if curr_word not in bad_words:
            good_words.add(curr_word)

# write the new/expanded list of good words back to file
with open(good_words_file_path, "w") as good_words_file:
    for good_word in good_words:
        good_words_file.write(good_word + "\n")

It is based on my understanding of the original program, which, as I already mentioned, I find far too complex.

I hope that this one is clearer, and it is almost certainly much faster. In fact, this might be fast enough that there is no need to consider things like multiprocessing.
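
One detail worth flagging: the original dem() method rejects a word if any bad word appears anywhere inside it (a substring check), while the rewrite above rejects only exact matches against the bad_words set. If the substring behavior is the intended one, the check loop could, for example, be written like this (a drop-in variant reusing the names from the snippet above):

# variant of the check loop with substring semantics,
# like the original dem() method
with open(check_words_file_path) as check_words_file:
    for curr_word in tqdm(check_words_file):
        curr_word = curr_word.rstrip()
        # reject the word if any bad word occurs anywhere inside it
        if curr_word and not any(bad in curr_word for bad in bad_words):
            good_words.add(curr_word)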

AMC
  • @oscar0 Let me know how it performs, and if I misunderstood any part of the operation :) – AMC Apr 15 '20 at 01:17