
I have multiple text files in a folder, 21,941 in total. My code works fine for a small number of text files, but when I run it on 5,000 text files it gets stuck while reading. When I run it on the full data set, reading takes 3 hours and still does not finish. Please help me improve my code, or show me how I can use the GPU or multiprocessing for this task.

This block of code reads a file and returns its contents as a list of words:

import string
from nltk.corpus import stopwords

def wordList(doc):
    """
    1: Remove punctuation
    2: Remove stop words
    3: Return the remaining words as a list
    """
    file = open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

This block of code reads the file names from the folder:

from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)

And this block of code builds a dictionary of all documents, with the file name as the key and its content as a list of words:

documents = {}
for i in file_names[:5000]:
    documents[i] = wordList(i)

This is the link to the data set.

My system specs: i7 quad-core with 16 GB RAM.
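One idea I have not verified yet is to parallelise the reading step with the standard-library multiprocessing module. This is only a rough, untested sketch (the pool size of 4 is just a guess matching my quad-core CPU), using the wordList and file_names defined above.

from multiprocessing import Pool

if __name__ == "__main__":
    # Each worker process runs wordList on a share of the file names.
    with Pool(processes=4) as pool:
        word_lists = pool.map(wordList, file_names)
    # Rebuild the same {file name: word list} dictionary as before.
    documents = dict(zip(file_names, word_lists))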

  • Your biggest issue is here: `stopwords.words('english')`. Outside your comprehension, use `english_stopwords = set(stopwords.words('english'))` and then do: `[word for word in nopunc.split() if word.lower() not in english_stopwords]`. Also, you may as well use `punctuation = set(string.punctuation)`. You should also not do the set conversion inside the function; do it outside, even if you use a global variable. (A sketch putting these suggestions together follows the comment thread below.) – juanpa.arrivillaga Sep 17 '19 at 20:31
  • @juanpa.arrivillaga I am not sure about Python's set implementation, but given that there are about 10 punctuation signs, I would actually expect a linear search through the punctuation array to be faster than a lookup in any conventional hash table. – SergeyA Sep 17 '19 at 20:38
  • I would expect the conversion from string to list and back to be quite a drag on performance with decently sized data inputs. – SergeyA Sep 17 '19 at 20:41
  • @SergeyA you'd be surprised. Linear search starts losing against the hash set at a handful of items in Python; remember, it's not a primitive array, well, underneath the hood it's an array of PyObject pointers... Anyway, I've timed it and the hash set wins, `37.7 ns ± 0.422 ns` for the list vs `31.9 ns ± 0.282 ns` for the set. – juanpa.arrivillaga Sep 17 '19 at 22:03
  • @juanpa.arrivillaga thank you, it works fine now and I can load all my data. 12 GB of my RAM is in use now. – Akmal Masud Sep 18 '19 at 00:51
  • @juanpa.arrivillaga Now this code gets stuck; how can I improve it? `#create a corpus containing the vocabulary of words in the documents corpus = [] # a list that will store words of the vocabulary for doc in documents.values(): #iterate through documents for word in doc: #go through each word in the current doc if not word in corpus: corpus.append(word) #add word in corpus if not already added` – Akmal Masud Sep 18 '19 at 00:52
  • @AkmalMasud One thing that you can do is store your vocab in a set. Adding a word to a list gets expensive as the list gets bigger. Another is, rather than converting to string and back, you can simply use re to remove the punctuation like `re.sub(f'[{string.punctuation}]', '', s)` – lahsuk Sep 18 '19 at 10:30
  • @juanpa.arrivillaga interesting indeed! Thanks for this information. – SergeyA Sep 18 '19 at 13:48
  • @lahsuk a set contains only unique words, but I need all the words so I can count them and build the term frequency for all words. – Akmal Masud Sep 18 '19 at 16:48
  • @lahsuk can you check this post of mine and suggest any changes to the code? https://stackoverflow.com/questions/57983960/create-a-corpus-containing-the-vocabulary-of-words?noredirect=1#comment102389356_57983960 – Akmal Masud Sep 18 '19 at 16:50
  • @AkmalMasud try using collections.Counter if you want the word count. Just return this counter and update it after getting it back. (I've done a similar task on ~55k files with ~24M tokens and it took only 17 minutes to complete.) – lahsuk Sep 19 '19 at 04:16
  • @lahsuk can you share code? That would help me. And did you see my other post? – Akmal Masud Sep 19 '19 at 05:48
  • @lahsuk how can I make a bag of words from that much text? My computer gets stuck in the nested loop when it appends to the bag-of-words list. – Akmal Masud Sep 19 '19 at 06:00
  • @AkmalMasud like I said before, use a Counter object if all you want is to count the frequency of words. https://docs.python.org/3/library/collections.html#collections.Counter – lahsuk Sep 19 '19 at 14:08
  • @lahsuk I am doing it this way: `def termFrequencyInDoc(wordList): dict_words = dict(Counter(wordList)); return dict_words` – Akmal Masud Sep 19 '19 at 15:57
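Putting the suggestions from the comments together (a stop-word set built once outside the function, punctuation removed in a single pass, collections.Counter for the per-document term frequency, and a set for the vocabulary), a rough sketch of the reworked code could look like this; it is untested and the names are only illustrative:

import string
from collections import Counter
from nltk.corpus import stopwords

# Built once, outside the function, as suggested in the comments.
english_stopwords = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)

def wordList(doc):
    # Same behaviour as the original function: strip punctuation, drop stop words.
    with open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r',
              encoding="utf8", errors='ignore') as file:
        text = file.read().strip()
    nopunc = text.translate(punctuation_table)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]

def termFrequencyInDoc(words):
    # Counter counts every word in one pass, keeping duplicates in the tally.
    return Counter(words)

# Vocabulary as a set, so membership checks stay cheap as it grows.
vocabulary = set()
for words in documents.values():
    vocabulary.update(words)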
