I've cobbled together some working Python that opens a text file, converts its contents to lowercase, eliminates stopwords, and outputs a list of the most frequently used words in the file:
from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Read the whole file and lowercase every token
with open("ocr.txt") as file1:
    words = [word.lower() for word in file1.read().split()]

# Write the non-stopword tokens out in a single pass
with open('cleaned_output.txt', 'a') as append_file:
    for word in words:
        if word not in stop_words:
            append_file.write(" " + word)

# Count word frequencies in the cleaned file
with open("cleaned_output.txt") as input_file:
    count = Counter(word for line in input_file
                    for word in line.split())

with open('test.txt', 'a') as out:
    print(count.most_common(10), file=out)
I'd like to amend it to perform the same actions on every file in a directory, and output the results either to a unique text file per input or as rows in a CSV. I know that os.path can probably be used here, but I'm not sure how. I'd really appreciate some help. Thank you in advance!
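For reference, the closest I've gotten is a sketch like the following. It's only a rough shape, not a finished script: the folder name `input_texts` is a placeholder, and the tiny inline stopword set just stands in for NLTK's `stopwords.words('english')` so the snippet runs on its own:

```python
import csv
import os
from collections import Counter

# Placeholder stopword set; the real script would use
# set(stopwords.words('english')) from NLTK as above.
stop_words = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'on'}

def top_words(path, n=10):
    """Return the n most common non-stopword tokens in one file."""
    with open(path) as f:
        words = [w.lower() for w in f.read().split()]
    return Counter(w for w in words if w not in stop_words).most_common(n)

def process_directory(directory, csv_path):
    """Write one CSV row per .txt file: filename, then its top words."""
    with open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['filename', 'top_words'])
        for name in sorted(os.listdir(directory)):
            if name.endswith('.txt'):
                full_path = os.path.join(directory, name)
                writer.writerow([name, top_words(full_path)])
```

Something like `process_directory('input_texts', 'word_counts.csv')` would then give one CSV row per file; for a unique text file per input instead, I imagine opening `name + '_counts.txt'` inside the loop would do. Is this the right direction?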