
I've cobbled together some working Python that opens a text file, converts it to lowercase, eliminates stopwords, and outputs a list of the most frequently used words in the file:

from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

file1 = open("ocr.txt")
line = file1.read()
words = line.split()
words = [word.lower() for word in words]

# open the output file once instead of reopening it for every word
appendFile = open('cleaned_output.txt', 'a')
for r in words:
    if r not in stop_words:
        appendFile.write(" " + r)
appendFile.close()

with open("cleaned_output.txt") as input_file:
    count = Counter(word for line in input_file
                         for word in line.split())

print(count.most_common(10), file=open('test.txt', 'a'))

I'd like to amend it to perform the same actions on all files in a directory, and output the results to unique text files or as rows in a csv. I know that os.path can probably be used here, but I'm not sure how. I'd really appreciate some help. Thank you in advance!

Paul
  • something like `files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory,f))]`? – Jan Stránský Aug 18 '20 at 18:45
  • First I would throw all your current code into a function that takes a file path as the param, then use `os.listdir()` to give you a list of files, then loop through the files passing them to the function. After that, it's simply writing it all to a file. – Rashid 'Lee' Ibrahim Aug 18 '20 at 18:47
  • @Rashid'Lee'Ibrahim Thanks for the tip. How might I go about doing that? – Paul Aug 18 '20 at 19:30

2 Answers


I converted your snippet into a function that takes the path to the folder containing the input files as an argument. The following code processes every file in the specified folder and writes both a cleaned_output.txt and a test.txt for each one to a newly created output directory. The output files have the name of the input file they were generated from appended to the end, to make them easier to tell apart, but you can change that to suit your needs.

from collections import Counter
from nltk.corpus import stopwords
import os


path = 'input/'

def clean_text(path):
    out_path = 'output/'
    os.makedirs(out_path, exist_ok=True)  # create the output dir if it doesn't exist

    # os.listdir returns bare names, so join them with the folder path
    files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
    file_paths = [os.path.join(path, f) for f in files]
    # splitext drops the extension cleanly; str.strip('.txt') would also
    # eat leading/trailing 't', 'x', etc. from the name itself
    file_names = [os.path.splitext(f)[0] for f in files]

    stop_words = set(stopwords.words('english'))

    for idx, f in enumerate(file_paths):
        file1 = open(f)
        line = file1.read()
        words = line.split()
        words = [word.lower() for word in words]

        cleaned_path = os.path.join(out_path, 'cleaned_output_{}.txt'.format(file_names[idx]))
        appendFile = open(cleaned_path, 'a')
        for r in words:
            if r not in stop_words:
                appendFile.write(" " + r)
        appendFile.close()

        with open(cleaned_path) as input_file:
            count = Counter(word for line in input_file
                                 for word in line.split())

        print(count.most_common(10),
              file=open(os.path.join(out_path, 'test_{}.txt'.format(file_names[idx])), 'a'))


clean_text(path)

Is this what you were looking for?
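You also mentioned writing the results as rows in a CSV instead of separate text files. Here's a minimal sketch of that variant using the stdlib `csv` module; the `counts_by_file` dict, the example data, and the `results.csv` filename are just assumptions standing in for the counters you'd build in the loop above:

```python
import csv
from collections import Counter

# Hypothetical: one Counter per input file, as produced by the loop above.
counts_by_file = {
    'doc1.txt': Counter({'apple': 3, 'pear': 2}),
    'doc2.txt': Counter({'cat': 5, 'dog': 1}),
}

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['file', 'word', 'count'])  # header row
    for name, counts in counts_by_file.items():
        # one row per (file, word) pair, top 10 words per file
        for word, count in counts.most_common(10):
            writer.writerow([name, word, count])
```

A long format like this (one word per row) is usually easier to filter and sort in a spreadsheet than packing all ten words into a single row per file.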

  • The code is expecting all input files to be .txt files and will break if they are not. Some exception handling would do the trick if you want to make it work in all cases. But if you're sure that inputs will always be txt files then it will work as it is – Taimur Ibrahim Aug 18 '20 at 21:09
  • Thanks so much for helping with this. This sounds like exactly what I'm looking for. Unfortunately, though, when I run the amended code, the output directory is created but is empty. – Paul Aug 18 '20 at 21:27
  • So it seems to run, but actually hangs on line 24 with traceback: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 1432: invalid start byte`. The error looks similar to [this](https://stackoverflow.com/questions/42339876/error-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-in/42340744). Any ideas? – Paul Aug 19 '20 at 01:52
  • There are probably some special characters in your input text that are causing the error. Have you tried the fix provided in the question you linked? i.e. `with open(path, 'rb') as f: contents = f.read()` – Taimur Ibrahim Aug 19 '20 at 08:12
  • I tried the solution, but I'm honestly not sure where it fits or how to adapt it. – Paul Aug 19 '20 at 12:37
  • Try adding it to where ever you're opening and reading files. I don't have the exact text that you're running on so it's kind of hard for me to debug the exact issue. – Taimur Ibrahim Aug 19 '20 at 12:49
  • The solution is to amend line 23 to read `file1 = open(f, encoding='latin-1')`. Success, and thanks again! – Paul Aug 19 '20 at 14:44

You can get all the files in a directory using os.listdir. This returns a list of the names of all the items in a directory as strings that you can iterate through (names only, so join them with the directory path before opening them).
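For example, a minimal sketch (the `input` directory name and the `.txt` filter are assumptions; the `makedirs` call is just there so the demo runs even if the folder doesn't exist yet):

```python
import os

directory = 'input'  # hypothetical folder of text files
os.makedirs(directory, exist_ok=True)  # ensure the folder exists for this demo

for name in os.listdir(directory):
    # listdir gives bare names like 'ocr.txt', not full paths, so join them
    full_path = os.path.join(directory, name)
    if os.path.isfile(full_path) and name.endswith('.txt'):
        print(full_path)  # process each file here
```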

John Smith