I am trying to open a text file, remove certain words that have a ] after them, and then write the new contents to a new file. With the following code, new_content contains what I need, and a new file is created, but it's empty. I cannot figure out why. I've tried indenting differently and passing in an encoding type, with no luck. Any help greatly appreciated.

import glob
import os
import nltk, re, pprint
from nltk import word_tokenize, sent_tokenize
import pandas
import string
import collections

path = "/pathtofiles"

for file in glob.glob(os.path.join(path, '*.txt')):
    if file.endswith(".txt"):
        f = open(file, 'r')
        flines = f.readlines()
        for line in flines: 
            content = line.split() 

            for word in content:
                if word.endswith(']'):
                    content.remove(word)

            new_content = ' '.join(content)

            f2 = open((file.rsplit( ".", 1 )[ 0 ] ) + "_preprocessed.txt", "w")
            f2.write(new_content)
            f.close
firefly
  • `for word in content: if word.endswith(']'): content.remove(word)` that's removing while iterating: _bad_ – Jean-François Fabre Apr 03 '18 at 18:09
  • `f.close` does nothing, and indentation is wrong. – Jean-François Fabre Apr 03 '18 at 18:09
  • `if file.endswith(".txt")` is guaranteed to always be true because of the globbing you performed. – Jean-François Fabre Apr 03 '18 at 18:10
  • you're not closing `f2` at all – Jean-François Fabre Apr 03 '18 at 18:10
  • You should open the file for writing with mode 'a'. See: https://docs.python.org/3/library/functions.html#open. Or make a list of words and then use `writelines` – Adonis Apr 03 '18 at 18:10
  • @Jean-FrançoisFabre thanks for the comments, I'll work on those. Why is removing while iterating bad? – firefly Apr 03 '18 at 18:16
  • Build `new_content` as you iterate through `content` instead of doing remove. – stark Apr 03 '18 at 18:16
  • @Adonis thanks for the help, I'll look into those. I would like to understand why this code doesn't work, however, as when I originally wrote it for something else it worked perfectly (even if perhaps not the best solution). – firefly Apr 03 '18 at 18:18
  • "Worked perfectly", can you tell more? Can you give us the data you are using? Because right now, using "write" mode instead of "append" when opening a file is going to erase what was previously in the file, which at best will result in a file containing one word. Please have a look at how to create a [mcve] – Adonis Apr 03 '18 at 18:23
  • For remove while iterating, see: https://stackoverflow.com/questions/10665591/how-to-remove-list-elements-in-a-for-loop-in-python – stark Apr 03 '18 at 18:23
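To illustrate the comments above on removing while iterating, here is a minimal sketch. The helper names `remove_in_place` and `filter_words` and the sample list are made up for the illustration and are not part of the question's code. Removing an item shifts the later items one position to the left, so the loop skips the element that slides into the freed slot.

```python
def remove_in_place(words):
    """Mimic the question's loop: remove from the list being iterated."""
    words = list(words)  # copy so the caller's list is left untouched
    for word in words:
        if word.endswith("]"):
            words.remove(word)  # later items shift left; the next one is skipped
    return words


def filter_words(words):
    """Build a new list instead of mutating the one being iterated."""
    return [word for word in words if not word.endswith("]")]


sample = ["keep", "a]", "b]", "end"]
print(remove_in_place(sample))  # ['keep', 'b]', 'end']  ('b]' was skipped)
print(filter_words(sample))     # ['keep', 'end']
```

The in-place version only misbehaves when two matching words are adjacent, which may be why the same loop appeared to work on other input.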

1 Answer

This should work @firefly. Happy to answer questions if you have them.

import glob
import os

path = "/pathtofiles"

for file in glob.glob(os.path.join(path, '*.txt')):
    # The glob pattern already restricts matches to .txt files.
    with open(file, 'r') as f:
        flines = f.readlines()

    new_content = []
    for line in flines:
        content = line.split()

        # Build a new list rather than removing from `content` while iterating over it.
        new_content_line = []
        for word in content:
            if not word.endswith(']'):
                new_content_line.append(word)

        new_content.append(' '.join(new_content_line))

    # Open the output once per input file; `with` closes it, which flushes the data.
    with open(file.rsplit(".", 1)[0] + "_preprocessed.txt", "w") as f2:
        f2.write('\n'.join(new_content))
Peter Dolan
  • Are you sure about the `mode="w"`? – Adonis Apr 03 '18 at 18:23
  • works on my machine ¯\_(ツ)_/¯. AFAIK the difference between w and a is simply creating a new file vs appending. OP seemed to indicate wanting a new file each time so `w` makes more sense to me – Peter Dolan Apr 03 '18 at 18:26
  • Thank you @PeterDolan! I understand what this is doing and why it's better. However the file is still blank for me! I was working on a mac and have tried it on windows too. Is there something else I could be doing wrong? – firefly Apr 04 '18 at 09:33
  • So for some reason in the end this worked for some files but not all...by adding `encoding = 'utf-8'` after `r` and `w` it worked for all files. – firefly Apr 04 '18 at 13:32
  • Interesting, maybe there are some weird characters in your files. Glad to hear it's working though -- any other questions? – Peter Dolan Apr 04 '18 at 16:07
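Pulling the thread together, the following is a hedged sketch rather than the answerer's exact code. It assumes the input files are UTF-8, as the asker's last comment suggests. The function names `strip_bracketed_words`, `preprocess_file` and `preprocess_folder` are hypothetical. Each output file is opened once per input file inside a `with` block, so it is closed and flushed automatically, and the encoding is passed explicitly for both reading and writing.

```python
import glob
import os


def strip_bracketed_words(line):
    """Return the line without whitespace-separated words ending in ']'."""
    return " ".join(w for w in line.split() if not w.endswith("]"))


def preprocess_file(src, encoding="utf-8"):
    """Write a filtered copy of src next to it and return the new path."""
    dst = os.path.splitext(src)[0] + "_preprocessed.txt"
    # Both files are opened once; the with-block closes them on exit.
    with open(src, "r", encoding=encoding) as fin, \
         open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(strip_bracketed_words(line) + "\n")
    return dst


def preprocess_folder(path):
    """Process every .txt file in path, skipping outputs of earlier runs."""
    outputs = []
    for src in sorted(glob.glob(os.path.join(path, "*.txt"))):
        if not src.endswith("_preprocessed.txt"):
            outputs.append(preprocess_file(src))
    return outputs
```

Calling `preprocess_folder("/pathtofiles")` would process the folder from the question. Opening the output once matters with mode `"w"`: it truncates the file every time it is opened, so reopening it for each line, as the question's code does, leaves only the last line processed, and that is an empty string if the input ends with a blank line. Unlike the answer above, this sketch ends every output line with a newline.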