-1

I have a large text file that I'd like to turn into a list of words. So far I can get a separate list for each line in the file, but I want a single list.

Here's what I have.

import unicodedata
import codecs

infile = codecs.open('FILE.txt', 'r', encoding ='ascii', errors = 'ignore')
outfile = codecs.open('FILE2.txt', 'w', encoding ='ascii', errors = 'ignore')

for word in infile:
    mylist = str(word.split())

    outfile.write(mylist)
infile.close()
outfile.close()

This gives me an outfile that looks like:

[word, word][word, word, word, word][word, word]...[word,word]

I am hoping to get an outfile that looks like:

[word, word, word, .... word, word, word]

I know how to concatenate multiple lists, but these lists are immediately written to my outfile. As written, my code does not let me concatenate the lists after the fact.

UPDATE:

Thank you for all of your help. I have solved the problem with the following:

import unicodedata
import codecs

infile = codecs.open('FILE1.txt', 'r', encoding ='ascii', errors = 'ignore')
outfile = codecs.open('FILE2.txt', 'w', encoding ='ascii', errors = 'ignore')

mylist =[]
for line in infile:
    for word in line.split():
        mylist.append(word)



outfile.write(str(mylist))
infile.close()
outfile.close()
TheArtofXin
  • do you want a list with duplicates or a set without dupes? is order important? – Patrick Artner Oct 22 '18 at 18:04
  • Possible duplicate of [How to concatenate two lists in Python?](https://stackoverflow.com/questions/1720421/how-to-concatenate-two-lists-in-python) –  Oct 22 '18 at 18:04
  • might try `for word in infile.readlines():`... – chickity china chinese chicken Oct 22 '18 at 18:05
  • you are converting a list to a string and then writing to a file rather than converting contents of list to string. – mad_ Oct 22 '18 at 18:06
  • If your problem is writing each line as you find it, then *quit writing each line as you find it*. You control the code: concatenate the lists and save the printing until after the loop; then print the entire concatenated list. Another possibility is to suppress the newline on your `write`. Each of these is a basic technique you can look up. – Prune Oct 22 '18 at 18:11
  • I'm not concerned about dupes or order. – TheArtofXin Oct 22 '18 at 18:11
  • @Prune Thanks. Your comment was the most helpful. – TheArtofXin Oct 22 '18 at 22:29
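
To make the distinction raised in the comments concrete: `str(mylist)` produces the list's Python representation, brackets and quotes included, while `" ".join(mylist)` produces only the words. A minimal sketch, with made-up sample words:

```python
words = ["alpha", "beta", "gamma"]

as_repr = str(words)        # the list's printable form, with brackets and quotes
as_text = " ".join(words)   # just the contents, separated by spaces

print(as_repr)
print(as_text)
```

Either string can be passed to `outfile.write()`; which one fits depends on whether the output file should look like a Python list or like plain text.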

4 Answers

0

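You can use infile.read().split() instead of the for loop (readlines() returns a list, which has no split() method). A more "pythonic" way is to use the with statement, like so:

import codecs

with codecs.open('FILE.txt', 'r', encoding='ascii', errors='ignore') as infile:
    with codecs.open('FILE2.txt', 'w', encoding='ascii', errors='ignore') as outfile:
        # read() returns the whole file as one string; split() breaks it on any whitespace
        outfile.write(str(infile.read().split()))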
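
As a quick check of the idea, here is a minimal sketch that uses an in-memory `io.StringIO` object in place of a real file. The sample text and the helper name `words_from` are made up for illustration. It suggests that calling `split()` on the full text gives the same flat list as looping over lines and then over words.

```python
import io

def words_from(handle):
    """Return every whitespace-separated word in an open text handle."""
    # read() returns the whole text as one string; split() with no
    # arguments splits on any run of whitespace, newlines included
    return handle.read().split()

sample = "first line here\nsecond line\n\nlast"
flat = words_from(io.StringIO(sample))

# nested-loop version, as in the question's update, for comparison
looped = [word for line in io.StringIO(sample) for word in line.split()]

print(flat)
print(flat == looped)
```

Note that `read()` loads the entire file into memory at once, which may matter for a very large file.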
Yoav Abadi
0

Example to get all unique words from your file, in no particular order:

# create demo file
with open("FILE.txt", "w", encoding='ascii') as f:
    f.write("Some data with newlines\n And duplicate data words with no sense\n" +
            "in it also newlines and \nmore stuff\nto parse and with Some data in it\n" + 
            "Done.")

# read demo file and write other file
with open ('FILE.txt', 'r', encoding ='ascii', errors = 'ignore') as infile,\
     open ('FILE2.txt', 'w', encoding ='ascii', errors = 'ignore') as outfile:

    data = set( ( w for line in infile for w in line.split()) )

    # write single words from set
    for word in data:
        outfile.write(word+"\n")

    # write set as list-repr()    
    outfile.write("\n"+str(list(data)))

with open("FILE2.txt") as f:
    print(f.read())

Output (the order can differ between runs, because a set is unordered):

sense
it
stuff
words
in
data
Some
And
no
also
to
Done.
more
with
duplicate
parse
and
newlines

['sense', 'it', 'stuff', 'words', 'in', 'data', 'Some', 'And', 'no', 'also', 'to', 'Done.', 'more', 'with', 'duplicate', 'parse', 'and', 'newlines']
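
If the original word order turns out to matter, one option (an addition, not part of the answer above) is `dict.fromkeys`, which drops duplicates but keeps first-seen order on Python 3.7 and later. The sample text and the helper name `unique_words` are illustrative assumptions.

```python
import io

def unique_words(lines):
    """Return each word once, in order of first appearance."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return list(dict.fromkeys(w for line in lines for w in line.split()))

sample = io.StringIO("Some data with newlines\nmore data with Some stuff\n")
result = unique_words(sample)
print(result)
```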
Patrick Artner
0
from nltk.tokenize import word_tokenize

# read the whole file, then let NLTK split the text into tokens
with open('xyz.txt', 'rt') as test_text_file:
    list_sentence = word_tokenize(test_text_file.read())
print(list_sentence)

This will give you a list of words. Note that word_tokenize also returns punctuation marks as separate tokens.

0

Just flatten your list before you write it. Is there a stipulation against that?

# collect the per-line word lists, then flatten them into one list before writing
lines = [line.split() for line in infile]
mylist = [word for words in lines for word in words]
outfile.write(str(mylist))
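
Another way to flatten, offered here as a sketch and not as the answerer's own method, is `itertools.chain.from_iterable` from the standard library. The helper name `flatten_words` and the sample text are hypothetical, and `io.StringIO` objects stand in for the real input and output files.

```python
import io
from itertools import chain

def flatten_words(lines):
    """Flatten the per-line word lists into a single list."""
    # chain.from_iterable lazily concatenates each line's split() result
    return list(chain.from_iterable(line.split() for line in lines))

infile = io.StringIO("word1 word2\nword3 word4 word5\n")
outfile = io.StringIO()

flat = flatten_words(infile)
outfile.write(str(flat))   # written once, after the whole input is read
print(outfile.getvalue())
```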
vash_the_stampede