1

Instead of defining documents like this ...

documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]

... I want to read the same three sentences from two different txt files with the first sentence in the first file, and sentence 2 and 3 in the second file.

I have come up with this code:

# read txt documents
import glob
import os

os.chdir('text_data')
documents = []
for file in glob.glob("*.txt"): # read all txt files in working directory
    with open(file, "r") as file_content: # close each file after reading
        lines = file_content.read().splitlines()
    for line in lines:
        documents.append(line)

But the documents resulting from the two strategies seem to be in different format. I want the second strategy to produce the same output as the first.

Raymond Hettinger
Simon Lindgren
  • 1
    ... what is wrong? Please try to be specific with your problem statements. – juanpa.arrivillaga Mar 25 '17 at 23:38
  • Edited for clarity. – Simon Lindgren Mar 25 '17 at 23:43
  • 1
    My point was that instead of writing "the `documents` resulting form the two strategies seem to be in different format" you should instead *show the output* – juanpa.arrivillaga Mar 25 '17 at 23:45
  • 1
    Also, doing this: `lines = file_content.read().splitlines()` is not necessary. You can iterate directly over the file handler, and it iterates over lines. So just `for line in file_content:` would be sufficient (although you'll get the trailing newlines). Likely, you just want `documents.append(file_content.read())` And you don't have to iterate over the file at all... – juanpa.arrivillaga Mar 25 '17 at 23:48
  • 1
    Possible duplicate of [combine multiple text files into one text file using python](http://stackoverflow.com/questions/17749058/combine-multiple-text-files-into-one-text-file-using-python) – OneCricketeer Mar 26 '17 at 00:35

3 Answers

1

If I understand your code correctly, this is equivalent (except that each line keeps its trailing newline) and more performant (it avoids reading the entire file into a string and then splitting it into a list).

os.chdir('text_data')
documents = []
for file in glob.glob("*.txt"): # read all txt files in working directory
    documents.extend( line for line in open(file) )

Or maybe even one line.

documents = [ line for file in glob.glob("*.txt") for line in open(file) ]
OneCricketeer
0

Instead of .read().splitlines(), you can use .readlines(). This returns the file's lines as a list, with each line keeping its trailing newline.
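
To make the difference between the two calls concrete, here is a minimal sketch. The temporary file and its name (sample.txt) are assumptions made up for illustration; they are not from the question.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write("machine learning can be useful sometimes\n"
                "new york mayor was present\n")

    # readlines() keeps the trailing newline on every line.
    with open(path) as f:
        with_newlines = f.readlines()

    # read().splitlines() drops the line endings.
    with open(path) as f:
        without_newlines = f.read().splitlines()

# Stripping the newline makes the readlines() result match splitlines().
stripped = [line.rstrip("\n") for line in with_newlines]
```

If the list should look like the hand-written one in the question, the stripped variant (or splitlines()) is probably the one you want.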

Darkstarone
K.Land_bioinfo
  • I am new to stack overflow, @juanpa.arrivillaga. What I meant was that the contents of the list that .readlines() creates could be further appended to documents, but I see that your most recent comment answered what I was trying to explain. Thank you. – K.Land_bioinfo Mar 26 '17 at 00:00
0

... I want to read the same three sentences from two different txt files with the first sentence in the first file, and sentence 2 and 3 in the second file.

Translating the requirements directly gives:

with open('somefile1.txt') as f1:
    lines_file1 = f1.readlines()
with open('somefile2.txt') as f2:
    lines_file2 = f2.readlines()
documents = lines_file1[0:1] + lines_file2[1:3]

FWIW, given the kind of work you're doing, the [fileinput module][1] may be helpful.
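
As a rough sketch of how fileinput could be used here: the file names (a.txt, b.txt) and the temporary directory are assumptions for illustration only, and the files are sorted so the order is predictable. This is one possible approach, not the only way to do it.

```python
import fileinput
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files mirroring the question: one sentence in the
    # first file, two sentences in the second.
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("the mayor of new york was there\n")
    with open(os.path.join(tmp, "b.txt"), "w") as f:
        f.write("machine learning can be useful sometimes\n"
                "new york mayor was present\n")

    # glob does not guarantee an order, so sort the paths explicitly.
    paths = sorted(glob.glob(os.path.join(tmp, "*.txt")))

    # fileinput.input() chains the lines of all files into one iterator.
    with fileinput.input(files=paths) as lines:
        documents = [line.rstrip("\n") for line in lines]

print(documents)
```

The resulting list holds one plain string per line, in the same shape as the hand-written documents list.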

Hope this gets you back in business :-)

Raymond Hettinger