Merge several .txt files with multiple lines into one CSV file (1 line = 1 document) for topic modeling

Question

I have 30 text files so far, each with multiple lines. I want to apply an LDA model based on this tutorial. So, for me, the input should look like this:

text of document1
text of document2
text of document3 
.....
text of document30

But the whole text of a specific document has to be on one line.

I tried this post and for some reason it keeps saying: csv_output.writerow(row[1] for row in csv_text) IndexError: list index out of range. Any thoughts? I named the documents in the same way and edited the range, of course.

Basically, I don't care whether we solve this problem with Python or not. My nerves are shot, so I really appreciate any help.

score 0 · Answer 1 · answered Jun 03 '20 at 15:05

Loop over the files, 1 to 31 (the last value is skipped by the range() function):

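import csv
import os

# Write one CSV row per input file.
with open("lda_datafile.csv", "w", newline="") as wf:
    csv_output = csv.writer(wf)
    for x in range(1, 31):  # file1.txt ... file30.txt
        filepath = os.path.normpath(r"C:\Text\file{}.txt".format(x))
        with open(filepath, "r", newline="") as rf:
            # Split each line on ":" and keep the second field.
            # Note: row[1] assumes every line contains a ":".
            csv_text = csv.reader(rf, delimiter=":", skipinitialspace=True)
            csv_output.writerow(row[1] for row in csv_text)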

Gustav Rasmussen

Thanks for that quick answer! Sadly, I still get "csv_output.writerow(row[1] for row in csv_text) IndexError: list index out of range" – tomankid Jun 03 '20 at 15:24
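
A note on the IndexError: with delimiter=":", csv.reader returns a one-element list for any line that contains no colon and an empty list for a blank line, so row[1] fails on ordinary prose. Below is a minimal sketch of an alternative, not the answerer's method. It assumes the goal is one CSV row per file with all whitespace (including newlines) collapsed to single spaces. The function name merge_to_csv and the UTF-8 default are assumptions made for illustration.

```python
import csv
from pathlib import Path


def merge_to_csv(txt_paths, csv_path, encoding="utf-8"):
    """Write one CSV row per input file, with the file's text on one line."""
    with open(csv_path, "w", newline="", encoding=encoding) as wf:
        writer = csv.writer(wf)
        for path in txt_paths:
            text = Path(path).read_text(encoding=encoding)
            # str.split() with no arguments splits on any run of whitespace,
            # including newlines, so joining with " " flattens the document.
            writer.writerow([" ".join(text.split())])


# Hypothetical usage, reusing the paths from the answer above:
# paths = [r"C:\Text\file{}.txt".format(x) for x in range(1, 31)]
# merge_to_csv(paths, "lda_datafile.csv")
```

Because each row is written through csv.writer, any commas or quotes in the text are escaped, so every document stays in a single field.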

score 0 · Accepted Answer · answered Jun 08 '20 at 10:03

I'm not exactly sure what you are trying to accomplish, but to remove the newlines from the text files and make one big text file with the results, something along the lines of the following should work:

for i in *.txt; do NEW=$(tr '\n' ' ' < "$i"); echo "$NEW" >> output.txt; done

Bart Barnard

I have a text file with e.g. 2403 lines which I want to convert into a single-line text file. This should be repeated for the other 30 text files. So, 1 line = 1 document. Finally, they should be merged into one csv file with 30 lines (30 documents) – tomankid Jun 08 '20 at 11:07
Well, that is basically what the line above does. It replaces all the newlines in a text file and places that one line at the end of the file `output.txt`. If you have thirty files, this line will create a file with thirty lines, each line containing the content of the corresponding input file. – Bart Barnard Jun 08 '20 at 11:14
Thanks! I tried it with 3 test documents and it works! But for my real documents it says: "tr: Illegal byte sequence". The reason is simple: I converted the PDFs to txt files with Mac's Automator program. It seems to use an awkward encoding. Do you have a clue how to fix it? I will mark this answer as solved anyway since it does what I want. – tomankid Jun 08 '20 at 12:24
That's a whole different question as you yourself already kind of point out. You should make sure that you use the same encoding in all instances (so both Mac Automator and your terminal, probably utf-8). – Bart Barnard Jun 08 '20 at 20:18
I managed to convert the UTF-16 txt files to UTF-8 ones with this Automator workflow. [http://automatorworld.com/archives/convert-text/] – tomankid Jun 09 '20 at 17:24
The first time I had serious encoding problems was about thirty years ago (something between Mac and DOS, at the time). You would expect the industry to have solved these issues by now, but I still encounter them way too often... – Bart Barnard Jun 12 '20 at 06:21
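
For the UTF-16 to UTF-8 conversion mentioned in the comments above, the same step can also be done in Python. This is a rough sketch under assumptions, not the Automator workflow the asker used: it assumes the source files really are UTF-16 with a byte order mark at the start, and the function name convert_to_utf8 is made up for illustration.

```python
from pathlib import Path


def convert_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Re-encode a text file as UTF-8, leaving the source file untouched."""
    # The "utf-16" codec uses the byte order mark to pick the byte order
    # and drops the mark from the decoded text.
    text = Path(src_path).read_text(encoding=src_encoding)
    Path(dst_path).write_text(text, encoding="utf-8")
```

If a file is not valid in the assumed source encoding, read_text raises UnicodeDecodeError (or UnicodeError) instead of silently producing garbage, which makes a wrong guess easy to spot.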
