Disclaimer: I am a Python novice and would very much appreciate detailed answers.
Update: Removed non-relevant code.
Update: The problem was Excel's limit on the number of characters per cell. I provided my own solution based on a proposed solution below.
I want to merge multiple .txt-files into a single .csv-file by row. Here is some replication data.
The attempted output file is data_replication.csv. As you can see, only two of the five .txt-files were successfully integrated into the .csv-file. You'll also find the input files there in .pdf form. They are unstructured, random papers I found on Google Scholar.
The function I used for the conversion from .pdf to .txt was proposed by hkr for the similar question 'Convert a PDF files to TXT files'.
The function I am using for the merge was proposed by Bill Bell in 'Combine a folder of text files into a CSV with each content in a cell':
import csv
import os
from pathlib import Path

def txt_to_csv(x):
    # Work inside the folder that holds the .txt files
    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)
    # newline='' lets the csv module control line endings itself
    with open(x + '.csv', 'w', encoding='Latin-1', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            lines = []
            with open(str(fileName.absolute()), 'rb') as one_text:
                for line in one_text.readlines():
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            # One row per file: the file name and the whole text in a single cell
            csv_out.writerow([str(fileName), ' '.join(lines)])

txt_to_csv('data_replication')
I'm guessing that the data type might be the problem here, and I would appreciate any attempt to help me.
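For reference, here is a minimal sketch of one way to work around the Excel limit mentioned in the update at the top. It assumes the limit in question is Excel's documented maximum of 32,767 characters per cell, and it spreads each file's text over as many cells as needed instead of putting everything into one. The names `split_for_excel`, `txt_to_csv_chunked`, and the `Content_1`, `Content_2`, ... column headers are hypothetical and not taken from any of the linked answers, so treat this as an illustration rather than the definitive fix.

```python
import csv
from pathlib import Path

# Assumption: Excel's documented maximum number of characters per cell.
EXCEL_CELL_LIMIT = 32_767


def split_for_excel(text, limit=EXCEL_CELL_LIMIT):
    """Split text into consecutive chunks of at most `limit` characters."""
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)]
    return chunks or ['']  # keep one (empty) cell for empty files


def txt_to_csv_chunked(folder, out_name):
    """Write one CSV row per .txt file, with the text spread over several cells."""
    folder = Path(folder)
    rows = []
    for txt_path in sorted(folder.glob('*.txt')):
        raw = txt_path.read_text(encoding='Latin-1', errors='ignore')
        text = ' '.join(raw.split())  # collapse line breaks and repeated whitespace
        rows.append([txt_path.name] + split_for_excel(text))

    # The header needs as many content columns as the longest row has chunks.
    n_chunks = max((len(row) - 1 for row in rows), default=1)
    header = ['FileName'] + [f'Content_{i}' for i in range(1, n_chunks + 1)]

    with open(folder / out_name, 'w', encoding='Latin-1', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(header)
        csv_out.writerows(rows)
```

It would be called as, for example, `txt_to_csv_chunked('/content/drive/MyDrive/ThesisAllocationSystem/data_replication', 'data_replication.csv')`. Note that the CSV file itself has no such limit; the splitting only matters when the file is opened in Excel.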