
Disclaimer: I am a Python novice and would very much appreciate detailed answers.

Update: Removed irrelevant code.

Update: The problem was Excel's limit of 32,767 characters per cell. I provided my own solution, based on a proposed answer, below.

I want to merge multiple .txt files into a single .csv file, with one file per row. Here is some replication data. The attempted output file is data_replication.csv. As you can see, only two of the five .txt files were successfully integrated into the .csv file. There, you'll also be able to find the input files in .pdf form. They're unstructured, random papers I found on Google Scholar.

The function I was using was proposed by Bill Bell in 'Combine a folder of text files into a CSV with each content in a cell'.

The function I used for the conversion from .pdf to .txt was proposed by hkr in the similar question 'Convert a PDF files to TXT files':

import os
import csv
from pathlib import Path

def txt_to_csv(x):

    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)

    with open(x + '.csv', 'w', encoding='Latin-1') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            lines = []
            with open(str(fileName.absolute()), 'rb') as one_text:
                for line in one_text.readlines():
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            csv_out.writerow([str(fileName), ' '.join(lines)])

txt_to_csv('data_replication')

I'm guessing that the data type might be the problem here, and I'd appreciate any help.
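In case it helps with diagnosis, here is a small sketch (the helper name `char_counts` is my own) that reports each .txt file's decoded character count; Excel displays at most 32,767 characters per cell, so longer files can appear broken in the output:

```python
from pathlib import Path

EXCEL_CELL_LIMIT = 32767  # maximum characters Excel displays per cell

def char_counts(folder='.'):
    """Return {file name: decoded character count} for every .txt file."""
    return {str(p): len(p.read_bytes().decode('Latin-1', errors='ignore'))
            for p in Path(folder).glob('*.txt')}

for name, count in char_counts().items():
    flag = ' (over Excel limit)' if count > EXCEL_CELL_LIMIT else ''
    print(name, count, 'characters' + flag)
```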

  • Data type is not the problem. `.csv` files deal only in strings, and the module calls `str()` on every value passed to it before writing it. But without a sample of your input data it's difficult to say what the problem really is. For example, you would get the result you report if 3 of the files were empty or contained binary data. – BoarGules Mar 23 '21 at 20:58
  • There's a link to replication/sample data in the post. Let me know if it is sufficient and/or the link does not work. – blue-create Mar 24 '21 at 09:00
  • The link works and I see output, but no input. And the likeliest source of trouble is the input, since the code works as you expect 2 times out of 5. – BoarGules Mar 24 '21 at 09:04
  • Thanks for the note. I've updated the post with the original function used to convert source .pdf-files to input .txt-files for the single .csv output file. The linked folder contains both .pdf- and .txt-files as well as the .csv-file. – blue-create Mar 24 '21 at 09:19

2 Answers


You can use pandas for this:

from glob import glob
import pandas as pd

files = glob('/content/drive/MyDrive/ThesisAllocationSystem/*.txt') # create list of text files
data = [[i, open(i, 'rb').read()] for i in files] # create a list of lists with file names and texts
df = pd.DataFrame(data, columns=['FileName', 'Content']) # load the data in a pandas dataframe
df.to_csv('data_replication.csv') # save to csv
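Note that opening the files in `'rb'` mode makes each `Content` cell a bytes object, which `to_csv` writes with a `b'...'` prefix. A minimal text-mode variant, assuming the Latin-1 encoding from the question, that also closes each file:

```python
from glob import glob
import pandas as pd

files = glob('/content/drive/MyDrive/ThesisAllocationSystem/*.txt')

data = []
for path in files:
    # decode as Latin-1 and drop undecodable bytes, as in the question's code
    with open(path, encoding='Latin-1', errors='ignore') as f:
        data.append([path, f.read()])

df = pd.DataFrame(data, columns=['FileName', 'Content'])
df.to_csv('data_replication.csv', index=False)
```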
RJ Adriaansen

Using RJ Adriaansen's proposed function as a blueprint, I created the following function for people facing the same constraint: Excel's hard limit of 32,767 characters per cell.

One approach would be to forgo the documents with more than 32,767 characters of content. However, that would have led to considerable data loss in my case.

Instead, I sliced each document down to exactly 32,767 characters.

from glob import glob
import pandas as pd

def txt_to_csv(input_dir, output_dir, new_filename): 
  
  files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
  data = [[i, open(i, 'rb').read()] for i in files]
  df = pd.DataFrame(data, columns = ['FileName', 'Content'])
  df['Content'] = df['Content'].str.slice(start = 0, stop = 32767) # Excel's upper limit of characters per cell
  df.to_csv(output_dir + '/' + new_filename + '.csv', index = False)

txt_to_csv('data_replication', 'data_replication', 'trial')
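The slicing step can be checked in isolation. A small self-contained example with made-up sample strings (not the replication data):

```python
import pandas as pd

EXCEL_CELL_LIMIT = 32767  # Excel's maximum number of characters per cell

# Illustrative sample: one short document and one that exceeds the limit
df = pd.DataFrame({'Content': ['short text', 'x' * 40000]})
df['Content'] = df['Content'].str.slice(start=0, stop=EXCEL_CELL_LIMIT)

print(df['Content'].str.len().tolist())  # [10, 32767]
```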