2

I have a folder containing several thousand .txt files. I'd like to combine them in a big .csv according to the following model:

(Image: the model of the desired .csv layout)

I found an R script that is supposed to do the job (https://gist.github.com/benmarwick/9265414), but it displays this error:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : duplicate 'row.names' are not allowed 

I don't understand what my mistake is.

No matter; I'm pretty sure there's a way to do that without R. If you know a very elegant and simple one, it would be appreciated (and useful for a lot of people like me).

CLARIFICATION: the text files are in French, so not ASCII. Here is a sample: https://www.dropbox.com/s/rj4df94hqisod5z/Texts.zip?dl=0

Ettore Rizza
  • If you're pretty familiar with Python, then it shouldn't be too hard to write a Python script using `os.walk` from the `os` [module](https://docs.python.org/2/library/os.html?highlight=os.walk#os.walk) to look through the contents of the directory, and the `csv` [module](https://docs.python.org/2/library/csv.html) to create the csv. – Nathaniel Verhaaren Jan 28 '17 at 18:24
  • Of course, there is certainly a cool solution in Python. I can think about it, but it would take me hours (I'm not skilled enough) and I'm afraid of reinventing the wheel. This is a problem that many people have certainly encountered. Strangely, I can't find a ready-made solution on Google. :/ – Ettore Rizza Jan 28 '17 at 18:34
  • Do you want the lines of the text files to be simply concatenated without their newline characters? – Bill Bell Jan 28 '17 at 18:52
  • Newlines are useful information, but not essential for me. – Ettore Rizza Jan 28 '17 at 18:59
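A minimal sketch of the `os.walk` + `csv` approach suggested in the comments (the function name and column names here are made up, and the encoding handling is only a guess given the Unicode issues discussed in this thread):

```python
import csv
import os

def combine_txt_to_csv(root_dir, out_path):
    """Walk root_dir recursively and write one CSV row per .txt file."""
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['FileName', 'Content'])
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if name.endswith('.txt'):
                    # errors='replace' keeps going on undecodable bytes
                    with open(os.path.join(dirpath, name),
                              encoding='utf-8', errors='replace') as f:
                        writer.writerow([name, f.read()])
```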

3 Answers

5

The following Python script works for me (where `path_of_directory` is replaced by the path of the directory your files are in and `output_file.csv` is the path of the file you want to create/overwrite):

#! /usr/bin/python

import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'

with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])

    for filename in os.listdir(dirpath):
        # One row per file: its name and its full contents.
        with open(os.path.join(dirpath, filename)) as afile:
            csvout.writerow([filename, afile.read()])

Note that this assumes everything in the directory is a file.
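If the directory might also contain subfolders, one way to keep that assumption safe is to filter the listing with `os.path.isfile` (a sketch, with a made-up helper name):

```python
import os

def list_files_only(dirpath):
    """Names of regular files in dirpath, skipping subdirectories."""
    return [name for name in os.listdir(dirpath)
            if os.path.isfile(os.path.join(dirpath, name))]
```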

  • Usual unicode error (the text is in french...), but i 'm going to retry : SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (, line 7) File "", line 7 dirpath = 'C:\Users\ettor\Desktop\Nouveau dossier' ^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape – Ettore Rizza Jan 28 '17 at 19:14
  • If you put an r in front of `'C:\Users\ettor\Desktop\Nouveau dossier'` (so that it becomes `r'C:\Users\ettor\Desktop\Nouveau dossier'`), that should solve this problem (see http://stackoverflow.com/questions/1347791/unicode-error-unicodeescape-codec-cant-decode-bytes-cannot-open-text-file). If your files are not all ASCII (i.e. contain Unicode), I don't know if that will be a problem or not. – Nathaniel Verhaaren Jan 28 '17 at 19:31
  • Doesn't work even with the short sample I've posted: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1792: character maps to <undefined>. I find these Unicode stories incredible. A child could copy and paste the contents of each text file into a cell in a CSV, but automating the thing is a headache. – Ettore Rizza Jan 28 '17 at 21:02
  • Well, at least it seems to be failing on actual Unicode now, not interpreting your file path as containing Unicode because of an escape character. I've written scripts going through a large number of files and putting parts of their contents into a different form (XML, JSON, CSV), and Unicode was always a headache. I'll try to remember how I solved it. – Nathaniel Verhaaren Jan 28 '17 at 22:14
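The raw-string fix from the comments can be verified in isolation: Python tries to interpret `\U` in an ordinary string literal as the start of a `\UXXXXXXXX` escape, so the literal fails to compile at all, while the `r''` prefix leaves the backslashes alone (a toy demonstration using the path from the comments):

```python
# '\U' in a plain string literal starts a \UXXXXXXXX escape, so this
# literal does not even compile:
try:
    compile(r"path = 'C:\Users\ettor'", '<demo>', 'exec')
except SyntaxError as exc:
    print('plain literal fails:', exc.msg)

# A raw string leaves the backslashes untouched:
path = r'C:\Users\ettor'
print(path)  # C:\Users\ettor
```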
4

This can be written slightly more compactly using pathlib.

>>> import os
>>> import csv
>>> os.chdir('c:/scratch/folder to process')
>>> from pathlib import Path
>>> with open('big.csv', 'w') as out_file:
...     csv_out = csv.writer(out_file)
...     csv_out.writerow(['FileName', 'Content'])
...     for fileName in Path('.').glob('*.txt'):
...         csv_out.writerow([str(fileName), open(str(fileName.absolute())).read().strip()])

The items yielded by this glob provide access to both the full pathname and the filename, hence no need for concatenations.
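For example (the file need not exist; the Path arithmetic here is purely lexical, and the names are made up):

```python
from pathlib import Path

p = Path('some_dir') / 'report.txt'
print(p.name)    # report.txt  (just the filename)
print(p.suffix)  # .txt
print(p.stem)    # report
# str(p) gives the full relative path, with the OS's separator
```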

EDIT: I've examined one of the text files and found that one of the characters that chokes processing looks like 'fi' but is actually the 'fi' ligature: those two letters fused into a single character. Given the likely practical use to which this csv will be put, I suggest the following processing, which ignores weird characters like that one. I strip out newlines because I suspect they make csv processing more complicated; that's a possible topic for another question.

import csv
from pathlib import Path

with open('big.csv', 'w', encoding='Latin-1') as out_file:
    csv_out = csv.writer(out_file)
    csv_out.writerow(['FileName', 'Content'])
    for fileName in Path('.').glob('*.txt'):
        lines = []
        # Read raw bytes, decode each line as Latin-1 (errors='ignore'
        # guards against undecodable bytes), and strip the newlines.
        with open(str(fileName.absolute()), 'rb') as one_text:
            for line in one_text.readlines():
                lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
        # One row per file: its name, then its content joined into one field.
        csv_out.writerow([str(fileName), ' '.join(lines)])
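A toy illustration of what `errors='ignore'` buys (the byte string below is made up, not taken from the actual files): strict UTF-8 decoding rejects a lone Latin-1 byte, `'ignore'` silently drops it, and decoding as Latin-1 keeps the accent.

```python
raw = b'La pr\xe9sente attestation'  # Latin-1 bytes for "présente"

try:
    raw.decode('utf-8')               # strict: fails on the lone 0xE9
except UnicodeDecodeError as exc:
    print('utf-8 strict:', exc.reason)

print(raw.decode('utf-8', errors='ignore'))  # La prsente attestation
print(raw.decode('latin-1'))                 # La présente attestation
```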
Bill Bell
  • It works with French. – Bill Bell Jan 28 '17 at 19:27
  • Thanks Bill, but I have another Unicode error: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape. Before voting for the best solution, maybe it would be wiser for me to clean the files first. I did not think it would be so difficult to copy French text into cells. – Ettore Rizza Jan 28 '17 at 19:33
  • Can you post that text file somewhere that's accessible to us? Or, preferably, if it's large, post the part that contains the bytes that generate the error condition. – Bill Bell Jan 28 '17 at 19:52
  • Of course. I've added a sample in the question itself. It's basically a collection of text extracted from pdf documents. – Ettore Rizza Jan 28 '17 at 20:03
  • Thanks, I'm looking at it. – Bill Bell Jan 28 '17 at 20:12
  • @EttoreRizza: Remedy for the bad characters added as an edit. – Bill Bell Jan 28 '17 at 22:48
  • Hi Bill. It looks like your second solution works with every file. Thank you very much! – Ettore Rizza Jan 29 '17 at 12:01
2

If your txt files are not in table format, you might be better off using readLines(). This is one way to do it in base R:

setwd("~/your/file/path/to/txt_files_dir")
txt_files <- list.files()
# Read each file into a character vector, one element per line
list_of_reads <- lapply(txt_files, readLines)
df_of_reads <- data.frame(file_name = txt_files, contents = do.call(rbind, list_of_reads))
write.csv(df_of_reads, "one_big_CSV.csv", row.names = FALSE)
Nate
  • There is a problem with the newly created data.frame: Error in is.data.frame(x) : object 'df_of_reads' not found. I would have to change my locale to get the error in English, but I think it is understandable: Error in data.frame(file_name = txt_files, contents = do.call(rbind, list_of_reads)) : arguments imply differing numbers of rows: 41, 40 In addition: Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 1) – Ettore Rizza Jan 28 '17 at 19:00
  • Tables? No, sorry, maybe I wasn't clear. All my .txt are plain text. Here is a sample of list_of_reads: [42] " '~1?'::\"~ .• ~\"" [43] " ~" [44] " La présente attestation ne vaut pasrelevé de note." – Ettore Rizza Jan 28 '17 at 19:09
  • those nested double quotes are going to cause you problems; maybe think about removing them with `gsub()` – Nate Jan 28 '17 at 19:13
  • if you can replace the inner double quotes with single quotes it will work, something like this `list("'~1?'::\'~ .• ~\'", " ~" )` – Nate Jan 28 '17 at 19:14
  • Unicode characters are a pain... I will try again tomorrow morning with a rested head, so as not to turn this question into a chat. Thank you for everything in any case! – Ettore Rizza Jan 28 '17 at 19:20
  • I think when encoding to csv, you double the double quotes inside any quotes, so it should be fine. Or am I missing something about this particular situation? – Nathaniel Verhaaren Jan 28 '17 at 19:22
  • it is a problem with interpreting the list structure in R, nested double quotes cause element confusion – Nate Jan 28 '17 at 19:25
  • Makes sense. I'm not familiar with R. – Nathaniel Verhaaren Jan 28 '17 at 19:32
  • Your solution is short and elegant, Nathan, but it produces a lot of columns. I'll look tomorrow for what's wrong. – Ettore Rizza Jan 28 '17 at 20:26
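On the quoting question from the comments: Python's csv module does double embedded double quotes automatically and reads them back intact, so CSV output itself is not the problem (a quick check with a made-up string):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['a.txt', 'elle a dit "bonjour"'])
print(buf.getvalue())  # a.txt,"elle a dit ""bonjour"""

# Reading it back recovers the original field:
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row[1])  # elle a dit "bonjour"
```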