
I have been attempting to combine a large number of CSV files (46 files, about 86 MB and roughly 1 million rows each) using the Windows command line:

copy *.csv output.csv

This has worked for me in the past, but with this dataset it is not working. After the copy completes, loading the combined file produces several "inconsistent number of columns detected" errors. They always occur at the same locations (for instance, row 12944). The affected rows appear to have their first couple of columns cut off, with the remaining data shifted left; the rows below them look unaffected. Strange.

The odd part is that if I combine only 3 or fewer files, there is no error at row 12944, nor does there appear to be any issue with the data when I inspect the individual files.

I also tried using a Python script to do a similar thing:

import glob
import pandas as pd

extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))

combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv('combined_csv.csv', index=False, encoding='utf-8-sig')

This was impossible to run with all 46 files because I run out of RAM, but even trying with 4 files gives errors similar to the command-line output.

It's almost as if combining more than 3 files triggers the issue, but I have never seen this error before. I am totally stumped. Any ideas?
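As an aside, the RAM problem with the pandas version can be worked around by reading each file in chunks and appending to the output file, so the whole dataset never sits in memory at once. A sketch (the function name and chunk size are my own; this addresses only the memory issue, not the column-shift error):

```python
import glob

import pandas as pd

def combine_csvs(pattern, output, chunksize=100_000):
    """Append every file matching `pattern` to `output`, one chunk of rows
    at a time, writing the header row only once."""
    first = True
    for file_name in sorted(glob.glob(pattern)):
        for chunk in pd.read_csv(file_name, chunksize=chunksize):
            # 'w' + header for the very first chunk, then append without headers.
            chunk.to_csv(output, mode='w' if first else 'a',
                         header=first, index=False)
            first = False

# e.g. combine_csvs('data_*.csv', 'combined_csv.csv')
```

Note that the output name should not match the input pattern, otherwise the output file gets picked up as an input on a second run.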

Edit: output when using more than 3 files

  • This question might help you with the python issue. That way you don't have to store the full df in memory, but can simply append your files to the "complete" file : https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file – amquack Jan 17 '21 at 22:26
  • I would advise that something like this would seem safer, `copy *.csv output.txt`, then `ren output.txt output.csv` than `copy *.csv output.csv`. – Compo Jan 17 '21 at 23:44
  • I've uploaded a screencap of the output so you can see what I am talking about. Again this only happens when I use more than 3 files. It doesn't make any sense to me. I had to sanitize the output but I think you can see what I mean. It randomly decides to cut off columns from the row and shifts the data over. – DrexelDragon93 Jan 18 '21 at 00:11
  • I tried copying to a .txt first and then renaming it to .csv, but the error persists. – DrexelDragon93 Jan 18 '21 at 00:12
  • Could be that one of the files has a byte order mark (BOM) at the start of the file. Simply combining files without removing the BOM when necessary could be the issue. Use `utf-8-sig` for reading the files in Python and it will remove it, assuming your files are encoded in UTF-8 to begin with... – Mark Tolonen Jan 18 '21 at 00:16
  • That did it. I used Notepad++ to remove the BOM from the files and used Orangutan's Python script from below and it worked. Thanks. – DrexelDragon93 Jan 18 '21 at 01:19
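For reference, the BOM fix from the comments can also be done in the merge script itself: opening each input with `encoding='utf-8-sig'` strips a leading UTF-8 BOM during decoding if one is present, and writing plain UTF-8 keeps BOMs out of the middle of the combined file. A sketch, assuming the inputs really are UTF-8 (the function name is my own; like the `copy` command, it does not de-duplicate header rows):

```python
import glob

def combine_without_bom(pattern, output):
    """Concatenate text files, stripping any UTF-8 BOM from each input."""
    with open(output, 'w', encoding='utf-8', newline='') as out_fp:
        for file_name in sorted(glob.glob(pattern)):
            # 'utf-8-sig' removes a leading BOM while decoding, if one exists.
            with open(file_name, 'r', encoding='utf-8-sig', newline='') as fp:
                data = fp.read()
                out_fp.write(data)
                # A missing trailing newline would glue two rows together.
                if data and not data.endswith('\n'):
                    out_fp.write('\n')
```

At ~86 MB per input this reads one whole file at a time, which should be comfortably within RAM even though the combined dataset is not.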

1 Answer


To reduce the amount of memory used, you could read a small chunk of the file at a time. This is what shutil.copyfileobj() does for you:

import glob
import shutil

with open('combined_csv.csv', 'wb') as output_fp:
    for file_name in glob.glob('*.csv'):
        if file_name == 'combined_csv.csv':
            continue  # the output matches '*.csv' too; don't copy it into itself
        with open(file_name, 'rb') as fp:
            # Streams the file in fixed-size chunks (64 KB by default),
            # so only one small buffer is ever held in memory.
            shutil.copyfileobj(fp, output_fp)

Depending on the files' line endings, it's possible that combining them this way causes corruption (for example, if a file lacks a trailing newline), in which case adding something like output_fp.write(b'\n') between files might help.
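That suggestion could look like the following sketch (the function name is my own): after copying each file, check its last byte and insert a newline only when one is missing, so the next file's first row cannot fuse onto the previous file's last row.

```python
import glob
import shutil

def combine_binary(pattern, output):
    """Binary-concatenate matching files, adding b'\\n' after any file
    that lacks a trailing newline."""
    with open(output, 'wb') as output_fp:
        for file_name in sorted(glob.glob(pattern)):
            with open(file_name, 'rb') as fp:
                shutil.copyfileobj(fp, output_fp)
                if fp.tell() > 0:      # copyfileobj left fp at end of file
                    fp.seek(-1, 2)     # back up to the last byte
                    if fp.read(1) != b'\n':
                        output_fp.write(b'\n')
```

Pick an output name the pattern does not match (e.g. combine_binary('*.csv', 'combined.txt')), otherwise the output can end up being copied into itself.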

Orangutan
  • That might help but the issue is even using 4 files to reduce memory as a test, it still gives me an error at that row in the output file. So it's not the size of the dataset – DrexelDragon93 Jan 17 '21 at 22:33
  • Could you try running this and give the specific error? I don't quite understand your description of the error – Orangutan Jan 17 '21 at 22:34
  • I've uploaded a screencap of the output so you can see what I am talking about. Again this only happens when I use more than 3 files. It doesn't make any sense to me. I had to sanitize the output but I think you can see what I mean. It randomly decides to cut off columns from the row and shifts the data over. – DrexelDragon93 Jan 18 '21 at 00:11
  • Ran your script and it still throws that same error at line 12944 with 4 or more files. With 3 files, no error. Dumbfounded. – DrexelDragon93 Jan 18 '21 at 00:15
  • Two things I can think of are the newline tip I gave, or that your input has issues (but I'm assuming you've checked this). The only thing common to all these methods is the inputs. Maybe there's an issue in the 4th file? It might not even be visible, for example mixed line endings – Orangutan Jan 18 '21 at 09:09