I have been attempting to combine a large number of CSV files (46 files, each about 86 MB and roughly 1 million rows) using the Windows command line with the command:
copy *.csv output.csv
This has worked for me in the past, but with this dataset it is not working. When I open the combined file, it produces several "inconsistent number of columns detected" errors, always at the same locations (for instance, row 12944). Looking at the affected rows, the first couple of columns appear to be cut off and the remaining data shifted left, which causes the error in that row but does not seem to affect the data below it. Strange.
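For anyone wanting to reproduce the check, the field counts around that row can be inspected with something like this (a quick sketch using the stdlib csv module; the row window is just the reported location):

import csv

# Print the number of fields per row in a window around the reported
# error location (row numbers are 1-based, header row included)
with open('output.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        if 12940 <= i <= 12950:
            print(i, len(row))
        elif i > 12950:
            break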
The issue is that if I combine, say, 3 or fewer files, there is no error at row 12944, nor does there appear to be anything wrong with that row when I inspect the individual files.
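Since copy just concatenates the raw bytes, one thing I have not ruled out is a source file that does not end with a newline: its last row would merge with the first row of the next file, producing a malformed row at a fixed position near each file boundary. A quick check for that (assuming all source files are in the current directory):

import glob
import os

# Flag any CSV whose last byte is not a newline; plain byte
# concatenation merges such a file's final row with the next file
for name in sorted(glob.glob('*.csv')):
    with open(name, 'rb') as f:
        f.seek(-1, os.SEEK_END)  # jump to the last byte
        if f.read(1) != b'\n':
            print(name, 'has no trailing newline')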
I also tried using a Python script to do a similar thing:
import glob
import pandas as pd

# Collect every CSV in the current directory
all_filenames = glob.glob('*.csv')

# Read each file and stack them; this holds every frame in memory at once
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

combined_csv.to_csv('combined_csv.csv', index=False, encoding='utf-8-sig')
This was impossible to run with all 46 files because I ran out of RAM, but even trying with 4 files gives errors similar to the command-line output.
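As a workaround for the memory problem, a streaming merge that writes the header once and then appends data rows line by line should work regardless of file count. A rough sketch, assuming every file shares the same header row (it also skips the output file in case the script is re-run):

import glob

# Keep the header from the first file only; append data rows from the
# rest without ever loading a whole file into memory
files = [n for n in sorted(glob.glob('*.csv')) if n != 'combined_csv.csv']

with open('combined_csv.csv', 'w', encoding='utf-8', newline='') as out:
    for idx, name in enumerate(files):
        with open(name, encoding='utf-8', newline='') as f:
            header = f.readline()
            if idx == 0:
                out.write(header)
            line = ''
            for line in f:
                out.write(line)
            if line and not line.endswith('\n'):
                out.write('\n')  # guard against a missing final newline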
It's almost as if handling more than 3 files causes an issue when combining them, but I have never seen this error before. I am totally stumped. Any ideas?