
I have a few CSV files with the same header.
To simplify my work, I merged the files so that I can load them into a single pd.DataFrame:

cat file1.csv > file_merged.csv
tail -n +2 file2.csv >> file_merged.csv

But during pd.read_csv I get an error:

    228         try:
    229             if self.low_memory:
--> 230                 chunks = self._reader.read_low_memory(nrows)
    231                 # destructive to chunks
    232                 data = _concatenate_chunks(chunks)

~/.local/lib/python3.10/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read_low_memory()

~/.local/lib/python3.10/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

~/.local/lib/python3.10/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

~/.local/lib/python3.10/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 4 fields in line 1391, saw 7

What's the problem? The files can be read separately and have the same header (I remembered to remove the repeated headers; see the example above).

maciejwww

1 Answer


Most likely one of your files (let's say file1.csv) does not end with a newline character. When you merge the files with the commands you provided, the content of file2.csv starts at the end of the last line of file1.csv, which results in one "merged" row with more columns than expected. You can fix this by ensuring that each CSV file ends with a newline character.

Illustrative example:

file1.csv (missing newline character at the end of the file):

column1,column2,column3
0,0,0
1,1,1

file2.csv:

column1,column2,column3
2,2,2
3,3,3

file_merged.csv:

column1,column2,column3
0,0,0
1,1,12,2,2
3,3,3
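The merge can also be done in a way that tolerates a missing trailing newline. The following is a minimal Python sketch, not the asker's method: the helper names `ends_with_newline` and `merge_csv_files` are made up for illustration, and it assumes the files fit in memory and that no quoted field contains an embedded line break.

```python
from pathlib import Path


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    data = Path(path).read_bytes()
    return not data or data.endswith(b"\n")


def merge_csv_files(paths, merged_path):
    """Concatenate CSV files that share a header, keeping only the first header.

    A newline is appended after any file that lacks a trailing one, so the
    first row of the next file cannot be glued onto the previous row.
    """
    with open(merged_path, "wb") as out:
        for index, path in enumerate(paths):
            lines = Path(path).read_bytes().splitlines(keepends=True)
            if index > 0:
                lines = lines[1:]  # drop the repeated header
            for line in lines:
                out.write(line)
            # Repair a missing final newline before the next file is appended.
            if lines and not lines[-1].endswith(b"\n"):
                out.write(b"\n")
```

Calling `ends_with_newline("file1.csv")` first shows which input file is affected; `merge_csv_files(["file1.csv", "file2.csv"], "file_merged.csv")` then writes a merged file in which every row sits on its own line.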

This answer explains well why all text files should end with a newline.
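To find the glued row without scrolling through the file in a terminal, the merged file can be scanned for rows whose field count differs from the header. This is a rough sketch using Python's standard `csv` module; `find_bad_rows` is a hypothetical helper name, blank lines are skipped, and the line numbers it reports may not match the one in the pandas error message exactly.

```python
import csv


def find_bad_rows(path, limit=10):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    bad = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return bad  # empty file, nothing to check
        expected = len(header)
        for row in reader:
            # Skip blank lines; flag rows with too many or too few fields.
            if row and len(row) != expected:
                bad.append((reader.line_num, len(row)))
                if len(bad) >= limit:
                    break
    return bad
```

For the file_merged.csv shown above, this reports the third line as having five fields instead of three.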

MarGenDo
  • It's not a problem - I've checked it by hand with `head -n`. – maciejwww Aug 02 '23 at 19:38
  • The error message suggests an error in line 1391 of the csv file containing 7 columns instead of 4, which could be the result of the missing newline character. – MarGenDo Aug 02 '23 at 19:45
  • Also, the head command is irrelevant in this case, the problem is at the end of the files, not at the beginning. – MarGenDo Aug 02 '23 at 19:47
  • Ye, I'll make sure again. – maciejwww Aug 02 '23 at 20:11
  • I checked out the problem - the last line of 1st file and the 2nd line of the 2nd file are separate. As I said I quickly used `head -n N` command where N=1391, 1392. – maciejwww Aug 02 '23 at 20:43
  • it would probably be useful to add to the question an example of say lines 1385 to 1400, with your best guess as to what line 1391 is. It seems to have problems tokenizing a line in that part of the file ... so what are those lines? Also (just for kicks and giggles) I would check if the newlines are consistent between what the files have already vs when you append using these command line operations (not likely to be a problem ... but who knows if CR vs CRLF could be a problem). – topsail Aug 02 '23 at 20:53
  • @MarGenDo I've checked the file once again with a text editor, as BigBen suggested, and you were right - there was no newline character. It also seems my terminal window was sized so perfectly that it wrapped the lines exactly at the expected end of the line. :| As usual, the simplest reason is the reason. And I need more rest. Thank you! – maciejwww Aug 02 '23 at 23:34