0

Hi, I have a 3 GB tab-delimited txt file and want to convert it to CSV, but reading it fails with the following error, even with error_bad_lines / on_bad_lines set:

ParserError: '  ' expected after '"'

Code I am using

df1 = df.read_csv("path\\logs.txt", delimiter = "\t", encoding = 'cp437',engine="python")
df1.to_csv("C:\\Data\\log1.csv",quotechar='"',error_bad_lines=False, header=None, on_bad_lines='skip')
DYZ
  • Don’t post the original file, but debug down to a few lines that caused the problem and post those – Mark Tolonen Nov 26 '22 at 07:16
  • You may need to pre-process the file - as described here: https://stackoverflow.com/questions/55010807/pandas-errors-parsererror-expected-after – ScottC Nov 26 '22 at 07:31
  • Would it be possible to provide maybe the first 5 rows of the csv file - to see what you are dealing with ? – ScottC Nov 26 '22 at 07:39
  • Hi Scott, I would love to share the complete file, but as I am new here I cannot paste it: it is too big, and a single row alone consists of 2000 characters. Is there any way to send it to you? – Prashant Sharma Nov 26 '22 at 07:46
  • Use the option `nrows=...` incrementally to find out which row causes the problem. Inspect or post that row. – DYZ Nov 26 '22 at 07:47
  • @DYZ, thanks. Can you please tell me the command? Is it like nrows=2000? The file is very large, with 3-4 million rows. – Prashant Sharma Nov 26 '22 at 07:49
  • So what? Keep searching until you find the offending row. That's the way. – DYZ Nov 26 '22 at 07:56
  • Can you please tell me what this means: ParserError: ' ' expected after '"'? Does it mean I have a space in some column right after a double quote? – Prashant Sharma Nov 26 '22 at 08:15
  • I have updated my answer to help you get this issue fixed. – ScottC Nov 26 '22 at 08:32
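
The search for the offending row suggested in the comments can also be sketched without pandas. Python's own csv module, when run with strict=True, raises an error of the same form as the one in the question, and it prints the delimiter between the first pair of quotes, so the blank in ' ' expected after '"' is most likely the tab delimiter rather than a space. The helper name find_bad_lines and the sample rows below are hypothetical, and the sketch assumes one record per line (no line breaks inside quoted fields).

```python
import csv


def find_bad_lines(lines, delimiter="\t", quotechar='"', limit=5):
    """Return up to `limit` (line_number, error_message) pairs for rejected lines."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            # strict=True makes the csv module raise on malformed quoting
            # instead of silently accepting it
            for _ in csv.reader([line], delimiter=delimiter,
                                quotechar=quotechar, strict=True):
                pass
        except csv.Error as exc:
            bad.append((number, str(exc)))
            if len(bad) >= limit:
                break
    return bad


# Hypothetical sample: the second row has stray quotes inside a field
sample = [
    '"a"\t"b"\n',
    '"The"\t"quick brown"" fox "jumps over the"\t"lazy dog"\n',
    '"c"\t"d"\n',
]
print(find_bad_lines(sample))
```

For the real file, an open file object can be passed instead of the list, for example open("path\\logs.txt", encoding="cp437", newline=""), so the 3 GB file is read one line at a time and the reported line numbers can then be inspected or posted.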

2 Answers

0

The following script removes unwanted quotation marks (' and ") from inside each tab-separated field, keeping only a quote at the start of a field and a double quote at the end of it. It then replaces each tab (\t) with a comma (,).

It uses a regular expression to locate the unwanted quotation marks.

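import re

# Match any quotation mark (' or ") except a quote at the very start of a
# field or a double quote at the very end of a field
pattern = re.compile(r"(?!^|\"$)[\"\']")

# Stream the input line by line so the large file is never held in memory.
# encoding="cp437" matches the read_csv call in the question; the output is
# written with the same encoding. Mode "w" avoids appending to an old result.
with open("path\\logs.txt", "r", encoding="cp437") as f, \
        open("C:\\Data\\log1.csv", "w", encoding="cp437") as new_file:
    for line in f:
        # Remove the unwanted quotation marks from each tab-separated field
        fields = [pattern.sub("", field) for field in line.split("\t")]

        # Join the cleaned fields with commas and write the line to the new file
        new_file.write(",".join(fields))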

You are seeing this error because at least one record contains an unwanted quotation mark inside a field. For example:

"The"\t"quick brown"" fox "jumps over the"\t"lazy dog"
ScottC
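
What the regex from the answer does to its example record can be checked with a small sketch. The helper name clean_line is hypothetical; the pattern and the example line are taken from the answer as posted.

```python
import re

# Pattern from the answer: any ' or " that is not at the start of the field
# and is not a double quote at the end of the field
pattern = re.compile(r"(?!^|\"$)[\"\']")


def clean_line(line):
    """Strip stray quotes from each tab-separated field and join with commas."""
    fields = line.rstrip("\n").split("\t")
    return ",".join(pattern.sub("", field) for field in fields)


example = '"The"\t"quick brown"" fox "jumps over the"\t"lazy dog"'
print(clean_line(example))
# -> "The","quick brown fox jumps over the","lazy dog"
```

Note that the pattern also removes apostrophes inside a field, so a value such as "don't" becomes "dont". If apostrophes matter in the log text, the character class would need to be narrowed to the double quote only.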
  • Do you mean I have to use a single backslash ("\") instead of a double one? That doesn't work either. – Prashant Sharma Nov 26 '22 at 07:24
  • Thanks a lot @Scott for helping. I'm using the same code you suggested but still get the same error: ParserError: ' ' expected after '"'. I guess there is some problem with my input file, but it is so big that I cannot even open it manually. – Prashant Sharma Nov 26 '22 at 07:31
  • I have updated my answer @PrashantSharma - this should hopefully work now, – ScottC Nov 26 '22 at 08:46
0

Add `on_bad_lines='warn'` to your `read_csv` call. It looks like at least one line in the file is malformed.

  • Thanks Kallaghaan, but I tried both "warn" and "skip" and it still didn't work. It shows the same error again: ParserError: ' ' expected after '"' – Prashant Sharma Nov 26 '22 at 07:41
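
If warning on or skipping bad lines does not help, another option is to convert the file with Python's csv module and treat quotation marks as ordinary characters while reading. This is only a sketch under assumptions: the function name tsv_to_csv is hypothetical, one record per line is assumed, and one pair of enclosing quotes per field is dropped while any inner quotes are kept and escaped by the writer.

```python
import csv
import io


def tsv_to_csv(src, dst, delimiter="\t"):
    """Copy tab-delimited rows from src to dst as CSV; return the row count."""
    # QUOTE_NONE: a '"' in the input is plain text, so stray quotes
    # cannot trigger a parser error
    reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst)  # default quoting doubles any inner '"'
    rows = 0
    for row in reader:
        # Drop one pair of enclosing quotes if present; keep inner quotes
        cleaned = [
            field[1:-1]
            if len(field) >= 2 and field[0] == field[-1] == '"'
            else field
            for field in row
        ]
        writer.writerow(cleaned)
        rows += 1
    return rows


# In-memory demonstration with the example record from the first answer
src = io.StringIO('"The"\t"quick brown"" fox "jumps over the"\t"lazy dog"\n')
dst = io.StringIO()
tsv_to_csv(src, dst)
print(dst.getvalue())
```

For the real file, both arguments would be file objects opened with newline="" and a suitable encoding (cp437 in the question). The rows are streamed, so the 3 GB input is never loaded into memory at once.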