
I'm working on loading CSV files into a PostgreSQL database. In one of the files, every field is surrounded by quotes (it looks normal in Excel, but in Notepad a row reads "Firstname","Lastname","CellNumber","HomeNumber", etc. when it should read Firstname,Lastname,CellNumber,HomeNumber). The load breaks when I try to bring it into SQL.

I tried loading the file into Python to do some data cleaning, but I'm getting an error.

This is the code I'm running to load in the file in python:

import pandas as pd

logics = pd.read_csv("test.csv")

and this is the error I'm getting:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 28682: invalid continuation byte

I tried specifying UTF-8 explicitly, but that gave me a different error. Code:

import pandas as pd

logics = pd.read_csv("test.csv", encoding= 'utf-8')

error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 53, saw 14

For whatever reason, when I manually re-save the file as UTF-8 in File Explorer and then save it back as a CSV, the quotation marks disappear, but I need to automate this process. Is there any way I can use Python to remove these quotation marks? Is it just some different kind of encoding?
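For reference, a pandas round-trip seems to mimic that manual re-save; a minimal sketch (latin-1 is an assumption here: it makes the decode step succeed for any byte, though the accented characters only come out right if the file really is Latin-1/cp1252, which the 0xE1 byte suggests):

```python
import pandas as pd

# Hypothetical sample: a fully quoted CSV containing a Latin-1 accent
# (0xE1 is "é" in Latin-1, and is an invalid continuation byte in UTF-8).
with open("test.csv", "wb") as f:
    f.write(b'"Firstname","Lastname"\n"Jos\xe9","Smith"\n')

# latin-1 maps every byte to a character, so decoding cannot fail;
# pandas also strips the surrounding quotes while parsing.
logics = pd.read_csv("test.csv", encoding="latin-1")

# to_csv only quotes fields that need it, so the extra quotes disappear.
logics.to_csv("output.csv", index=False, encoding="utf-8")
```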

    Welcome to Stack Overflow. Please take the 2-minute [tour]. Moreover, open [Help] and read at least [ask]. Then, [edit] your question to improve your [mcve]. In particular, clarify your specific problem or provide additional details to highlight exactly what you need. As currently written, it's hard to guess what you're asking exactly. – JosefZ Jan 13 '23 at 16:16
  • Hi. I wouldn't worry about the double quotes: they're a normal part of the CSV encoding. True, they aren't required for the data you've shared, but they're also not wrong... you could just leave them. That said, if you could just open the CSV with pandas and save it again, that should take care of them as Pandas will probably not include extraneous quotes. But you've got this error `Expected 12 fields in line 53, saw 14`. So, open the CSV in Excel and scroll down line/row 53 and see if it's missing some values. If you can fix that line, great. If not... https://stackoverflow.com/q/27020216. – Zach Young Jan 15 '23 at 01:52
  • Please ask 1 specific researched non-duplicate question. Please either ask about 1 bad definition/query/function with the obligatory [mre], including why you think it should return something else at the 1st subexpression that it doesn't give what you expect, justified by reference to authoritative documentation, or ask about your overall goal giving working parts you can do & ideally a [mre]. But please ask about bad code 1st because misconceptions get in the way of your goal. And bad code doesn't tell us what you wish it would do. [ask] [Help] – philipxy Jan 18 '23 at 05:01

1 Answer


You can build on this, perhaps by pulling the per-line logic out into a function called "clean_line". The code below goes through your CSV and removes every " character from each line. There's no real need for the pandas overhead here; using the standard Python library should be faster as well.

with open("test.csv", 'r') as f:
    lines = f.readlines()

with open("output.csv", 'w') as f:
    # strip every double-quote character from each line
    output = [line.replace('"', '') for line in lines]
    f.writelines(output)
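If any field could contain an embedded comma, a quote-aware rewrite with the standard csv module is safer than a blanket replace: the reader respects quoting, and the writer's default QUOTE_MINIMAL mode only quotes fields that actually need it. A rough sketch (the latin-1 encoding is an assumption, based on the decode error in the question):

```python
import csv

# Hypothetical sample input where every field is quoted, including one
# field with an embedded comma that a blanket replace would corrupt.
with open("test.csv", "w", newline="", encoding="latin-1") as f:
    f.write('"Firstname","Lastname","Note"\n"Ana","Lee","likes a, b"\n')

# Parse with quoting rules applied, then write back with minimal quoting.
with open("test.csv", newline="", encoding="latin-1") as src, \
     open("output.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(csv.reader(src))
```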
  • Quotes are typically necessary in CSV to escape commas that are part of the data—not to be interpreted as delimiters. A file with unnecessary quotes isn’t a problem. Doing a wholesale replacement of quotes could corrupt the CSV encoding of the data. – Zach Young Jan 15 '23 at 01:43