Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol

Question

UPDATE: When initally posting the question, I did not realize that . is used in some entries as the digit grouping symbol. However, this information is crucial to the problem. Here is the original question:

I am currently trying to import a csv file in pandas that has a less than perfect format, as all the values including numbers are quoted. The format looks like this:

Date;Type;Amount;Currency;Category;Person;Account;Counter Account;Group;Note
"19.02.17";"Expenses";"-36,37";"EUR";"Groceries";"";"Bank account";"";"";""

Now, I have tried importing this using the following command:

import pandas

dtypes = {"Type":"string", "Amount": "float"}
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], dtype = dtypes, quoting = 3)

So I have basically been trying to tell pandas that the decimal separator is comma, that the field delimiter is semicolon, and that the column "Amount" should be parsed as floats. However, trying to parse the file, I still get the error message:

ValueError: could not convert string to float: '689,15'"

I assume the combination of the quotes and the comma decimal separator somehow is too much for pandas, even though I think I have technically provided it with all the information it needs.

The file is an export from a third-party program, so unfortunately I have no influence on the format. Does anyone know how to get pandas to swallow this?

Bonus question: If I read this file without providing explicit data types, I don't get any columns of type "string" as I would have expected, but instead "object" is used. Why is that?

Does this answer your question? [pandas data with double quote](https://stackoverflow.com/questions/51359010/pandas-data-with-double-quote) — Cubix48, Feb 26 '22 at 14:07
Thanks, but I think I would prefer a solution that works with just pandas straight away. I have also found a solution now, using the `converters` argument of the `read_csv` function (see my own answer). — bweber, Feb 27 '22 at 11:03

Cubix48 · Accepted Answer · 2022-02-27T14:23:48.357

4

What about that ?

import pandas

table = pandas.read_csv("data.csv", sep=";", decimal=",")

print(table["Amount"][0])  # -36.37
print(type(table["Amount"][0]))   # <class 'numpy.float64'>
print(table["Amount"][0] + 36.37)  # 0.0

Pandas automatically detects a number and converts it to numpy.float64.

Edit:

As @bweber discovered, some values in data.csv contained more than 3 digits, and used a digit grouping symbol '.'. In order to convert the String to Integer, the symbol used must be passed to the read_csv() method:

table = pandas.read_csv("data.csv", sep=";", decimal=",", thousands='.')

edited Feb 27 '22 at 14:23

answered Feb 27 '22 at 11:19

Cubix48

2,607
2
5
17

Ok, thanks! You helped me find the real problem. I guess the information in my question was not sufficient. Because if I just try it with the exemplary line that I posted there, it actually works with your code. But as soon as I try it with the whole csv file, it doesn't work anymore. That puzzled me. So, I did a little bit of investigation, and it turns out that the actual problem seems to be that there are some entries with more than three digits, and those use a digit grouping symbol ('.'). And that is what _actually_ causes the parsing problem. – bweber Feb 27 '22 at 14:04
With that figured out, passing `thousands = '.'` to `read_csv` solves the problem. I will update my question accordingly. If you may include that information in your answer I will gladly make it the accepted answer. – bweber Feb 27 '22 at 14:06
1

I updated my answer. Glad I could help find the real problem ! – Cubix48 Feb 27 '22 at 14:25
1

Thanks! I guess if pandas error message would have been better, I would have gotten there earlier, but the value it complains about doesn't even use the grouping symbol. – bweber Feb 27 '22 at 14:34

bweber · Answer 2 · 2022-02-27T14:12:46.043

UPDATE: While this answer still works, there is an easier solution. I initially did not realise that some of the entries use . as a digit grouping symbol, hence the confusion.

Ok, I found a way that works using the converters argument to pass a lambda that does the conversion. I wasn't able to find a simpler solution so far.

toFloat = lambda x: float(x.replace(".", "").replace(",", "."))
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], converters = {"Amount": toFloat})

The lambda toFloat takes a string, removes all occurrences of the . character (used as digit grouping symbol), replaces all occurrences of , by . and then converts the resulting string to float. This lambda is then passed as a converter for the column "Amount" to the read_csv function. Probably now also the decimal argument is expendable.

I am going to leave this question open for now to see if anyone can come up with a simpler solution.

Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol

2 Answers2

Edit: