
I am trying to convert a CSV into a required format where a text field contains currency data in the format "A$ 1,000.00".

I can replace the "A$ " using: df.Credit.str.replace('A$ ','',regex=False) and then later convert the remaining string value to a float by casting it at the time of use, but I forgot the thousands are comma separated.

When importing the CSV into the dataframe, I can use the thousands separator keyword, but because the column is imported as a string (due to the 'A$ '), the separator is never applied and the column is not given a numeric value.
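To illustrate with some contrived sample data (the inline string is just a stand-in for my file):

import io
import pandas as pd

t = 'Credit\n"A$ 1,000.00"\n"A$ 2,500.50"\n'
data = pd.read_csv(io.StringIO(t), thousands=',')
print(data.Credit.dtype)  # object: the 'A$ ' prefix keeps the column a string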

So I need to run the conversion of the comma AFTER importing it.

Is there a way I can do it all in the initial read of the CSV?

This is what I'd come up with so far, but it doesn't work because the steps run in the wrong order: the thousands separator is applied during the read, before the 'A$ ' has been removed:

import pandas as pd

from collections import defaultdict

file = 'mydatafile.csv'
data = pd.read_csv(file, thousands=',')
data.Credit = data.Credit.str.replace('A$ ','',regex=False)

sales = defaultdict(float)

for k,v in data.iterrows():
    sales[k] += float(v.Credit)

print(dict(sales))

There are a couple of similar questions; however, they lack answers or don't apply, e.g.:

Pandas: Read CSV: ValueError: could not convert string to float (I'm already using the thousands separator, without success)

Panda load csv string to float (again, not the same, and the solution is unrelated to my problem)

edit: I have also found this similar but opposite question, where someone wants to apply a format over the data, whereas I'd rather remove it.

Can I somehow apply a regex that encompasses both the removal of the 'A$ ' and the subsequent commas? Or is there a way to have the data be 'accepted' in the way spreadsheets 'ignore' currency symbols? I know this isn't a spreadsheet, but if pandas could be told that a string of this format is actually a float, that would solve my issue.
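For example, an untested sketch of the kind of thing I mean (strip every character that isn't a digit or a dot, then cast):

data.Credit = data.Credit.str.replace(r'[^0-9.]', '', regex=True).astype(float)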

edit: for the time being, I have implemented Björn's answer with an extra .str to make it work, such that:

data.Credit = data.Credit.str.replace('A$ ','',regex=False).str.replace(',','').astype(float)

complete code:

import pandas as pd

from collections import defaultdict

file = 'mydatafile.csv'
data = pd.read_csv(file, thousands=',')
data.Credit = data.Credit.str.replace('A$ ','',regex=False).str.replace(',','').astype(float)

sales = defaultdict(float)

for k,v in data.iterrows():
    sales[k] += float(v.Credit)

print(dict(sales))
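(Aside: because sales is keyed by the row index, the iterrows loop above is equivalent to sales = data.Credit.to_dict() when the index is unique, which avoids the slow row iteration.)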
Madivad
  • Does this answer your question? [How to convert a string to a number if it has commas in it as thousands separators?](https://stackoverflow.com/questions/1779288/how-to-convert-a-string-to-a-number-if-it-has-commas-in-it-as-thousands-separato) – Björn Apr 20 '20 at 13:20
  • Remove the substring as you do, and then use `locale.atof`, i.e. like this: `data.Credit.apply(locale.atof)` – Björn Apr 20 '20 at 13:23
  • See also [this](https://stackoverflow.com/questions/22137723/convert-number-strings-with-commas-in-pandas-dataframe-to-float) – Björn Apr 20 '20 at 13:24
  • Not really. I do know I can manually convert it myself, and I could even pre-process the CSV file; however, I would like to do this on import in one go because a) I am doing this repeatedly with different sets of data, and b) the database is huge. I don't really want to load it in and then iterate over it to change its format. I'm hoping to do it on import. *Edit: this comment was regarding the first comment. Thanks for the other info. Checking them now.* – Madivad Apr 20 '20 at 13:24

2 Answers


You can use a custom converter in read_csv:

...
import re
...
# capture the number that follows the 'A$' prefix
rx = re.compile(r'A\$\s*([0-9,.]+)')
# the converter strips the thousands commas from the match and casts to float
data = pd.read_csv(io.StringIO(t), converters={'Credit': lambda x: float(
    rx.match(x).group(1).replace(',', ''))})

This code expects the column to have a consistent format and does not handle formatting errors. If error handling is required, a full function with tests will be needed.
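Applied directly to your file instead of the inline test string, a minimal self-contained sketch (assuming the column is named Credit, as in your code) would be:

import re
import pandas as pd

# capture the numeric part that follows the 'A$' prefix
rx = re.compile(r'A\$\s*([0-9,.]+)')

def parse_credit(x):
    # drop the thousands commas from the captured group, then cast
    return float(rx.match(x).group(1).replace(',', ''))

data = pd.read_csv('mydatafile.csv', converters={'Credit': parse_credit})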


If the only additional cases are fields that contain a NaN string and/or blank characters, you can filter them out with another regex:

import numpy as np

rx = re.compile(r'A\$\s*([0-9,.]+)')
# matches fields that are blank or contain only 'nan' (case-insensitive)
empty = re.compile(r'\s*(?:nan)?\s*$', re.I)
data = pd.read_csv(io.StringIO(t), converters={'Credit': (
    lambda x: np.nan if empty.match(x) else float(
        rx.match(x).group(1).replace(',', '')))})
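Note that a converter receives the raw field as a string, so an empty field arrives as '' rather than as NaN; that is why the blank/NaN cases have to be caught before calling float().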
Serge Ballesta
  • ahhh, `converters`, I did not know about this; it does sound like what I am after! The format is consistent(-ish), except that when there is a thousands separator, because it is a CSV, that field will be surrounded by quotes. I suspect that could be handled in the regex though. Thanks, this gives me a great place to start – Madivad Apr 20 '20 at 14:11
  • as it turns out, a full function with tests is required. I just kept getting errors, and digging deeper I found there are occasional entries that are `NaN` or `' '`, which caused me endless worlds of pain. – Madivad Apr 20 '20 at 15:25
  • @Madivad: If you only have empty fields or NaN, another regex could be enough – Serge Ballesta Apr 20 '20 at 15:54

So, for visibility, here is one way to do it:

First remove the substring, as you already do:

data.Credit = data.Credit.str.replace('A$ ','',regex=False)  

Next, use a combination of the locale module and pandas' `.apply`:

import locale
# pick a locale whose thousands separator is ',' so atof can parse it
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
data.Credit = data.Credit.apply(locale.atof)
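Putting the two steps together (a sketch; the locale name assumes a system where en_US.UTF-8 is installed):

import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # a locale whose thousands separator is ','

data = pd.read_csv('mydatafile.csv')
data.Credit = data.Credit.str.replace('A$ ','',regex=False).apply(locale.atof)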


Another option would be to chain replace and convert via astype (note that the second replace also needs .str, because the first .str.replace returns a Series of strings):

data.Credit = data.Credit.str.replace('A$ ','',regex=False).str.replace(',','').astype(float)
Björn
  • Thanks, I think that will do it. I'll let you know when I get to implement it. – Madivad Apr 20 '20 at 13:34
  • Thanks; as written it didn't work, I got the same `ValueError: could not convert string to float: '1,440.00'`. However, after adding another `.str` in there, it did work. It's ugly, but it gets the job done. I would love a more elegant way to do this. Thanks again! – Madivad Apr 20 '20 at 14:05