3

I have imported a csv file using pandas.

My dataframe has multiple columns titled "Farm", "Total Apples" and "Good Apples".

The numerical data imported for "Total Apples" and "Good Apples" contains commas to indicate thousands e.g. 1,200 etc. I want to remove the comma so the data looks like 1200 etc.

The variable type for the "Total Apples" and "Good Apples" columns comes up as object.

I tried using df.str.replace and df.strip but have not been successful.

Also tried to change the variable type from object to string and object to integer but couldn't make it work.

Any help would be greatly appreciated.

****EDIT****

Excerpt of data from csv file imported using pd.read_csv:

Farm_Name   Total Apples    Good Apples
EM  18,327  14,176
EE  18,785  14,146
IW  635 486
L   33,929  24,586
NE  12,497  9,609
NW  30,756  23,765
SC  8,515   6,438
SE  22,896  17,914
SW  11,972  9,114
WM  27,251  20,931
Y   21,495  16,662
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
djhc
  • 115
  • 2
  • 2
  • 7

2 Answers2

7

I think you can add parameter thousands to read_csv, then values in columns Total Apples and Good Apples are converted to integers:

Maybe your separator is different, dont forget change it. If separator is whitespace, change it to sep='\s+'.

import pandas as pd
import io

temp=u"""Farm_Name;Total Apples;Good Apples
EM;18,327;14,176
EE;18,785;14,146
IW;635;486
L;33,929;24,586
NE;12,497;9,609
NW;30,756;23,765
SC;8,515;6,438
SE;22,896;17,914
SW;11,972;9,114
WM;27,251;20,931
Y;21,495;16,662"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";",thousands=',')
print df
   Farm_Name  Total Apples  Good Apples
0         EM         18327        14176
1         EE         18785        14146
2         IW           635          486
3          L         33929        24586
4         NE         12497         9609
5         NW         30756        23765
6         SC          8515         6438
7         SE         22896        17914
8         SW         11972         9114
9         WM         27251        20931
10         Y         21495        16662
print df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
Farm_Name       11 non-null object
Total Apples    11 non-null int64
Good Apples     11 non-null int64
dtypes: int64(2), object(1)
memory usage: 336.0+ bytes
None
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This worked but I had to add this to my read_csv line: `sep=None, thousands=',', engine='python'` Thank you for your help @jezrael! – djhc Apr 07 '16 at 14:52
1

try this:

locale.setlocale(locale.LC_NUMERIC, '')
df = df[['Farm Name']].join(df[['Total Apples', 'Good Apples']].applymap(locale.atof))
Grr
  • 15,553
  • 7
  • 65
  • 85
  • I tried this and this error message appeared: `ValueError: ('could not convert string to float: -', u'occurred at index Farm')` I then tried specifying a column using `df[[2]].applymap(locale.atof)` but this error appeared: `ValueError: ('invalid literal for float(): 1,200', u'occurred at index Total Apples')` – djhc Apr 07 '16 at 13:17
  • Can you post an excerpt of your data? – Grr Apr 07 '16 at 13:45
  • 1
    have you tried calling `df['Total Apples'].apply(locale.atof)`? – Grr Apr 07 '16 at 13:50
  • Edited original question @Grr to include data. I tried your suggestion but got this error message: `TypeError: unsupported operand type(s) for /: 'str' and 'str'` – djhc Apr 07 '16 at 14:14
  • Thank you for your help @Grr. Couldn't get it to work but managed it using the answer below. – djhc Apr 07 '16 at 14:53