Pandas data pull - messy strings to float

Question

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:

Rank    Corporation Sector  Headquarters    Revenue (thousand PLN)  Profit (thousand PLN)   Employees


1.ÿ PKN Orlen SA    oil and gas P?ock   79 037 121  2 396 447   4,445

2.ÿ Lotos Group SA  oil and gas Gda?sk  29 258 539  584 878 5,168

3.ÿ PGE SA  energy  Warsaw  28 111 354  6 165 394   44,317

4.ÿ Jer¢nimo Martins    retail  Kostrzyn    25 285 407  N/A 36,419

5.ÿ PGNiG SA    oil and gas Warsaw  23 003 534  1 711 787   33,071

6.ÿ Tauron Group SA energy  Katowice    20 755 222  1 565 936   26,710

7.ÿ KGHM Polska Mied? SA    mining  Lubin   20 097 392  13 653 597  18,578

8.ÿ Metro Group Poland  retail  Warsaw  17 200 000  N/A 22,556

9.ÿ Fiat Auto Poland SA automotive  Bielsko-Bia?a   16 513 651  83 919  5,303

10.ÿ    Orange Polska   telecommunications  Warsaw  14 922 000  1 785 000   23,805

I have two serious problems with it that I cannot seem to find solution for:

1) data in "Ravenue" and "Profit" columns is pulled in as strings because of funny formatting with spaces between thousands, and I cannot seem to figure out how to make Pandas translate into floating point values.

2) Data under "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I am trying to re-write this data with something more appropriate like "1.", "2." etc. the DataFrame just does not budge.

Ideas? Suggestions? I am also open for outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)

"Funny formatting with spaces between thousands" is actually a Polish (and also Russian) way to separate thousands. Unfortunately, the Polish number locale does not seem to be supported, but the first answer to this question http://stackoverflow.com/questions/42937460/how-to-set-a-custom-thousands-separator offers a workaround. As for the second question, I am not sure where the 'ÿ's come from (they actually represent the end of file, EOF), but `df['Rank']=df['Rank'].str.strip('ÿ')` should take care of them. — DYZ, Apr 13 '17 at 05:54
1. You will find you'll often have to 'clean' your data before converting or using it. Try `df['Revenue'] = df['Revenue'].apply(lambda value: float(value.replace(' ','')))` and as for 2, Can you share a sample row from your original CSV? — Quitty, Apr 13 '17 at 05:56
@Quitty Avoid calling `apply` unless absolutely necessary, as it slows down your code by orders of magnitude. `df['Revenue'] = df['Revenue'].str.replace(' ','').astype(float)` is way more efficient. — DYZ, Apr 13 '17 at 05:59
Thanks for answers so far guys, calling df['Rank']=df['Rank'].str.strip('ÿ') on it does not change anything, numbers are still printed as "1.?". Is there any way I could actually just re-write it with proper "1." - "10." strings? — Greem666, Apr 13 '17 at 06:14

score 0 · Answer 1 · answered Apr 13 '17 at 06:15

I would use the converters parameter.

pass this to your pd.read_csv call

def space_float(x):
    return float(x.replace(' ', ''))

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip
}

pd.read_csv(... converters=converters ...)

Pandas data pull - messy strings to float

1 Answers1