
I ran across a new problem today: some data I'm working with looks like this (in a CSV file):

Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Felony - Drug,1989,30-31,31,"3,861","1,176","328,235"
Male,White,Felony - Other,1989,30-31,31,"2,626",800,"328,235"
Male,White,Felony - Property,1989,30-31,31,"3,468","1,057","328,235"
Male,White,Felony - Violent/Sex,1989,30-31,31,"3,150",960,"328,235"
Male,White,Misdemeanor,1989,30-31,31,"46,441","14,149","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"

It's hard to see the problem, so let me highlight the second-to-last column:

"18,141"
"1,176"
800
"1,057"
960
"14,149"
0

The problem is that values containing commas are exported as quoted strings, while values without commas are exported as bare numbers. To be clear, the data should be interpreted as:

18141
1176
800
1057
960
14149
0

That is, it should all be interpreted as numeric values.

This makes me think that some "standard" application is exporting data like this. For the moment, let's say that it is Excel.

Is there any effective way to import flat files with this varying data type within the same column? Both R (read_csv from the readr library) and Python's pandas (read_csv), with their default settings, interpreted this data by:

  • Presuming the values should all be numbers (regardless of whether quotes are present in all "cells").
  • Presuming that the commas must therefore be European-style decimal separators (rather than US-style thousands separators).
  • So both packages interpreted that column as follows:

    18.141
    1.176
    800
    1.057
    960
    14.149
    0
    

In a way, it's impressive that both packages could handle this incongruity and get the guesses almost right.

However, is there a flag that I can set, or a package out there, which can handle this sort of thing? For instance, a flag to say "remove quoted commas; they are most certainly not European decimal separators standing in for the US decimal point."

If not, is there enough of a need to contribute to this via forking either of their GitHub repos?

Mike Williamson
  • If you have inconsistent formatting within the same column, then you need to perform some pre-/post-processing – EdChum Sep 17 '15 at 18:43
  • @Prune I edited the question, so that it is hopefully clearer. In short, I do need what you wrote. – Mike Williamson Sep 17 '15 at 20:19
  • @EdChum Yes, you're absolutely right that some pre-processing is needed. I took care of it. However, this appears to be a common means of exporting by some apps. In fact, I just tested this on Excel, and if I place the comma, it will export as shown here (different format based upon whether row has a comma). My goal, here, is to be able to properly interpret this column without a problem. It is "easy" if, for instance, a flag is added where I can say "not European", therefore, any number can be scrubbed for commas, instead of interpreting as a decimal. – Mike Williamson Sep 17 '15 at 20:24
  • What OS are you on? We can probably do this quickly at the command line – Rich Scriven Sep 17 '15 at 20:50
  • @RichardScriven Yes, command line pre-processing can handle it, and I know how to do that. I'm instead looking to more effectively use the tools themselves (R or Pandas). This will allow for a more generalized solution across a wider range of data munging problems. – Mike Williamson Sep 22 '15 at 21:58

1 Answer


pandas.read_csv has a thousands parameter, which you can set to a comma so that pandas will read your column as:

0    18141
1     1176
2      800
3     1057
4      960
5    14149
6        0
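For example, a minimal sketch using a couple of rows from the question (the file has no header row, so header=None is assumed and columns are referenced by index):

```python
import io
import pandas as pd

# Two sample rows from the question's CSV (no header row in the file)
data = '''Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"'''

# thousands=',' tells pandas to treat commas inside numbers as
# grouping marks, so "18,141" parses as the integer 18141
df = pd.read_csv(io.StringIO(data), header=None, thousands=',')
print(df[7].tolist())  # -> [18141, 0]
```

Both the quoted values with commas and the bare values end up in a single integer column.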

There is also a converters parameter, which takes a dictionary mapping columns to functions applied to each value in that column. You can use it for more complex preprocessing; something like this does the same thing:

pd.read_csv('data.csv', converters={'column_name': lambda x: int(x.replace(',',''))})
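A runnable sketch of the converters approach on the same sample rows ('column_name' above is a placeholder; with a headerless file you key the converter by column index instead):

```python
import io
import pandas as pd

# Two sample rows from the question's CSV (no header row in the file)
data = '''Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"'''

# converters always receive the raw field as a string,
# so we strip the grouping commas before converting to int
strip_commas = lambda x: int(x.replace(',', ''))
df = pd.read_csv(io.StringIO(data), header=None, converters={7: strip_commas})
print(df[7].tolist())  # -> [18141, 0]
```

Unlike thousands=',', which applies file-wide, a converter targets just the columns you name, so it generalizes to other per-column cleanup.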

hellpanderr
  • While "thousands" does solve the problem, I was looking more generically for a way to say that the quote and/or escape characters may vary WITHIN the column. Your converters parameter will do exactly that, in a more generic fashion! Thanks! – Mike Williamson Sep 22 '15 at 21:54