
While trying to answer this question, Get List of Unique String per Column, I ran into a different problem with my dataset. When I import this CSV file into a dataframe, every column comes in as object dtype. I need to convert the columns that contain only numbers to a numeric dtype and keep the non-numeric columns as strings.

Is there a way to achieve this?

Download the data sample from here

I have tried the following code from the article Pandas: change data type of columns, but it did not work.

df = pd.DataFrame(a, columns=['col1','col2','col3'])

As always, thanks for your help.

racekiller
  • Automatic conversion of all columns to object type usually happens when there are commas or other non-numeric characters in otherwise numeric columns. You could try something like `df.replace(",", "", regex=True).astype(np.int64)` to remove those characters and convert the data to a numeric type – Gene Burinsky Sep 30 '16 at 22:37
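
A minimal sketch of the approach from that comment, assuming the numeric columns contain only digits and thousands separators (the column names and values below are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1,000', '2,500'], 'col2': ['3', '4']})

# strip the commas, then convert; assumes every value is numeric once the commas are gone
cleaned = df.replace(',', '', regex=True).astype(np.int64)
print(cleaned.dtypes)  # both columns become int64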

1 Answer


Option 1
Use `pd.to_numeric` in an `apply`:

df.apply(pd.to_numeric, errors='ignore')
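
For example, on a small made-up frame mixing numeric, non-numeric, and mixed columns (note that `errors='ignore'` has since been deprecated in recent pandas releases, so this reflects pandas as of the time of the answer):

import pandas as pd

df = pd.DataFrame({'a': ['1', '2'], 'b': ['x', 'y'], 'c': ['1.5', 'oops']})
out = df.apply(pd.to_numeric, errors='ignore')
print(out.dtypes)
# a      int64   <- fully numeric column is converted
# b     object   <- non-numeric column is left alone
# c     object   <- mixed column stays object because one value fails to parse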

Option 2
Use `pd.to_numeric` on `df.values.ravel()`:

# coerce every value element-wise, then reshape back to the original frame's shape
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
# where coercion produced NaN, fall back to the original (non-numeric) value
pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)

Note
These are not exactly the same. For a column that contains mixed values, option 2 converts what it can, while option 1 leaves everything in that column as object. Looking at your file, I'd choose option 1.
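
A small illustration of that difference, using a hypothetical frame with one mixed column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'mixed': ['1', 'two', '3'], 'clean': ['4', '5', '6']})

# Option 1: the mixed column stays entirely object because one value fails to parse
opt1 = df.apply(pd.to_numeric, errors='ignore')

# Option 2: element-wise coercion, falling back to the original value where parsing failed
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
opt2 = pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)

print(opt1['mixed'].tolist())  # ['1', 'two', '3']  -- left as strings
print(opt2['mixed'].tolist())  # [1.0, 'two', 3.0]  -- numbers converted where possible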


Timing

df = pd.read_csv('HistorianDataSample/HistorianDataSample.csv', skiprows=[1, 2])

[Timing results image: option 1 (`df.apply`) is roughly 20x faster than option 2 on this file.]

piRSquared
  • @Jeff how do I apply `pd.to_numeric` to an entire dataframe? That's what I'm trying to do here. Further, I could've stacked `df` first, but each column may be different. `errors='ignore'` will stop the conversion if any element in the stacked series doesn't convert, and `errors='coerce'` will turn any non-numeric value into `nan`. I can only think of using `apply` to operate on each column separately. It will still be vectorized for each column. – piRSquared Sep 30 '16 at 23:04
  • use .ravel() and reshape – Jeff Sep 30 '16 at 23:17
  • @Jeff using apply in this case still seems a better option. – piRSquared Sep 30 '16 at 23:50
  • @Jeff If I do `pd.to_numeric( df, errors='coerce' )`, I get `TypeError: arg must be a list, tuple, 1-d array, or Series`, but if use `df.apply( pd.to_numeric, errors='coerce' )`, it works. – SebMa Jul 09 '18 at 17:34
  • @SebMa `apply` takes `pd.to_numeric` and applies it to each column of the dataframe. When you pass the whole dataframe to the function, `pd.to_numeric(df)` doesn't know what to do. In the example above, I force the dataframe to be one-dimensional with `ravel` and then reshape the results back to the same dimensions as `df`. The point is, `pd.to_numeric(my_dataframe)` is not expected to work: `pd.to_numeric` works on a scalar or on a 1-dimensional thing. `pd.to_numeric(pd.Series(['1', '2']))` and `pd.to_numeric('3')` both work, but `pd.to_numeric([['1', '2'], ['3', '4']])` does not. – piRSquared Jul 09 '18 at 17:52 (see the sketch after these comments)
  • @piRSquared Thanks. In your timing results `df.apply` seems to be 20x faster than `pd.to_numeric`. If `pd.to_numeric` is vectorized, why is it slower? – SebMa Jul 09 '18 at 17:59
  • @SebMa, Jeff is the man! I suspect he missed that I was applying over a dataframe of columns and thought I was applying over elements of a series. In the case of a series, `pd.Series(['1', '2']).apply(pd.to_numeric)` is absolutely silly. In this specific case, when we apply, `pd.to_numeric` is used over each column in a vectorized way. The results would be different if we had hundreds or thousands of columns. – piRSquared Jul 09 '18 at 18:02
  • @piRSquared Sorry, I replied to the wrong person. Thanks for your answer :) – SebMa Jul 09 '18 at 22:46
  • @piRSquared In the _**Note**_ it says _"option 2 converts what it can while option 2 leaves everything in that column an object"_, so which one is which? Thx. – Alex R. Apr 29 '23 at 18:44
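
To make the dimensionality point from these comments concrete, here is a minimal sketch (values made up):

import pandas as pd

print(pd.to_numeric(pd.Series(['1', '2'])))  # works: a 1-d Series converts element-wise
print(pd.to_numeric('3'))                    # works: a scalar converts to 3

df = pd.DataFrame([['1', '2'], ['3', '4']])
# pd.to_numeric(df)                          # TypeError: arg must be a list, tuple, 1-d array, or Series
print(df.apply(pd.to_numeric))               # works: apply hands each column (a Series) to pd.to_numeric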