A noob question (revised): I read in a .csv file and tried to specify dtypes as follows:

import pandas as pd

cdc = pd.read_csv(
    'myFile.csv',
    dtype={
        'Phenotype': str,
        'State': str,
        'EventType': str,
        'EventYear': str,
        'AgeCategory': str,
        'NumberTested': str,
        'NumberResistant': str,
        'PercentResistant': str,
    },
)

But after reading the file into the DataFrame, I get:

cdc.dtypes

Phenotype           object
State               object
EventType           object
EventYear           object
AgeCategory         object
NumberResistant     object
PercentResistant    object
dtype: object

I thought instead I'd get dtypes of string for each column.

I'd like each column to be read as a string because some columns have a mixture of numbers and strings as you'll see below in the .csv example file. Once the file is read in I can start manipulating the gosh darn thing!

Bottom line: I want to clean up the data rows and replace "None Tested" and "Not Defined" with NaNs or zeroes. As far as I can figure out, I can't do that with columns of dtype 'object'.

FYI, I've also read in the .csv as a df with no 'dtype' parameter, but I get the same problem. After reading the file with no dtypes specified, I tried creating a new column of integers from existing columns, but the 'object' dtype seems to get in the way of that, too.

I'm stuck. I've looked around and I can't seem to figure this out myself.

Sample input .csv file here: (there are no extra lines between rows, I just tried to make the rows more readable)

Phenotype,State,EventType,EventYear,AgeCategory,NumberTested,NumberResistant,PercentResistant

Acinetobacter,AK,All HAIs,2011, 1-18,2,1,0.5

Acinetobacter,AK,CAUTI,2011, 1-18,0,None Tested,Not Defined

Acinetobacter,AK,CLABSI,2011, 1-18,0,None Tested,Not Defined

Acinetobacter,AK,SSI,2011, 1-18,0,None Tested,Not Defined

Acinetobacter,AK,All HAIs,2011,<1,2,2,1.0

Acinetobacter,AK,CAUTI,2011,<1,0,None Tested,Not Defined

Acinetobacter,AK,CLABSI,2011,<1,0,None Tested,Not Defined

Acinetobacter,AK,SSI,2011,<1,0,None Tested,Not Defined

Acinetobacter,AK,All HAIs,2011,19-64,(1-19),Insufficient Data,Insufficient Data
  • show your .csv. – eyllanesc Apr 08 '18 at 23:01
  • you prolly have `NaN` values in those columns – gold_cy Apr 08 '18 at 23:26
  • Also, strings will be stored as objects as strings do not have a fixed length. – ALollz Apr 08 '18 at 23:29
  • keep the types in quotes like: `dtype = { 'State': 'str', 'NumberTested' : 'int', 'PercentResistant': 'float'})` – YOLO Apr 08 '18 at 23:38
  • "...strings will be stored as objects as strings do not have a fixed length" That's helpful, I'll keep looking along those lines, but I've had no love trying to cast an 'object' to an 'int'. It seems to work in-line, but once I get out of that code block, the value is back to an object! – AZBlue Apr 09 '18 at 01:11
  • I tried enclosing the types in single quotes like the 'str' suggestion, but no luck. Thanks, tho! – AZBlue Apr 09 '18 at 01:12
  • I do have NaNs in the columns, so now I'm trying to just read them in as strings and convert to int, float, etc. in the code. But I keep having problems with the columns reverting back to 'object' once I do something like df['somecolumn'] = df['somecolumn'].astype('int') – AZBlue Apr 09 '18 at 01:15
  • @ALollz I did some research on your comment "...strings will be stored as objects as strings do not have a fixed length." My understanding is that strings are immutable: they are fixed (in length, too) and cannot be changed. They can be manipulated, i.e. copied and appended to, etc., into other instances of string, but the original cannot be changed. – AZBlue Apr 09 '18 at 03:28
  • @AZBlue I meant the length of the underlying bytes that store the data, which is important since pandas is built on NumPy. For instance, every int32 takes up the same number of bytes, whether your dataframe has the value 1 or 1523. So even though the string 'hello' is fixed to 5 characters, pandas doesn't have a fixed-width character string type (meaning all entries would be the same number of characters) to my knowledge, which is why everything becomes an object. – ALollz Apr 09 '18 at 03:49
  • You'll also run into this object type frequently in integer columns with NaN values, as numpy does not currently support a NaN integer representation – ALollz Apr 09 '18 at 03:51
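
The conversion problem discussed in the comments above (casting string columns that contain placeholders or NaN to numbers) can be sketched roughly as follows. This is only an illustration, not something from the thread: the in-memory `csv_text` is a made-up three-row sample, and it assumes a pandas version that has the nullable `"Int64"` extension dtype (0.24 or later).

```python
import io
import pandas as pd

# Hypothetical in-memory sample mirroring a few columns of the question's file.
csv_text = """NumberTested,NumberResistant,PercentResistant
2,1,0.5
0,None Tested,Not Defined
2,2,1.0
"""

# Read everything as strings, as the question does.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# errors="coerce" turns anything that is not a number ("Not Defined") into NaN.
df["PercentResistant"] = pd.to_numeric(df["PercentResistant"], errors="coerce")

# A plain NumPy int column cannot hold NaN, so convert to the nullable
# "Int64" extension dtype after coercing the placeholders to NaN.
df["NumberResistant"] = pd.to_numeric(
    df["NumberResistant"], errors="coerce"
).astype("Int64")

print(df.dtypes)
```

Note that the result of `astype` or `pd.to_numeric` has to be assigned back to the column, as done here; neither call modifies the DataFrame in place.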

1 Answer


I wanted to see 'str' returned when I used df.dtypes. Well, pandas is built on NumPy and stores Python strings in columns of dtype 'object', so .dtypes reports 'object' for string columns rather than 'str'. So, my values are indeed being read in as strings. Doh. I found the answer here: can not convert column type from object to str in python dataframe
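
A quick way to check this for yourself, sketched on a made-up Series rather than the real file: look at the Python type of each element instead of the column dtype. The last step assumes pandas 1.0 or later, which added a dedicated `"string"` extension dtype for anyone who wants the column itself to be labelled as text.

```python
import pandas as pd

# Hypothetical values; dtype=object mimics what read_csv(dtype=str) produced
# for the question's columns.
s = pd.Series(["2011", "None Tested", "0.5"], dtype=object)

print(s.dtype)              # the column dtype is reported as object
print(s.map(type).unique()) # but every element is a Python str

# True when every element in the column is a str.
all_strings = s.map(lambda v: isinstance(v, str)).all()
print(all_strings)

# Optional (assumes pandas >= 1.0): convert to the dedicated string dtype.
s2 = s.astype("string")
print(s2.dtype)
```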

This link is helpful, too, for newbies like me: How to get datatypes of all columns using a single command [ Python - Pandas ]?
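
For the original goal of replacing "None Tested" and "Not Defined" with NaNs or zeroes, one possible approach (a sketch, not what I actually ran) is to let `read_csv` treat those placeholders as missing via its `na_values` parameter. The `csv_text` below is a shortened copy of the sample rows from the question, and the name `cdc_filled` is just for illustration.

```python
import io
import pandas as pd

# Three rows copied from the question's sample file.
csv_text = """Phenotype,State,EventType,EventYear,AgeCategory,NumberTested,NumberResistant,PercentResistant
Acinetobacter,AK,All HAIs,2011, 1-18,2,1,0.5
Acinetobacter,AK,CAUTI,2011, 1-18,0,None Tested,Not Defined
Acinetobacter,AK,All HAIs,2011,<1,2,2,1.0
"""

# Treat the placeholder strings as missing values while parsing.
cdc = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["None Tested", "Not Defined", "Insufficient Data"],
)
print(cdc.dtypes)

# If zeroes are preferred over NaN, fill only the affected columns.
cdc_filled = cdc.fillna({"NumberResistant": 0, "PercentResistant": 0.0})
print(cdc_filled[["NumberResistant", "PercentResistant"]])
```

In the full file, values such as "(1-19)" in NumberTested are not covered by `na_values`, so that column would still come back as 'object' and need separate handling.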
