I want to specify column data types when calling pandas `read_csv`. The first call below works; the second, which adds a `dtype` argument, fails. Why doesn't the second one work?

import io
import pandas as pd

csv = """foo,1234567,a,1 
foo,2345678,b,3 
bar,3456789,b,5 
"""

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"])

print(df)

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"], 
        dtype=["|S3", "np.int64", "|S1", "np.int8"])

print(df)

I've updated the question to make the example much simpler and, hopefully, clearer, at BrenBarn's suggestion. My real dataset is much larger, but I'd like to use the same method to specify types for all my data on import.

Don
  • Have you tried making a simpler dataset and trying with just one or two datatypes to see which one is causing the problem? – BrenBarn Sep 29 '13 at 18:06
  • I'll do that, though the error it throws now suggests (to my novice mind) that I'm not specifying correctly, not that there is a mismatch between my specification and the data. But I'll give it a shot and report back! – Don Sep 29 '13 at 18:28
  • pandas will convert a specified string dtype, like ``S20`` to ``object`` dtype which represents string types. Why is that a problem? This is the standard way of representing variable length strings (and is actually more efficient than a fixed ``S20`` dtype) – Jeff Sep 29 '13 at 18:43
  • @Jeff Oh, cool. So if `object` is more efficient than `string_` types, then I'm happy with that piece. I'd like to specify all my integer types at `int32` or less rather than `int64`, though. I guess I can try converting them post-import. – Don Sep 29 '13 at 19:02
  • You can either do that or specify specific columns (but seeing as you have so many, you can probably just do it after). – Jeff Sep 29 '13 at 19:18
  • @Jeff When I convert after import, I use the "np." prefix, but I'm still getting errors when I try to specify during import. I would like to learn how to do that if possible: can I specify `object` and `int32` dtypes in a csv import? Do I use `dtypes = ('object', 'int32', etc.)` or do I use some other syntax like `dtypes = ('str', 'np.int32', etc.)`? It seems like whatever I try, I still get `TypeError: data type not understood` – Don Sep 29 '13 at 19:32
  • See the [docs](http://pandas.pydata.org/pandas-docs/dev/io.html#specifying-column-data-types); basically ``dtype = { 'column_1' : np.int32, 'column_2' : np.int64 }``. You don't need to specify object as that will happen automatically for string-like columns – Jeff Sep 29 '13 at 19:49
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/38296/discussion-between-don-and-jeff) – Don Sep 29 '13 at 20:15
  • @BrenBarn I simplified the code as you suggested. Curious what you make of it. Do you know of any way to specify a non-numeric column (string, text, object, etc.) in pandas? – Don Sep 30 '13 at 00:20
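
The post-import conversion discussed in the comments above can be sketched roughly as follows. This is only an illustration, assuming the same toy data as the question and assuming that every value fits in the smaller integer types; `astype` with a column-to-dtype mapping is used here as one way to do it, not necessarily what was used on the real dataset.

```python
import io

import numpy as np
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Let pandas infer the types first; integer columns typically come back as int64.
df = pd.read_csv(io.StringIO(csv), names=["fb", "num", "loc", "x"])

# Downcast the integer columns after import.
# Assumption: the values fit in int32 and int8 respectively, as they do here.
df = df.astype({"num": np.int32, "x": np.int8})

print(df.dtypes)
```

Columns not named in the mapping keep whatever dtype pandas inferred for them.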

1 Answer

As Jeff indicated, my syntax was bad. The column names and types have to be paired up in a dict that maps each name to its dtype, rather than passed as a list. The code below works, but note that you can't specify a fixed string width as a dtype; a string column can only be declared as `object`.

import io

import numpy as np  # needed for np.int64 and np.int8 below
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

df = pd.read_csv(io.StringIO(csv),
        names = ["fb", "num", "ab", "x"], 
        dtype = {"fb" : object, "num" : np.int64, "ab" : object, "x" : np.int8})
print(df)
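
For a dataset with many columns, the name-to-type pairing described above could be built from two parallel lists instead of typing the dict by hand. This is a minimal sketch under that assumption; the `names`, `types`, and `dtype_map` variables are illustrative and not part of the original code.

```python
import io

import numpy as np
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Parallel lists: one dtype per column name, in the same order.
names = ["fb", "num", "ab", "x"]
types = [object, np.int64, object, np.int8]

# Pair each name with its dtype; read_csv accepts this mapping for `dtype`.
dtype_map = dict(zip(names, types))

df = pd.read_csv(io.StringIO(csv), names=names, dtype=dtype_map)
print(df.dtypes)
```

Note that the dtypes are passed as actual objects (`np.int64`), not as strings such as `"np.int64"`.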
Don
  • Right, that's why I was asking about the simplification. I was thinking that if you tried to simplify it down you would maybe find out it didn't work at all, even for numeric types (although I didn't know for sure). It still seems lame that you can't specify an actual string dtype though. – BrenBarn Sep 30 '13 at 00:45
  • pandas doesn't support the internal string types (in fact they are always converted to object). – Jeff Sep 30 '13 at 01:30
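
One way to see what the comments above describe is to read the data without any `dtype` argument and inspect what the text columns end up as. This is only a sketch on the toy data; depending on the pandas version, the text columns may be reported as `object` or as a newer pandas string dtype, but in neither case as a fixed-width NumPy byte-string type such as `S3`.

```python
import io

import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

df = pd.read_csv(io.StringIO(csv), names=["fb", "num", "ab", "x"])

# Print each column's dtype and its one-letter kind code
# ("i" is signed integer, "S" would be fixed-width bytes).
for name, dtype in df.dtypes.items():
    print(name, dtype, dtype.kind)

# Collect the kind codes of the two text columns for inspection.
text_kinds = {name: df[name].dtype.kind for name in ["fb", "ab"]}
```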