Suppose I have a CSV file named a.csv that looks like this:

foo,bar
aaa,"1234"
bbb,"5678"

I'd like to read in the bar column as strings, as indicated by the double-quotes.

However, when I run the following:

import pandas as pd

d = pd.read_csv("a.csv")
print(d.dtypes)

It returns:

foo    object
bar     int64
dtype: object

I've tried various combinations of the quoting and quotechar parameters, but can't seem to get bar recognized as strings (i.e. having object rather than int64 as its dtype).

Is it possible to achieve this without explicitly specifying the column type via the dtype parameter?

EDIT: I don't believe the referenced question answers my question. That reference describes how strings are stored as objects, but it doesn't explicitly answer why a quoted column, i.e. bar, is NOT cast to strings. Nor does it answer how to get a string column out of bar.

EDIT 2: I should clarify that I don't necessarily need a str type for the bar column; rather, I DO NOT want it to come back as int64. It is a bit odd to me that, despite there being explicit quotes around the bar column entries, pandas chooses to ignore the quotes and select int64 for bar's type.
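
One way to see what happens to the quotes is to parse the same text with the standard-library csv module, whose quoting constants `read_csv` reuses. This is only a sketch: the `RAW` string and the `parse` helper are made up for illustration and stand in for a.csv. Under the default `csv.QUOTE_MINIMAL` the quotes are treated as delimiters and removed, so the field arrives as the bare text 1234 before any type inference could look at it; under `csv.QUOTE_NONE` the quote characters stay in the value.

```python
import csv
import io

# Hypothetical in-memory copy of a.csv so the sketch is self-contained.
RAW = 'foo,bar\naaa,"1234"\nbbb,"5678"\n'


def parse(quoting):
    """Parse RAW with the given csv quoting constant and return all rows."""
    return list(csv.reader(io.StringIO(RAW), quoting=quoting))


# Default behaviour: quotes delimit the field and are dropped.
minimal_rows = parse(csv.QUOTE_MINIMAL)

# No quote processing: the quote characters are kept as part of the value.
raw_rows = parse(csv.QUOTE_NONE)

print(minimal_rows[1])  # ['aaa', '1234']
print(raw_rows[1])      # ['aaa', '"1234"']
```

The csv module itself returns every field as a str; the conversion to int64 is a separate inference step that pandas applies to the already unquoted text.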

  • This was marked as a duplicate, but I am unsure why. I provided justification for how this is unique in the EDIT section above. – Tom Jul 29 '19 at 22:24
  • try: `df['bar'] = df['bar'].astype(str)` to do the conversion you are looking for. – Evan W. Jul 29 '19 at 22:32
  • @EvanW. That won't work. @Tom I proposed the duplicate because the answers there explain clearly that strings are stored as objects. That means that the `dtype` of strings, when stored inside a pandas dataframe, is `object`. Full stop. There is nothing you can do, no way to cast it back as strings. However, if you get the elements back from the pandas dataframe, e.g. if you do `df['bar'].tolist()` to get a normal Python list, that will be a list of strings. – Valentino Jul 29 '19 at 22:52
  • @Valentino, what are you looking to do with the non-object dtype column that you need it to be cast as strings? – Evan W. Jul 29 '19 at 22:59
  • @EvanW. In that case, your answer works. But the `dtype` will be `object`, not `string`. The OP is asking how to get a `dtype = string` and this is not possible. – Valentino Jul 29 '19 at 23:02
  • @Valentino, I got you mixed up with the OP. My apologies. – Evan W. Jul 29 '19 at 23:03
  • @Tom, if you check the type of "1234", it displays as class str, so Python considers "1234" a string. However, when you read the CSV file, it is automatically converted to int64. If you want this field treated as a string, you need to say so explicitly, as below: `d = pd.read_csv("a.csv", dtype={'bar': str})` `print(d.dtypes)` Hope this answers your question. – SUN Jul 29 '19 at 23:57
  • Please refer to this link: https://stackoverflow.com/questions/13293810/import-pandas-dataframe-column-as-string-not-int – SUN Jul 29 '19 at 23:59
  • @EvanW thank you but I was looking for a way to perform it directly in the read_csv call and not after – Tom Jul 30 '19 at 02:52
  • @Valentino please see Edit 2 for additional follow-up and clarification – Tom Jul 30 '19 at 02:52
  • @SUN I guess my question is WHY does it choose INT64 despite there being quotes? I understand I can cast with the dtype input parameter but I was surprised it needed to be explicit given the input contains quotes. What kind of INT64 has quotes? – Tom Jul 30 '19 at 02:54
  • @Valentino @WeNYoBen I think the question is more about handling quotes in `read_csv` rather than making columns into strings. I would say it is not a duplicate and I propose to reopen it. – GZ0 Jul 30 '19 at 03:15
  • Now it is clearer what the OP is asking. Then use `df = pd.read_csv(path_of_file, quoting=3)`. The `quoting=3` will do the trick. See [read_csv docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more details on the `quoting` parameter. However, this will also keep the quote characters in the values. – Valentino Jul 30 '19 at 10:56
  • Using `quoting` makes the dtype object, but do we really want to see the `"` as part of the data? I believe the column should contain only the number. – SUN Jul 30 '19 at 11:45
  • @SUN agree. I had assumed the quotes around each entry would signify that a string is enclosed. – Tom Jul 30 '19 at 14:34
  • The interpretation of quotes is controlled by the `quoting` parameter as defined in the `csv` module https://docs.python.org/3/library/csv.html#module-contents (the definitions are at the end of the section). – GZ0 Jul 30 '19 at 18:58
  • Usually a value is wrapped in quotes only when it contains a special character; otherwise it is plain text. That could be the idea behind `read_csv`: alphanumeric values or values with special characters are treated as object, whereas purely numeric values are implicitly converted to int64. – SUN Jul 31 '19 at 10:14
  • So let's recap here. (1) If I have quotes around the `bar` column entries, I can read the column in as an `object` type with `quoting=3`, but it keeps the quotes. (2) If I don't have quotes around the entries, there is no direct parameter in `read_csv` (other than an explicit `dtype=` setting) that tells it the column is an object and not an integer. Is that correct? – Tom Jul 31 '19 at 10:25
  • According to my research and investigation, yes I agree with you. – SUN Jul 31 '19 at 10:37
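
The options discussed in the comments can be compared side by side. This is a minimal sketch, not a definitive answer: the `RAW` string and the `load` helper are hypothetical and only stand in for a.csv, and the final step of stripping the quotes after a `quoting=3` read is an assumption about what one might want, not something the thread proposes.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of a.csv so the sketch is self-contained.
RAW = 'foo,bar\naaa,"1234"\nbbb,"5678"\n'


def load(**kwargs):
    """Read RAW with pandas, forwarding any read_csv keyword arguments."""
    return pd.read_csv(io.StringIO(RAW), **kwargs)


# Default read: quotes are dropped and bar is inferred as an integer column.
inferred = load()

# Explicit dtype: bar is kept as text, without the quote characters.
as_text = load(dtype={"bar": str})

# quoting=3 (csv.QUOTE_NONE): bar stays text, but the quotes are part of it.
kept_quotes = load(quoting=3)

# Assumed follow-up step: remove the surrounding quotes after reading.
stripped = kept_quotes["bar"].str.strip('"')

print(inferred.dtypes)
print(as_text["bar"].tolist())
print(kept_quotes["bar"].tolist())
print(stripped.tolist())
```

Of the three, only the explicit `dtype` read returns clean text in a single `read_csv` call; the `quoting=3` route avoids naming the column but needs the extra cleanup step.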
