
I'm effectively trying to read the following file into a DataFrame and write it out again, unchanged.

F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""

I would have expected that pandas.read_csv would interpret all fields in the above as strings (and thus dtype:object) except for F3.

However, F4 is being treated as a numeric field and given dtype:int64. This means that writing the data back again with quoting=csv.QUOTE_NONNUMERIC loses the quotes around F4.

pd.read_csv("test.txt", quotechar='"', sep=',')
Out[90]: 
     F1    F2   F3   F4     F5
0  blah  blah  123  123  123.0
1  blue  blue  456  456    NaN

pd.read_csv("test.txt", quotechar='"', sep=',').dtypes
Out[91]: 
F1    object
F2    object
F3     int64
F4     int64
dtype: object

This also creates a further problem: if any of the F5 values are an empty string (i.e. ""), F5 is instead treated as float64 with a NaN value for the empty strings. Those values will be written back as "", but all the actual values will now have a .0 appended as they're considered floats.
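To illustrate the round trip (a minimal sketch; the exact rendering of the NaN in F5 may vary by pandas version):

import csv
import io
import pandas as pd

df = pd.read_csv("test.txt", quotechar='"', sep=',')
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)
print(buf.getvalue())
# F1,F2,F3,F4,F5
# "blah","blah",123,123,123.0
# ...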

I've played around with the various options for the quoting parameter: quoting=csv.QUOTE_NONE is the only one which treats F4 and F5 as strings, but then I obviously end up with the actual quotation marks embedded in the data. I would have expected quoting=csv.QUOTE_NONNUMERIC to do what I want, but it treats all of F3, F4 and F5 as float64.
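For reference, the dtypes I get when reading with quoting=csv.QUOTE_NONNUMERIC (matching the float64 behaviour described above):

import csv
import pandas as pd

pd.read_csv("test.txt", quoting=csv.QUOTE_NONNUMERIC).dtypes
# F1     object
# F2     object
# F3    float64
# F4    float64
# F5    float64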

I feel like I'm maybe missing something fundamental here about how dtypes are determined. The only workaround I can think of is to read the header and first data line of every file, parse them for quoted fields, and build an explicit dtype dict on the fly (or read the file first with quoting=csv.QUOTE_NONE and do the same thing based on which columns contain quote marks), then read the file again with the explicit dtypes.
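For concreteness, the second variant might look something like this (a rough sketch; detect_dtypes is just an illustrative name, and it assumes the first data row has no separators embedded inside quoted fields):

import csv
import pandas as pd

def detect_dtypes(path):
    # Read only the first data row, leaving the quote characters in place
    sample = pd.read_csv(path, quoting=csv.QUOTE_NONE, nrows=1)
    # Any column whose sample value still starts with a quote is text
    return {col: str for col in sample.columns
            if str(sample[col].iloc[0]).startswith('"')}

dtypes = detect_dtypes("test.txt")
df = pd.read_csv("test.txt", quotechar='"', sep=',', dtype=dtypes)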

This seems like a long-winded (and slow!) alternative, so I'm hoping someone can point out something I've missed in the read_csv documentation. Thanks.

Extra detail in case anything is relevant:

  • I'm processing a large number of files each with a large (and different) set of columns, so I can't specify explicit dtypes for each one.
  • Data to be treated as text is in double quotes, remaining data is numeric.
  • Obviously I'm not leaving the data completely unchanged, but columns other than the ones I'm adding/amending should be unaffected.
  • The files are well-formed CSVs for the most part, with no extraneous spaces between separators.
  • Most have some extra header lines which I'm parsing first and skipping with header=nnn when I call the read_csv.
Mick O'Hea
  • why don't you try `df['F4'].astype(str)` and `df['F3'].astype(int)`? – technophile_3 Jan 31 '22 at 10:15
  • @technophile_3 "*I'm processing a large number of files each with a large (and different) set of columns, so I can't specify explicit dtypes for each one.*" – mozway Jan 31 '22 at 10:17
  • @Mick I have to say this is a very good question. I am wondering if this is even possible. I think the quotes are defining the **fields**, to ensure elements inside them are not evaluated as separators, but not the **content** nor the **type**. So, once the fields are set, is there really a difference between `123` and `"123"`? I would say probably not, as I would expect the parser to only use the quotes to build the fields, but then to have `123` in both cases. – mozway Jan 31 '22 at 10:32
  • Pandas interprets quoting as a way to include the delimiter in a column value, not as a "this is a string" indicator. You can easily end up needing to include the delimiter in a numeric column value if you're using commas as delimiters and European-style commas as decimal points. – user2357112 Jan 31 '22 at 10:37
  • Pandas suffers the same issue with `read_json` too. – dsz Aug 18 '23 at 04:28

4 Answers


What you want to achieve is unfortunately not possible.

The CSV quotes define the fields, not the content nor the type.

When the parser sees "123", "12,3" or "ab,c", it uses the quotes to keep the comma as part of the field, not to decide whether the content is numeric or anything else. The quotes themselves are not part of the content.

As @user2357112supportsMonica nicely pointed out in the comments, this is important, for example, for handling the European decimal separator.

So, back to your question, "123" and 123 are exactly the same in terms of content and you cannot do anything to extract type information from it…
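To illustrate (a minimal sketch with an in-memory file): with a European decimal separator, the quotes are what keep `12,3` a single field, and decimal=',' then parses it as the number 12.3:

from io import StringIO
import pandas as pd

data = 'price,qty\n"12,3",4\n"5,0",6\n'
df = pd.read_csv(StringIO(data), decimal=',')
print(df.dtypes)
# price    float64
# qty        int64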

mozway
  • I disagree - the CSV format is ill defined, but ONLY strings need quotes. Int's and floating points do not (unless you're in France, where you'll likely choose a different column delimiter.) – dsz Aug 18 '23 at 04:27
  • @dsz thank you for your comment, but please read the question again. `123,"123","1,23"` is valid CSV whether or not you like it. You do not always have the choice of separators. Most importantly, this is not what the question is about. OP asked whether the type could be inferred from the presence of quotes, which it cannot. Quoting integers doesn't make them strings. Everything is a string by default, and the parser automatically infers the dtypes. – mozway Aug 18 '23 at 05:59

The CSV format has no notion of type, and the standard library csv module only returns strings (after optionally stripping the quotes). Pandas' read_csv processes the CSV file the same way: after splitting into fields, it tries to guess a type. So the quotes are normally removed around numeric data, allowing it to be converted to a numeric type.

But you can declare the expected types:

df = pd.read_csv("test.txt", dtype={'F4': str, 'F5': str})

It gives as expected for df:

     F1    F2   F3   F4   F5
0  blah  blah  123  123  123
1  blue  blue  456  456  NaN

and for df.dtypes:

F1    object
F2    object
F3     int64
F4    object
F5    object
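
If enumerating the text columns per file is impractical, a blanket dtype=str keeps every column as text (a sketch; note that empty fields still become NaN unless you also pass keep_default_na=False):

df = pd.read_csv("test.txt", dtype=str, keep_default_na=False)

The trade-off is that F3 then comes back as text as well, so a QUOTE_NONNUMERIC round trip would quote it too.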
Serge Ballesta

To preserve the same format, you need to read and write the .csv with quoting=3 (csv.QUOTE_NONE):

pd.read_csv("test.csv", quoting=3).dtypes

output:

F1    object
F2    object
F3     int64
F4    object
F5    object

and then export with quoting=3:

df.to_csv('test_2.csv', quoting=3, index=False)

The contents should be identical to the original file:

F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""
drake10k
  • Unfortunately, `quoting=3` won't work. It just disables quoting. Try setting row 0, F4 to `"12,3"`; this should create a new, unwanted column. – mozway Jan 31 '22 at 10:53
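The failure mode is easy to see by splitting such a row with quoting disabled (a quick sketch using the standard csv module):

import csv

row = '"blah","blah",123,"12,3","123"'
print(next(csv.reader([row], quoting=csv.QUOTE_NONE)))
# ['"blah"', '"blah"', '123', '"12', '3"', '"123"']  -- six fields for five columns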

You can load the csv as a plain text file and then infer each column's dtype from the first data row; the algorithm has to be custom-made. (Note that StringIO simulates a loaded text file and is only used to give a copy-paste example. Normally you would load the csv text directly, e.g. as described here, or alternatively read the raw text with pd.read_csv() without delimiters first.)

import pandas as pd
from io import StringIO

txt = """F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",''"""

def infer_dtype(s):

    # string: quoted fields are treated as text
    if s.startswith('"') and s.endswith('"'):
        return 'object'

    # int
    try:
        int(s)
        return 'int'
    except ValueError:
        pass

    # float
    try:
        float(s)
        return 'float'
    except ValueError:
        pass

    # everything else
    return 'object'

def infer_col_dtypes(txt):
    # split off the header, then take only the first data line
    h, l = txt.split('\n', 1)
    h = h.split(',')
    # naive split: assumes no separators embedded inside quoted fields
    l = l.split('\n')[0].split(',')
    return {key: infer_dtype(val) for key, val in zip(h, l)}

dct_dtypes = infer_col_dtypes(txt)
df = pd.read_csv(StringIO(txt), sep=',', dtype=dct_dtypes)

print(df.dtypes)

# F1    object
# F2    object
# F3     int64
# F4    object
# F5    object
# dtype: object
Andreas