2

I would like pandas read_csv to properly read the following example text into a DataFrame:

"INDEX"|"COLUMN_STRING"|"COLUMN_INTEGER"|"COLUMN_EMPTY"|"COLUMN_EMPTY_STRING" 1|"string"|21||""

The file I need to parse has all the values that should be strings wraped with "". Values that should be NaN are without double quotes, like that: ||

I would like read_csv to keep all the "quoted" values as strings, also "", but it forces NaN as a default value for "".

If I use keep_default_na=False, it sets empty strings '' to both || and |""|.

Also, using dtype={"COLUMN_EMPTY_STRING": str} doesn't help.
Does anybody know the solution to this pickle?

Another possible solution, would be to use quoting=3. This would keep strings as "string", which could be solved after parsing. I cannot use it though, since I'm providing index_col argument, which raises error since it cannot find e.g. INDEX, because it reads "INDEX" from the file.

user3055017
  • 67
  • 1
  • 9

2 Answers2

0

Depending on the size of the file, one solution can be to pre-process the file in memory to replace all the occurrences of || with |NaN| using python io.StringIO

import io
from pandas.io.parsers import read_csv

with open( 'test.csv', 'r' )  as fin:
    with io.StringIO( fin.read( ).replace( '||', '|NaN|' )) as ios:
        df = read_csv( ios, na_values='NaN', sep='|',
                       index_col=["INDEX"], keep_default_na=False )

this is how df looks like:

      COLUMN_STRING  COLUMN_INTEGER  COLUMN_EMPTY COLUMN_EMPTY_STRING
INDEX                                                                
1            string              21           NaN                    

to verify:

print ( 'empty: >{}<, empty string: >{}<'.format( df.COLUMN_EMPTY[ 1 ], 
                                                  df.COLUMN_EMPTY_STRING[ 1 ] ))
empty: >nan<, empty string: ><
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • I tried your solution, but it assigns type int64 to a value like that `|"12"|`. I would like it to keep it as a string. But I agree on the part where you change `||` to `|NaN|`, although would like to find a solution without additional string converting. – user3055017 Dec 02 '13 at 10:32
  • @user3055017 i think if all the values in a column are quoted numbers, the column is converted to a number type; you may either specify `dtype={"COLUMN_STRING": str}` as one of the arguments, or do `df.COLUMN_STRING = df.COLUMN_STRING.astype( str )` after reading the data – behzad.nouri Dec 02 '13 at 18:47
0

There are three separate things going on in this question:

  1. Separation of columns using the pipe symbol, parameter: pandas.read_csv( sep='|' )
  2. Quote character being used is double quotes, parameter: pandas.read_csv( quotechar='"' )
  3. Data type for columns needs to be string or number, parameter: dtype={"COLUMN_EMPTY_STRING": 'string'} will force all values to primitive string including empty values
dank8
  • 361
  • 4
  • 20