
I use Python 3.7 to automate some processes that involve pandas DataFrames.

The problem I have is as follows.

Using this code:

    data = pd.io.parsers.read_csv(basepath + files[0], sep='|', header=None, index_col=None, dtype={'2': 'str'}, skiprows=2, usecols=[2, 3, 10, 18, 17, 1])

The file is huge, so it's impossible to track every value that starts with 00, and not all the numbers are 10 characters long; some are 9, it varies.

I expect results like this:

4   12345   abcd   P1234   A1234

but in some lines column 2 starts with 00. The DataFrame automatically assumes it's an integer and strips the leading zeros to be efficient, so where it should be:

4   00123   abcd   P1234   A1234

I end up with:

4   123   abcd   P1234   A1234

So I checked the pandas documentation and tried adding dtype, but it doesn't work for me. Any suggestions on how to make it work?

Lunamax

2 Answers


Your combination of header=None and dtype={'2': 'str'} is problematic. When pandas parses column headers it always uses the string representation. For a file like test.csv we get

1,2.0,2,7
1,2,03,03
1,00,3,01

pd.read_csv('test.csv').columns
#Index(['1', '2.0', '2', '7'], dtype='object')

However, when specifying header=None, pandas instead creates an Int64Index:

pd.read_csv('test.csv', header=None).columns
#Int64Index([0, 1, 2, 3], dtype='int64')

So if you want the column with header '2' to be a string dtype, you need to remove header=None; or, if you just want the second column (counting from 0), you need to use the integer 2 as the key in dtype.

pd.read_csv('test.csv', header=None, dtype={2: 'str'})
#   0    1   2  3
#0  1  2.0   2  7
#1  1  2.0  03  3
#2  1  0.0   3  1

pd.read_csv('test.csv', dtype={'2': 'str'})
#   1  2.0   2  7   # <- This row is now the string column headers
#0  1    2  03  3
#1  1    0   3  1
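A minimal, self-contained sketch of the point above, using `io.StringIO` in place of a real file (the pipe-delimited sample data is made up for illustration):

```python
import io
import pandas as pd

csv = "4|00123|abcd\n4|12345|efgh\n"

# With header=None, columns are labeled 0, 1, 2, ... so the dtype
# key must be the integer 1, not the string '1'.
df = pd.read_csv(io.StringIO(csv), sep='|', header=None, dtype={1: str})

print(df[1].tolist())  # ['00123', '12345'] - leading zeros preserved
```

With the string key `'1'` instead, the dtype mapping would silently fail to match any column and the leading zeros would be stripped.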
ALollz
  • Wow, thanks, that's awesome, new info to me. But doesn't it then take the 1st line of the csv file as its header? So the name of column 2 will be whatever data is in the 1st row of the 2nd column, which to me is data loss, or am I wrong? I wanted the whole column to be a string, not just the header, if I understand you correctly. – Lunamax Dec 19 '19 at 20:16
  • Not if you specify `header=None`. That basically assumes there is absolutely no header, so pandas creates a RangeIndex and uses whatever data you have as the first row of your DataFrame. – ALollz Dec 19 '19 at 20:19
  • So when I do `header=None` it basically indexes the columns from 0 to whatever the last column is, but if I don't use it, it creates the header from the 1st row, is that correct? I'm trying to automate my work where I deal with about 350 files, each containing about 1000 - 270k rows with approx. 40 columns, and they change daily. I would have to specify column 2 for every one of them. – Lunamax Dec 19 '19 at 20:26
  • Yes. But it gets complicated with `skiprows`. The order is that first the rows are skipped, then if `header='infer'` (the default) it will use the first row after the skips as the column labels. If `header=None` it will always create an IntIndex from 0 to however many columns you have, and skiprows doesn't change that. – ALollz Dec 19 '19 at 20:29
  • Problem is I have no choice but to use `skiprows`, because the file's header contains some data for later use, about 10 - 12 columns not related to the data, while the raw data itself is well over 30 columns, so the DataFrame doesn't go along with it and throws an error about exceeding the header's column count. – Lunamax Dec 19 '19 at 20:32
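The skiprows/header ordering described in the comments can be sketched like this (again with `io.StringIO` standing in for a file; the sample rows are invented):

```python
import io
import pandas as pd

text = "meta line 1\nmeta line 2\na|b|c\n1|00|3\n"

# skiprows is applied first; with the default header='infer',
# the first remaining row ('a|b|c') becomes the column labels.
inferred = pd.read_csv(io.StringIO(text), sep='|', skiprows=2)
print(list(inferred.columns))  # ['a', 'b', 'c']

# With header=None the labels are always 0..n-1, and the row
# 'a|b|c' stays in the data as a regular row.
positional = pd.read_csv(io.StringIO(text), sep='|', skiprows=2, header=None)
print(list(positional.columns))  # [0, 1, 2]
```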

I bypassed the issue as follows.

    data = pd.io.parsers.read_csv(basepath + files[0], sep='|', header=None, index_col=None, dtype='str', skiprows=2, usecols=[2, 3, 10, 18, 17, 1])

    data[10] = pd.to_numeric(data[10])

Credits to: https://stackoverflow.com/a/28648923/10628703 (Alex Riley)
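The workaround above can be shown end-to-end with a small sketch (`io.StringIO` stands in for the real pipe-delimited file, and the column layout is simplified from the original):

```python
import io
import pandas as pd

raw = "header junk\nmore junk\n4|00123|7\n4|12345|9\n"

# Read every column as str so leading zeros survive, then convert
# only the truly numeric column back with to_numeric.
data = pd.read_csv(io.StringIO(raw), sep='|', header=None,
                   dtype='str', skiprows=2)
data[2] = pd.to_numeric(data[2])

print(data[1].tolist())  # ['00123', '12345']
print(data[2].dtype)     # int64
```

The tradeoff versus the accepted answer is that this reads all columns as strings and pays a second conversion pass, but it avoids having to know which positional index needs `dtype={2: str}` for each of the many daily files.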

Lunamax