I need to scan a CSV by column, find the strongest data type per column, and then apply that type to the entire column.
For example, if I had a CSV that looked like this (yes, I know there are no commas...):
   + C1 + C2 + C3 + C4
R1 | i  | s  | i  | f
R2 | i  | f  | i  | i
R3 | i  | i  | s  | f
# i = int
# f = float
# s = str
The "strongest" type for C1
would be i
, C2
would be s
, C3
would be s
and C4
would be f
.
It follows that the order of "strength" is str > float > int.
Why? Because the file format that I am writing these values to explicitly requires the data type specified for a field (its columns) to match that data type (i.e. if the field is set to FLOAT, I can't put a str in that column, otherwise the file is invalid).
To accomplish this I am doing the following:
1. For each file, read the file row by row and check each column; store the "strongest" type found per column
2. Create a new container that holds the newly type-cast rows
Item 2 is very simple to do with a dictionary and list comprehension:
types = {header: None for header in r.fieldnames}
# read file and store "strongest" found in 'types[header]' per column
# ...
typed = [[types[header](row[header]) for header in types] for row in rows]
# note: types[header] holds the type itself (e.g. int, not a call like int()), so it can be called to cast
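For example, assuming the scan left types as {'C1': int, 'C4': float} and rows holds the raw string values (hypothetical data, just to show the cast step):

types = {'C1': int, 'C4': float}
rows = [{'C1': '1', 'C4': '2.5'}, {'C1': '3', 'C4': '4'}]
typed = [[types[header](row[header]) for header in types] for row in rows]
# typed == [[1, 2.5], [3, 4.0]]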
Item 1 is where most of the heavy lifting takes place:
for row in r:  # r is a csv.DictReader
    rows.append(row)  # list of OrderedDicts since r is a generator
    # problematic because I have to keep checking just to append...
    if all(types[header] is str for header in types):
        continue  # all 'str' so stop checking
    for header in types:
        if types[header] is str:
            continue  # whole column can be bypassed from now on
        # function just type casts 'int' or 'float' on string by ValueError
        t = self.find_type(row[header])
        if (types[header] is int) and (t is float):
            types[header] = t  # float > int since all ints can be represented as floats
        elif (types[header] is float) and (t is int):
            pass  # int < float so do nothing
        else:
            types[header] = t  # if 'str' will be caught later by first if
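For reference, find_type isn't shown above; based on the comment about catching ValueError, a minimal sketch of what it might look like (assumed, not my exact implementation):

def find_type(self, value):
    # try the narrowest type first; fall back to str if nothing parses
    for caster in (int, float):
        try:
            caster(value)
            return caster
        except ValueError:
            pass
    return str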
The worst case for this is proportional to the number of rows in the CSV, because the very last row could be the first one to force a column to str.
Is there a more efficient way to do this, perhaps with pandas (which I don't currently use much)?
Solution:
from numpy import issubdtype
from numpy import integer
from numpy import floating
from pandas import read_csv
from shapefile import Writer  # PyShp library

df = read_csv('/some/file', low_memory=False)
rows = df.values.tolist()  # fastest access over df.iterrows()
w = Writer(5, True)
# This is the core of the question.
# I can access df[col].dtype, but I didn't want to compare dtype names as strings.
# If this can be done better than a subdtype check, let me know.
for col in df:
    if issubdtype(df[col].dtype, integer):
        w.field(col, 'N', 20, 0)
    elif issubdtype(df[col].dtype, floating):
        w.field(col, 'F', 20, 10)
    else:
        w.field(col, 'C', 40, 0)
# Alternatively (1):
# from numpy import int64
# from numpy import float64
# for col in df:
#     if df[col].dtype.type is int64:
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype.type is float64:
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)
# Alternatively (2):
# Don't import numpy directly into the namespace
# for col in df:
#     if df[col].dtype == 'int64':
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype == 'float64':
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)
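# Alternatively (3), a sketch using pandas' own dtype helpers instead of numpy
# (assumes pandas.api.types is available in the installed pandas version):
# from pandas.api.types import is_integer_dtype, is_float_dtype
# for col in df:
#     if is_integer_dtype(df[col]):
#         w.field(col, 'N', 20, 0)
#     elif is_float_dtype(df[col]):
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)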
lon = df.columns.get_loc('LON')
lat = df.columns.get_loc('LAT')
for row in rows:
    w.point(row[lon], row[lat])
    w.record(*row)
w.save('/some/outfile')