
I need to scan a CSV by column and find the strongest data type per column, then apply it to the entire column.

For example, if I had a CSV that looked like so (yes, I know there are no commas...):

    + C1 + C2 + C3 + C4
R1  | i  | s  | i  | f
R2  | i  | f  | i  | i
R3  | i  | i  | s  | f

# i = int
# f = float
# s = str

The "strongest" type for C1 would be i, C2 would be s, C3 would be s and C4 would be f.

It follows that order of "strength" is str > float > int.

Why? Because the file type that I am writing these values to explicitly requires the data type specified for a field (its columns) to match that data type (i.e. if the field is set to FLOAT, I can't put a str in that column, otherwise the file is invalid).

To accomplish this I am doing the following:

  1. For each file, read the file row by row and check each column; store "strongest" type
  2. Create a new container that holds the newly type-cast rows

Item 2 is very simple to do with a dictionary and list comprehension:

types = {header: None for header in r.fieldnames}
# read file and store "strongest" found in 'types[header]' per column
# ...
typed = [[types[header](row[header]) for header in types] for row in rows]
# note: each types[header] value is the type object itself, used as a callable (i.e. int, not int())
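
For example, with a hypothetical already-resolved types mapping and two hypothetical rows (the names here are made up for illustration):

types = {'a': int, 'b': float, 'c': str}
rows = [{'a': '1', 'b': '2.5', 'c': 'x'},
        {'a': '3', 'b': '4.0', 'c': 'y'}]
typed = [[types[header](row[header]) for header in types] for row in rows]
# typed == [[1, 2.5, 'x'], [3, 4.0, 'y']]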

Item 1 is where most of the heavy lifting takes place:

for row in r: # r is a csv.DictReader
    rows.append(row) # list of OrderedDicts since r is a generator
    # problematic because I have to keep checking just to append...
    if all(types[header] is str for header in types):
        continue # all 'str' so stop checking

    for header in types:
        if types[header] is str:
            continue # whole column can be bypassed from now on

        # find_type() tries int and float casts on the string, falling back via ValueError
        t = self.find_type(row[header])
        if (types[header] is int) and (t is float):
            types[header] = t # float > int since all int's can be represented as float
        elif (types[header] is float) and (t is int):
            pass # int < float so do nothing
        else:
            types[header] = t # covers the initial None; if t is str it is caught by the all() check above
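
For reference, find_type() is not shown here; a minimal sketch of what the comment above describes (try int, then float, falling back to str via ValueError) might look like:

def find_type(self, value):
    # hypothetical reconstruction: try the strictest cast first,
    # widening on each ValueError
    try:
        int(value)
        return int
    except ValueError:
        pass
    try:
        float(value)
        return float
    except ValueError:
        return str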

The worst case for this is scanning every row in the CSV, because the last row could be the first to contain a str value.

Is there a more efficient way to do this, perhaps with pandas (which I don't currently use much)?

Solution:

from numpy import issubdtype
from numpy import integer
from numpy import floating
from pandas import read_csv
from shapefile import Writer # PyShp library

df = read_csv('/some/file', low_memory=False)

rows = df.values.tolist() # fastest access over df.iterrows()

w = Writer(1) # shapeType 1 = POINT, matching the w.point() calls below

# This is the core of the question
# I can access df[col].dtype, but I didn't want to compare dtype strings (str == str)
# If this can be done better than the subdtype check, let me know
for col in df:
    if issubdtype(df[col], integer):
        w.field(col, 'N', 20, 0)
    elif issubdtype(df[col], floating):
        w.field(col, 'F', 20, 10)
    else:
        w.field(col, 'C', 40, 0)

# Alternatively (1):
# from numpy import int64
# from numpy import float64
# for col in df:
#     if df[col].dtype.type is int64: 
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype.type is float64:
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)

# Alternatively (2):
# Don't import numpy directly in namespace
# for col in df:
#     if df[col].dtype == 'int64': 
#         w.field(col, 'N', 20, 0)
#     elif df[col].dtype == 'float64':
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)
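
# Alternatively (3):
# A sketch, assuming a pandas version that ships pandas.api.types (>= 0.19);
# this avoids importing numpy and avoids dtype string comparison entirely
# from pandas.api.types import is_integer_dtype, is_float_dtype
# for col in df:
#     if is_integer_dtype(df[col]):
#         w.field(col, 'N', 20, 0)
#     elif is_float_dtype(df[col]):
#         w.field(col, 'F', 20, 10)
#     else:
#         w.field(col, 'C', 40, 0)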


lon = df.columns.get_loc('LON')
lat = df.columns.get_loc('LAT')

for row in rows:
    w.point(row[lon], row[lat])
    w.record(*row)

w.save('/some/outfile')
pstatix
  • Umm.... shouldn't `continue # all 'str' so stop checking` be `break` so you do actually stop checking? – Jon Clements Apr 25 '18 at 14:28
  • for this to be considered a CSV, there has to be a unique separator in all rows. In your headers you have '+' as the separator, whereas in the other rows it is '|'. Neither pandas nor the csv module will be able to read this as is – akshat Apr 25 '18 at 14:29
  • @JonClements Yes, it should, but that's a flaw I have right now. Otherwise `rows` will not contain the remaining rows from the `DictReader`. So I simply skip the headers check and keep copying the yielded `OrderedDicts`. – pstatix Apr 25 '18 at 14:29
  • @akshat It is a CSV, that is just a visual representation. Hence my comment in the body `"(yes, I know there are no commas...)"`. If I threw the file into Excel to visualize it, there would be no (visual) unique separators. It's just an aid to help you see my problem set. – pstatix Apr 25 '18 at 14:30
  • how big is your csv? – Patrick Artner Apr 25 '18 at 14:31
  • @PatrickArtner Sometimes 300 rows, sometimes 1.5 million. – pstatix Apr 25 '18 at 14:31
  • @JonClements If `csv.DictReader` supported indexing I would happily replace that `continue` and preface it with `rows.extend(r[i:])`, but the object doesn't support `len()` to run something like `for i in range(len(r))`. If you try to convert to `list()`, you just get the first element of each index in the object. Does me no good! – pstatix Apr 25 '18 at 14:33
  • @pstatix well - you just do `rows.extend(r)` to consume the remainder of the file then `break` no ? – Jon Clements Apr 25 '18 at 14:35
  • @JonClements I had no idea that `extend()` would exhaust the generator...the only draw I still have with that is I have to check it each time (could be a lot of calls to `all()`). – pstatix Apr 25 '18 at 14:40

1 Answer


An example DataFrame:

In [11]: df
Out[11]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0
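
(For reference, a frame like this can be built with the mixed columns held as strings; a hypothetical construction:)

In [10]: df = pd.DataFrame({'C1': [1, 2, 3],
    ...:                    'C2': ['a', '4.', '5'],
    ...:                    'C3': ['6', '7', 'b'],
    ...:                    'C4': [8.0, 9.0, 10.0]},
    ...:                   index=['R1', 'R2', 'R3'])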

I wouldn't try and be clever with any short-circuit evaluation. I'd just take the type of every entry:

In [12]: df_types = df.applymap(type)

In [13]: df_types
Out[13]:
               C1             C2             C3               C4
R1  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R2  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R3  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>

If you enumerate these types you can use max:

In [14]: d = {ch: i for i, ch in enumerate([int, float, str])}

In [15]: d_inv = {i: ch for i, ch in enumerate([int, float, str])}

In [16]: df_types.applymap(d.get)
Out[16]:
    C1  C2  C3  C4
R1   0   2   2   1
R2   0   2   2   1
R3   0   2   2   1

In [17]: df_types.applymap(d.get).max()
Out[17]:
C1    0
C2    2
C3    2
C4    1
dtype: int64

In [18]: df_types.applymap(d.get).max().apply(d_inv.get)
Out[18]:
C1      <class 'int'>
C2      <class 'str'>
C3      <class 'str'>
C4    <class 'float'>
dtype: object

Now, you can iterate through each column and update it in df (to the max type):

In [21]: for col, typ in df_types.applymap(d.get).max().apply(d_inv.get).iteritems():
             df[col] = df[col].astype(typ)


In [22]: df
Out[22]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0

In [23]: df.dtypes
Out[23]:
C1      int64
C2     object
C3     object
C4    float64
dtype: object

This could be made slightly more efficient if you have many columns, by grouping the columns by type and updating them in batches (e.g. all the string columns at once).
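
For instance, a sketch of that batching, reusing the df_types, d and d_inv names from above (the defaultdict just collects the column labels per resolved type):

In [24]: from collections import defaultdict

In [25]: col_types = df_types.applymap(d.get).max().apply(d_inv.get)

In [26]: by_type = defaultdict(list)

In [27]: for col, typ in col_types.items():
             by_type[typ].append(col)

In [28]: for typ, cols in by_type.items():
             df[cols] = df[cols].astype(typ)  # one astype call per type group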

Andy Hayden
  • Although actually in pandas, when you read this in, it will convert `int` to `int64` columns and `float` to `float64`, so you only have to worry about the `object` columns – Andy Hayden Apr 25 '18 at 14:44
  • The OP's source data is CSV... so I imagine just using `.read_csv(..., low_memory=False)` and then accessing the resulting frame's `.dtypes` and considering `object` as `str` would be fine... – Jon Clements Apr 25 '18 at 14:46
  • My limited pandas experience may hinder me here. But a CSV is nothing but string values when read in using the `csv` module. So a row that looks like `1, 2, 3, 4.5, abc, 1.1` is really `'1', '2', '3', '4.5', 'abc', '1.1'`. Does pandas _automagically_ type test upon `.read_csv()`? – pstatix Apr 25 '18 at 14:55
  • @pstatix pandas' read_csv tries to infer types - similar to what you're doing by hand... In fact - it will apply the same precedence as you are... anything that's a string remains as a string and floats trump ints. – Jon Clements Apr 25 '18 at 15:03
  • One thing it won't do is convert `[1, 2, 'c']` into `['1', '2', 'c']` i.e. object columns won't be converted to string necessarily. – Andy Hayden Apr 25 '18 at 15:09
  • @AndyHayden That's fine, I can check the type and map the conversion myself later. It's just the time spent checking for the "strongest" type that I wanted to resolve. – pstatix Apr 25 '18 at 15:13
  • @JonClements Interestingly enough, this is about 20 seconds slower than my program with its optimizations. I found this to be about 20% of the number of lines I used, but was surprised it is slower. – pstatix Apr 25 '18 at 17:38
  • @pstatix can't say I'm massively surprised... There's going to be overhead involved when you're bringing a DataFrame into the picture vs a list of dicts... – Jon Clements Apr 25 '18 at 17:48
  • @JonClements The whole point was to find a better way! While this solution is very graceful, I don't really need the DataFrame objects, so the overhead isn't worth it. If it were monumentally faster (30% or so) I would use it, but the overhead of mapping the appropriate types by column is currently faster by comprehension. – pstatix Apr 25 '18 at 17:50
  • @pstatix what code are you running anyway... You literally only need two lines - a pd.read_csv and then access the resulting frame's dtypes attribute... – Jon Clements Apr 25 '18 at 17:52
  • @JonClements Yes I use `pd.read_csv`, I had then been using `df[col] = df[col].astype(typ)` to map things like `float64` to `float` for use in the rest of my code. This is because `isinstance(dtype, float)` would return false until after `astype`, which converted to `numpy.float64`, which is a subclass of `float`. – pstatix Apr 25 '18 at 18:00
  • @pstatix err... So it sounds like you're doing a lot more work than your example code (not to mention pandas has already coerced your columns into types anyway). Not quite sure why you're then reassigning those columns as other types again? – Jon Clements Apr 25 '18 at 18:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169791/discussion-between-pstatix-and-jon-clements). – pstatix Apr 25 '18 at 18:08
  • @AndyHayden Only draw is when I try to do something like `isinstance(df['UNQ'].dtype, float)`, returns false. `df['UNQ'].dtype` returns `dtype('float64')`. It appears that is a `str` rather than the type, which is a `pandas.core.series.Series`. – pstatix Apr 25 '18 at 18:58
  • @JonClements Added solution, take a look at comment block in snippet – pstatix Apr 26 '18 at 15:02