
I'm attempting to use Intake to catalog a CSV dataset. Intake's CSV driver uses the Dask implementation of read_csv, which in turn builds on the pandas implementation.

The issue I'm seeing is that the CSV files I'm loading don't have an index column, so Dask interprets the first column as the index and then shifts the column names to the right.

An example: [screenshot of the misaligned dataframe]

The datetime (dt) column is supposed to be the first column, but when the CSV is read it is interpreted as the index, and the column names are shifted and therefore offset from their proper place. I'm supplying the column-name list and dtype dictionary to the read_csv call.

As far as I can tell, if I were using pandas I would supply the index_col=False kwarg to fix this, as illustrated here, but Dask raises an intentional error: "Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead". This appears to be a parallelization limitation.
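
For reference, a minimal sketch of the two calls (the path and column list are the same ones used in the catalog further down; the Dask call fails with the error quoted above):

import pandas as pd
import dask.dataframe as dd

cols = ['dt', 'city', 'state', 'country', 'shape', 'duration_s',
        'duration_hm', 'comments', 'date_posted', 'latitude']

# pandas accepts index_col=False, which (per the answer linked above) stops it
# from promoting the leftover first column to the index
pdf = pd.read_csv('data/ufo_scrubbed.csv', header=0, names=cols, index_col=False)

# Dask rejects the same keyword with the ValueError quoted above
ddf = dd.read_csv('data/ufo_scrubbed.csv', header=0, names=cols, index_col=False)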

The suggested fix (using set_index('my-index')) isn't effective in this case, because it operates after the file has been read, by which point the column names are already offset. The main issue is that I can't accurately set the index column if its name is offset.

What is the best way, in Dask, to load a csv that doesn't explicitly have an index column such that the interpreted index column at least retains the specified column name?

More information:

The play dataset I'm using: https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv

The Intake catalog.yml file I'm using is the following:

name:
  intake-explore-catalog
metadata:
  version: 1
sources:
    ufo_sightings:
      description: data around ufo sightings
      driver: csv
      args:
        urlpath: "{{CATALOG_DIR}}/data/ufo_scrubbed.csv"
        csv_kwargs:
          header: 0
          names: ['dt', 'city', 'state', 'country', 'shape', 'duration_s', 'duration_hm', 'comments', 'date_posted', 'latitude']
          dtype: {'dt': 'str', 'city': 'str', 'state': 'str', 'country': 'str', 'shape': 'str', 'duration_s': 'str', 'duration_hm': 'str', 'comments': 'str', 'date_posted': 'str', 'latitude': 'str'}
          infer_datetime_format: true
      metadata:
        version: 1
        custom_field: blah

I'm loading the catalog and corresponding dataset using the following:

import intake

cat = intake.open_catalog("catalog.yml")
ufo_ds = cat.ufo_sightings.read()

This results in the dataframe shown above; a CSV copy of that data (written back out with to_csv) looks like this:

,dt,city,state,country,shape,duration_s,duration_hm,comments,date_posted,latitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit,4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.,12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 England,1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home,1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan,1/22/2004,21.4180556,-157.8036111

Compared to the original/raw data csv (no leading comma):

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude 
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester&#44 England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he&#39s still livi",4/27/2007,36.5950000,-82.1888889

Dask Invocation:

import dask.dataframe

df = dask.dataframe.read_csv('data/ufo_scrubbed.csv',
                             names=['dt',
                                    'city',
                                    'state',
                                    'country',
                                    'shape',
                                    'duration_s',
                                    'duration_hm',
                                    'comments',
                                    'date_posted',
                                    'latitude'],
                             dtype={'dt': 'str',
                                    'city': 'str',
                                    'state': 'str',
                                    'country': 'str',
                                    'shape': 'str',
                                    'duration_s': 'str',
                                    'duration_hm': 'str',
                                    'comments': 'str',
                                    'date_posted': 'str',
                                    'latitude': 'str'}
                             )
Brenton
    Can you please post the intake block you are using, and the equivalent `dd.read_csv` invocation? A snippet of your raw data would be nice too. – mdurant Dec 11 '20 at 17:19
    Sorry, ideally the CSV file contents, as copy/pastable text rather than an image - to make answerers' lives easy. – mdurant Dec 11 '20 at 18:26
  • I added a bit more information, Let me know if you're looking for more. In the resulting csv of the read in raw data shows the extra unnamed column causing the shift – Brenton Dec 11 '20 at 18:32
  • I also added the dask invocation for the read_csv function – Brenton Dec 11 '20 at 19:08

1 Answer


Unfortunately, the header line begins with a comma, which is why your column names are off by one. You would be best off fixing that, rather than working around it.
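
If fixing the file is an option, a one-off rewrite of the header line would do (a sketch, assuming the only defect is the stray leading comma and the file is small enough to rewrite in place):

# hypothetical one-off clean-up of the header row
path = 'file.csv'
with open(path) as f:
    lines = f.readlines()
lines[0] = lines[0].lstrip(',')  # drop the leading comma from the header
with open(path, 'w') as f:
    f.writelines(lines)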

However, you do not get an index automatically if you don't supply column names:

df = dask.dataframe.read_csv('file.csv', header=0)

Here the index is just a range (counting from 0 in each partition). You could then assign column names after the fact:

df2 = df.rename(columns=dict(zip(df.columns, df.columns[1:]), latitude='longitude')) 
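
With the comma-prefixed header, `df.columns` comes back as `['Unnamed: 0', 'dt', ..., 'latitude']` (pandas labels the empty first field `Unnamed: 0`), so the dict/zip shifts every label one position to the left and the explicit latitude='longitude' entry renames the final column; the result has dt first and longitude last.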

You would not be able to achieve this with an Intake prescription alone; you would have to grab the dataframe via to_dask() or read() (for dask or pandas output, respectively).
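
A minimal sketch of that workflow (assuming the catalog entry above is changed to drop the names/dtype arguments, so the labels come from the file's own header):

import intake

cat = intake.open_catalog("catalog.yml")

# to_dask() returns the lazy dask dataframe; read() would return a pandas frame
df = cat.ufo_sightings.to_dask()

# shift every label one place to the left and name the final column, as above
df2 = df.rename(columns=dict(zip(df.columns, df.columns[1:]), latitude='longitude'))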

mdurant
  • well, that's embarrassing that I didn't notice that. smh Thank you for taking the time to look into it! – Brenton Dec 12 '20 at 00:09
  • I took another look at the raw data; the header there doesn't appear to lead with a comma. That leading comma seems to get introduced by the `read_csv` function call. I think the issue is that the column name list I provided was not the same length as the number of columns in the data – Brenton Dec 14 '20 at 14:05
  • You mean read_csv followed by to_csv? If you can make a nice compact reproducer, this sounds like a bug you could post. – mdurant Dec 14 '20 at 14:26
  • Yes, that's the behavior I noticed going through this. That's a good point and I can definitely try to make a comprehensive bug report. Thanks again for your time and help – Brenton Dec 14 '20 at 18:23