
I have a several-gigabyte CSV file residing in Azure Data Lake. Using Dask, I can read this file in under a minute as follows:

>>> import dask.dataframe as dd
>>> adl_path = 'adl://...'
>>> df = dd.read_csv(adl_path, storage_options={...})
>>> len(df.compute())  # forces the full read

However, I don't want to read this into a Dask or Pandas DataFrame -- I want direct access to the underlying file. (Currently it's CSV, but I'd also like to be able to handle Parquet files.) So I am also trying to use adlfs 0.2.0:

>>> import fsspec
>>> adl = fsspec.filesystem('adl', store_name='...', tenant_id=...)
>>> lines = 0
>>> with adl.open(adl_path) as fh:
...     for line in fh:
...         lines += 1

In the time the Dask job takes to process the entire file, this method reads only 0.1% of the input.
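One thing I've considered (an assumption on my part, not something I've verified): plain `adl.open` uses fsspec's default read-ahead block size, so a sequential scan may issue many small remote requests, whereas Dask fetches large blocks in parallel. `open` accepts a `block_size` argument; a sketch, where the 32 MiB figure is an arbitrary guess:

>>> # Sketch: larger read-ahead blocks to cut down on remote round-trips
>>> with adl.open(adl_path, block_size=32 * 2**20) as fh:
...     lines = sum(1 for _ in fh)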

I've tried using fsspec's caching, thinking that this would speed up access after the initial caching is done:

>>> fs = fsspec.filesystem("filecache", target_protocol='adl', target_options={...}, cache_storage='/tmp/files/')
>>> fs.exists(adl_path) # False
>>> fs.size(adl_path) # FileNotFoundError

>>> # Using an unqualified path instead of the fully-qualified (FQ) 'adl://...' path:
>>> abs_adl_path = 'absolute/path/to/my/file.csv'
>>> fs.exists(abs_adl_path) # True
>>> fs.size(abs_adl_path) # 1234567890 -- correct size in bytes
>>> fs.get(abs_adl_path, local_path) # FileNotFoundError
>>> handle = fs.open(abs_adl_path) # FileNotFoundError

Is there a performant way to read CSVs (and, eventually, Parquet files) remotely through a normal Python file handle, without loading them into a Dask DataFrame first?
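For the Parquet case specifically, my understanding (unverified) is that pyarrow's `read_table` accepts a file-like object, so an fsspec handle should plug straight in; a sketch, where the path is a placeholder:

>>> import pyarrow.parquet as pq
>>> # Sketch: 'container/path/file.parquet' is a placeholder path
>>> with adl.open('container/path/file.parquet') as fh:
...     table = pq.read_table(fh)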

– user655321

1 Answer


I do not know why `fs.get` doesn't work, but please try this for the final line:

handle = fs.open(adl_path)

i.e., you open the original remote path, but the handle you get back points to a local copy of the file (once the download is done) somewhere in '/tmp/files/'.
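To spell out the whole pattern (a sketch that reuses the elided options from the question; nothing new beyond the line above):

>>> fs = fsspec.filesystem("filecache", target_protocol='adl',
...                        target_options={...}, cache_storage='/tmp/files/')
>>> with fs.open(adl_path) as fh:   # first open downloads the file to /tmp/files/
...     header = fh.readline()
>>> with fs.open(adl_path) as fh:   # later opens read the cached local copy
...     lines = sum(1 for _ in fh)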

– mdurant
  • Using the fully-qualified name (i.e., `adl://...`) also doesn't work. I use an unqualified, absolute path in the other examples only because they _do_ work. `.exists` and `.size` both work with the absolute path but not the FQ path. The FQ path doesn't work for any methods I've tried. – user655321 Mar 12 '20 at 16:16
  • https://github.com/intake/filesystem_spec/pull/245 - please try to install from master after this is merged (shortly) – mdurant Mar 12 '20 at 18:07
  • I installed from master. Curiously, I now get `fs.size(abs_adl_path) == 4` and `next(fsspec.filesystem("filecache", target_protocol='adl', target_options={...}).open(abs_adl_file)) == b'test'`. Not sure where this `test` value is coming from - it's not in my file. – user655321 Mar 13 '20 at 18:01
  • Maybe old stuff in /tmp/files? – mdurant Mar 13 '20 at 18:38
  • There were two files in /tmp/files: `cache` contained some encoded data including the path to my file. (I assume "test" was in there.) There was also another file that contained exactly "test". However, if I delete those files and re-run, they are re-created. – user655321 Mar 13 '20 at 18:47
  • I don't know :| `adlfs.open(abs_adl_file).read()` is different? You will need some pdb, I think. – mdurant Mar 13 '20 at 18:54
  • I blew away the old file, created a new one, and that works fine: initial use of a file caches and completes in a bit over a minute. Subsequent processing is only a few seconds. Thanks! – user655321 Mar 13 '20 at 19:22