Reading CSV or Parquet files from the local filesystem is easy, but Arrow does not seem to support reading files from a remote server given its IP address. Is there a way to achieve this, e.g. to read a subset of the columns of a Parquet file on a remote server (with a path like "ip://path/to/remote/file")? Thanks.

Raining.
  • If your remote server exposes an S3 compatible API then you can use the S3 filesystem. – Pace Feb 17 '22 at 19:11

2 Answers

pyarrow.dataset.dataset() has a filesystem argument through which it supports many remote file systems.

See the Arrow documentation for file systems. An fsspec file system can also be passed in, of which there are very many.

For example, if your Parquet file is sitting on a web server, you could use the fsspec HTTP file system:

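import pyarrow.dataset as ds
import fsspec.implementations.http

# fsspec's HTTP file system, passed to pyarrow through the filesystem argument
http = fsspec.implementations.http.HTTPFileSystem()
# Open the remote Parquet file as a dataset
d = ds.dataset('http://localhost:8000/test.parquet', filesystem=http)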
Daniel Darabos

There is an open issue for this if you would like to contribute or follow development: https://issues.apache.org/jira/browse/ARROW-7594

(By 'remote server' I assume you mean over HTTP(s) or similar. If you're looking for a custom client-server protocol, check out Arrow Flight.)

li.davidm
  • Thanks! Yes, it can be over HTTP(S), with the file (CSV/Parquet) converted into Arrow on the client side, unlike Flight, which does this on the file-server side. For Parquet this can potentially reduce data transfer, as the data stays compressed during transfer. – Raining. Feb 18 '22 at 03:49
  • It can be done through the `fsspec` HTTP file system. (I've added an example for this.) Do you know if there is any advantage in implementing ARROW-7594 over using `fsspec`? – Daniel Darabos Sep 06 '22 at 06:32
  • A 'native' Arrow filesystem doesn't have to call back into Python. For things like Dataset which do parallel I/O, this can avoid potential bottlenecking on the GIL. (Also, as an Arrow maintainer, a 'native' filesystem would be accessible to R as well.) – li.davidm Sep 06 '22 at 11:36