
I have been dealing with this problem for a week. I use the command:

from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")

and I get "invalid parquet magic". However, ddf.read_parquet works fine with "webhdfs://".

I would like ddf.read_parquet to work over HTTP, because I want to use it in a dask-ssh cluster whose workers have no HDFS access.


1 Answer


Although the comments already partly answer this question, I thought I would add some information as an answer.

  • HTTP(S) is supported by dask (actually fsspec) as a backend filesystem; but to get partitioning within a file, you need the size of that file, and to resolve globs, you need to be able to get a list of links, neither of which is necessarily provided by any given server (see the sketch after this list)
  • webHDFS (or indeed httpFS) doesn't work like a plain HTTP download; you need to use a specific API to open a file and fetch a final URL on a cluster member for that file, so the two methods are not interchangeable
  • webHDFS is normally intended for use outside of the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"). However, kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
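
A minimal sketch of both routes, assuming dask with fsspec installed; the host, port, user, and URLs below are hypothetical, and the plain-HTTP call only works if the server behaves as described above:

from dask import dataframe as ddf

# webHDFS speaks its own REST API (an "open" call that redirects to a
# URL on a datanode), so use the dedicated protocol rather than a raw
# HTTP URL:
df = ddf.read_parquet(
    "webhdfs://namenode:9870/user/me/data.parquet",
    storage_options={"user": "me"},
)

# Plain HTTP(S) can serve a single parquet file, but only if the server
# reports the file size (Content-Length) so that fsspec can do range
# reads; globbing additionally needs a listable page of links:
df2 = ddf.read_parquet("https://example.com/data/part.0.parquet")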
– mdurant
  • Wow! I am only a Python user; I thought it would be easy if I studied the HTTP server some more. Do the multiple part files of a dataset (part.0.parquet, part.1.parquet, ...) cause the problem, if I understood correctly? I don't know how the partitions within a part.0.parquet are managed by a parquet engine like fastparquet. If I run .to_parquet(write_meta=False) to get one large self-contained file part.0.parquet, is it still possible to read it from the HTTP server? Thanks to the maintainers of the fastparquet library: I found a missing pyarrow requirement (lz4) on the project website. – Yousef Oleyaeimotlagh Jun 01 '20 at 20:33
  • Simply: webHDFS just doesn't work in a way that lets you use the HTTP file-system; it's more specialist, sorry. – mdurant Jun 01 '20 at 20:52
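
For the single-file case raised in the comment above, here is a minimal sketch, assuming a plain HTTP server (not webHDFS) that serves the written files directly and supports range requests; the output path and URL are hypothetical, and write_metadata_file is dask's keyword for skipping the _metadata sidecar (the comment's "write_meta" presumably refers to this option):

import pandas as pd
from dask import dataframe as ddf

# Write a toy frame as one self-contained part file, without the
# _metadata sidecar:
df = ddf.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=1)
df.to_parquet("out/", write_metadata_file=False)

# If a plain HTTP server then serves out/ and answers HEAD/range
# requests, the single file can be read back directly:
df2 = ddf.read_parquet("https://example.com/out/part.0.parquet")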