I am trying to do some file operations on HDFS directly within a PySpark script. In particular I want to:
- check whether a path or a file exists (org.apache.hadoop.fs.FileSystem) --> ok
- create/delete folders (org.apache.hadoop.fs.FileSystem) --> ok
- move files from one path to another (org.apache.hadoop.fs.FileUtil) --> FAILS - WHY?
So my issue is only with the use of the last class.
My code:
from pyspark.sql import SparkSession

# get all the JVM objects
spark = SparkSession.builder.getOrCreate()
hadoopPath = spark._jvm.org.apache.hadoop.fs.Path
hadoopConfiguration = spark._jsc.hadoopConfiguration()
hadoopFs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoopConfiguration)
# this import does not throw an error so far
hadoopFu = spark._jvm.org.apache.hadoop.fs.FileUtil
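As far as I understand py4j, these lookups are lazy, so a typo would only surface on the first call. To rule that out, I checked that the name really resolved to a class (a sketch based on my understanding of py4j; FileUtilTypo is a deliberately bogus name):
from py4j.java_gateway import JavaClass, JavaPackage
# a resolved class comes back as JavaClass ...
print(isinstance(hadoopFu, JavaClass))  # True -> the class was found
# ... while an unknown name silently becomes a JavaPackage
print(isinstance(spark._jvm.org.apache.hadoop.fs.FileUtilTypo, JavaPackage))  # True -> not a real class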
The following calls work like a charm:
path1 = "hdfs://mycluster/somefolder"
path2 = "hdfs://mycluster/newfolder"
# check if the folder exists and contains parquet files (returns a list with or without files)
hadoopFs.globStatus(hadoopPath(path1 + "/*.parquet"))
# alternatively, check only if the path exists (returns true/false)
hadoopFs.exists(hadoopPath(path1))
# create a folder on HDFS
hadoopFs.mkdirs(hadoopPath(path2))
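(As an aside: I know I could probably move a path with FileSystem.rename instead, sketched below and untested on my side, but I would still like to understand the FileUtil failure.)
# untested sketch: moving via FileSystem.rename instead of FileUtil
hadoopFs.rename(hadoopPath(path1), hadoopPath(path2))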
But as soon as I want to access methods from FileUtil, they are not found. I do not understand why; they should be part of the standard Hadoop/Spark libraries:
https://hadoop.apache.org/docs/r3.3.5/api/org/apache/hadoop/fs/FileUtil.html
The following commands are failing
hadoopFu.list(path1)
Py4JError: An error occurred while calling z:org.apache.hadoop.fs.FileUtil.list. Trace:
py4j.Py4JException: Method list([class java.lang.String]) does not exist
also
hadoopFu.copy(path1,path2)
Py4JError: An error occurred while calling z:org.apache.hadoop.fs.FileUtil.copy. Trace:
py4j.Py4JException: Method copy([class java.lang.String, class java.lang.String]) does not exist
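Looking at the signatures in the javadoc linked above, my guess is that py4j matches static methods by the Java classes of the arguments, and none of the copy/list overloads take plain strings: copy seems to want (FileSystem, Path, FileSystem, Path, boolean, Configuration), and list seems to want a java.io.File, i.e. a local path. So would a call like the following sketch be the correct form (my assumption, untested)?
# untested sketch: arguments matching the documented Java signature
# copy(FileSystem srcFS, Path src, FileSystem dstFS, Path dst, boolean deleteSource, Configuration conf)
hadoopFu.copy(hadoopFs, hadoopPath(path1), hadoopFs, hadoopPath(path2), False, hadoopConfiguration)
And if list really takes a java.io.File, I assume it cannot be used for HDFS paths at all?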
Why are they not found? I don't understand what I am doing wrong. Important: I want to do this within the Python/PySpark wrapper.
Thanks, Alex