We were previously on pyarrow 0.13.0 with Python 3.7.16, and we connected to HDFS through hdfs.connect. After Python was upgraded to 3.9.10 we had to upgrade pyarrow to 12.0.1, where hdfs.connect is deprecated. So we switched to fs.HadoopFileSystem, set up CLASSPATH and ARROW_LIBHDFS_DIR, and tried reading parquet files from an HDFS directory, but we get the error: Error creating dataset. Could not read schema from 'hdfs_file_location'. Is this a parquet file? Opening hdfs file failed.
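For reference, the old code was roughly the following (a minimal sketch of the now-deprecated legacy API; host, port, and hdfs_path are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# Legacy connection (worked on pyarrow 0.13.0, deprecated in later releases)
fs = pa.hdfs.connect(host=host, port=port)
table = pq.read_table(hdfs_path, filesystem=fs)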
Current code (pyarrow 12.0.1):
import os
import subprocess

import pyarrow.fs
import pyarrow.parquet as pq

# Build the Hadoop classpath that libhdfs needs at runtime
hadoop_bin = os.path.normpath(os.environ['HADOOP_HOME'])
hadoop_bin_exe = os.path.join(hadoop_bin, 'bin', 'hadoop')
CLASSPATH = subprocess.check_output([hadoop_bin_exe, 'classpath', '--glob'])
os.environ['CLASSPATH'] = CLASSPATH.decode('utf-8')
os.environ['ARROW_LIBHDFS_DIR'] = 'PATH'  # directory containing libhdfs.so

# New-style filesystem API (replaces the deprecated hdfs.connect)
hdfs = pyarrow.fs.HadoopFileSystem(host=host, port=10000)
df_demo = pq.read_table(hdfs_path, filesystem=hdfs)
df_demo.to_pandas()
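To rule out a pure connectivity problem before blaming the parquet read, the directory can be listed first through the same filesystem object (a small sketch using pyarrow's stock fs API; hdfs_path is the same directory as above):

import pyarrow.fs

# List the target directory to confirm libhdfs can reach it
# before attempting to read any parquet schema.
selector = pyarrow.fs.FileSelector(hdfs_path, recursive=False)
for info in hdfs.get_file_info(selector):
    print(info.path, info.type, info.size)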
In the pyspark debug shell the code works fine, but it fails when submitted with spark-submit in cluster mode. Please assist.
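One difference we suspect: in cluster mode the executors may not inherit CLASSPATH and ARROW_LIBHDFS_DIR from the driver's shell environment. A minimal sketch of forwarding them via Spark's standard spark.executorEnv.* settings (this is an assumption, not a verified fix; 'PATH' is the same placeholder directory as above):

import os
from pyspark.sql import SparkSession

# Assumption: executors need the same libhdfs environment as the driver,
# so forward both variables explicitly when building the session.
spark = (
    SparkSession.builder
    .appName('read-parquet-from-hdfs')
    .config('spark.executorEnv.ARROW_LIBHDFS_DIR', 'PATH')
    .config('spark.executorEnv.CLASSPATH', os.environ['CLASSPATH'])
    .getOrCreate()
)

The full log from the failing run: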
/opt/cloudera/parcels/CDH-7.1.8-1.cdh7.1.8.p21.38608716/lib/hadoop/bin/hadoop
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/SPARK3-3.3.0.3.3.7180.9-1-1.p0.36726587/lib/spark/jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.8-1.cdh7.1.8.p21.38608716/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
connection with hdfs established successfully
23/07/03 09:26:11 WARN ipc.Client: [main]: Exception encountered while connecting to the server:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
hdfsOpenFile(/haas/drm/db/drm_dev1/datafiles/drmodel/hourly_dr27/part-00000-1fdcd16b-88c0-4e80-9a04-573d47ecdacb-c000.snappy.parquet): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
java.io.IOException: java.security.InvalidKeyException: Illegal key size
    at org.apache.hadoop.crypto.JceAesCtrCryptoCodec$JceAesCtrCipher.init(JceAesCtrCryptoCodec.java:116)
    at org.apache.hadoop.crypto.CryptoInputStream.updateDecryptor(CryptoInputStream.java:301)
    at org.apache.hadoop.crypto.CryptoInputStream.resetStreamOffset(CryptoInputStream.java:314)
    at org.apache.hadoop.crypto.CryptoInputStream.<init>(CryptoInputStream.java:139)
    at org.apache.hadoop.crypto.CryptoInputStream.<init>(CryptoInputStream.java:120)
    at org.apache.hadoop.crypto.CryptoInputStream.<init>(CryptoInputStream.java:144)
    at org.apache.hadoop.hdfs.HdfsKMSUtil.createWrappedInputStream(HdfsKMSUtil.java:198)
    at org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:947)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:341)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:335)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:352)
Caused by: java.security.InvalidKeyException: Illegal key size
    at javax.crypto.Cipher.checkCryptoPerm(Cipher.java:1039)
    at javax.crypto.Cipher.implInit(Cipher.java:805)
    at javax.crypto.Cipher.chooseProvider(Cipher.java:864)
    at javax.crypto.Cipher.init(Cipher.java:1396)
    at javax.crypto.Cipher.init(Cipher.java:1327)
    at org.apache.hadoop.crypto.JceAesCtrCryptoCodec$JceAesCtrCipher.init(JceAesCtrCryptoCodec.java:113)

exception reading parquet: [Errno 255] Error creating dataset. Could not read schema from '/haas/drm/db/drm_dev1/datafiles/drmodel/hourly_dr27/part-00000-1fdcd16b-88c0-4e80-9a04-573d47ecdacb-c000.snappy.parquet'. Is this a parquet file?: Opening HDFS file '/haas/drm/db/drm_dev1/datafiles/drmodel/hourly_dr27/part-00000-1fdcd16b-88c0-4e80-9a04-573d47ecdacb-c000.snappy.parquet' failed. Detail: [errno 255] Unknown error 255