
I have a correct parquet file (I am 100% sure of it) and only one file in the directory v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/. I got this generic error, AnalysisException: Unable to infer schema ..., during the read operation; see the full error detail:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-26-5beebfd65378> in <module>
      1 #error
----> 2 new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
      3 new_DF.show()
      4 
      5 spark.close()

/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths, **options)
    299                        int96RebaseMode=int96RebaseMode)
    300 
--> 301         return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
    302 
    303     def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,

/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
-> 1322             answer, self.gateway_client, self.target_id, self.name)
   1323 
   1324         for temp_arg in temp_args:

/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

I used this code:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()

The strange thing is that it worked correctly when I used the full path to the parquet file:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/")
new_DF.show()

Have you had a similar issue?

JIST
    it seems that not all parquet files under v3io://projects/risk/FeatureStore/ptp/parquet/ have the same schema – Abdennacer Lachiheb Apr 14 '23 at 21:43
  • I would agree with @AbdennacerLachiheb, this is an error that you are likely to get if not all files have the same schema. Check the files again. – Zer0k Apr 14 '23 at 23:00
  • There is only one file with valid content and it is strange – JIST Apr 15 '23 at 01:38

3 Answers


The error is happening because the parquet file is not in the "v3io://projects/risk/FeatureStore/ptp/parquet/" folder itself, but in the "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/" subfolder.

This will work:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/*/*/*")
new_DF.show()

Each * matches one directory level, so */*/* descends through sets/ptp/1681296898546_70 down to the directory that actually contains the parquet files.
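
If you prefer not to rely on globbing, spark.read.parquet also accepts several explicit paths (its signature, parquet(self, *paths, **options), is visible in the traceback above). A minimal sketch, assuming you know the leaf directories:

# Alternative sketch: list the leaf directories explicitly instead of globbing
paths = [
    "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/",
    # add further run directories here if more appear later
]
new_DF = spark.read.parquet(*paths)
new_DF.show()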

For more info about mass reading files with spark.read, check out this question: Regex for date between start- and end-date

SamJ

Normally you should never see this error when reading a parquet file, because the schema is stored in the parquet file itself. In my experience, these are some reasons why you might see it:

  1. The parquet file you are trying to read is empty. For some strange reason, Spark throws this error when the parquet file is empty; however, this doesn't seem to be the problem here, since you do have a parquet file under "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/"

  2. You may have parquet files with different schemas under the path you are trying to read, for example:

    /path/
         partition=value1/
             part-000...snappy.parquet
         partition=value2/
             part-000...snappy.parquet 
    

    If the two parquet files don't have the same schema, you will get this error.

  3. Under the path "v3io://projects/risk/FeatureStore/ptp/parquet/", some of the subdirectories may contain files that are not parquet files.

Even though the cause may be one of the points listed above, it may also be another problem, because there isn't enough information about the files under "v3io://projects/risk/FeatureStore/ptp/parquet/". I suggest you add to your post the whole tree under that path, along with the schema of every parquet file.
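
To gather that information, here is a minimal sketch (assuming a running SparkSession named spark and using PySpark's internal _jvm/_jsc handles; the base path is taken from your question) that lists every file under the path through the Hadoop FileSystem API and prints the schema of each parquet file:

base = "v3io://projects/risk/FeatureStore/ptp/parquet/"

# Access the Hadoop FileSystem bound to this path via the JVM gateway
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI.create(base), conf)

# Recursively list all files; print each path and the schema of every parquet file
files = fs.listFiles(jvm.org.apache.hadoop.fs.Path(base), True)
while files.hasNext():
    path = files.next().getPath().toString()
    print(path)
    if path.endswith(".parquet"):
        spark.read.parquet(path).printSchema()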

Update

Looking more closely at your path, I can see the problem: this path

sets/ptp/1681296898546_70

should look like this:

Partition1=sets/Partition2=ptp/Partition3=1681296898546_70

In Spark you either point directly to your parquet data or to a partitioned directory; you cannot point to a path that contains neither parquet content nor partition directories.
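
For illustration, a minimal sketch (the column names and the /tmp output path are hypothetical) of how such a partitioned layout is produced and read back; writing with partitionBy creates the key=value directories that partition discovery expects:

# Hypothetical columns, just to illustrate the key=value directory layout
df = spark.createDataFrame(
    [("sets", "ptp", 1), ("sets", "ptp", 2)],
    ["partition1", "partition2", "value"],
)
df.write.mode("overwrite").partitionBy("partition1", "partition2").parquet("/tmp/partitioned_example")

# Reading the base path now works because the subdirectories follow the
# partition1=.../partition2=... naming convention
spark.read.parquet("/tmp/partitioned_example").show()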

Abdennacer Lachiheb
  • There is only one parquet file (with valid content) in the whole directory and its sub-directories (it is a strange situation). – JIST Apr 15 '23 at 14:32

This piece of code (with recursiveFileLookup=true) also helps to solve the issue:

new_DF=spark.read.option("recursiveFileLookup","true").parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()
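
One caveat: as far as I know, recursiveFileLookup disables partition discovery, so this is suitable here only because the intermediate directory names (sets/ptp/1681296898546_70) are not meant to be read back as partition columns.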
JIST