
I have a correct parquet file (I am 100% sure of it) and only one file in the directory v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/. I got this generic error, AnalysisException: Unable to infer schema ..., during the read operation; see the full error detail:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-26-5beebfd65378> in <module>
      1 #error
----> 2 new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
      3 new_DF.show()
      4 
      5 spark.close()

/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths, **options)
    299                        int96RebaseMode=int96RebaseMode)
    300 
--> 301         return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
    302 
    303     def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,

/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
-> 1322             answer, self.gateway_client, self.target_id, self.name)
   1323 
   1324         for temp_arg in temp_args:

/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

I used this code:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()

The strange thing is that it worked correctly when I used the full path to the parquet file:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/")
new_DF.show()

Have you had a similar issue?

JIST
    it seems that not all parquet files under v3io://projects/risk/FeatureStore/ptp/parquet/ have the same schema – Abdennacer Lachiheb Apr 14 '23 at 21:43
  • I would agree with @AbdennacerLachiheb, this is an error that you are likely to get if not all files have the same schema. Check the files again. – Zer0k Apr 14 '23 at 23:00
  • There is only one file with valid content and it is strange – JIST Apr 15 '23 at 01:38

3 Answers


The error is happening because the parquet file is not in the "v3io://projects/risk/FeatureStore/ptp/parquet/" folder itself, but in the "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/" subfolder.

This will work:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/*/*/*")
new_DF.show()

Each * matches one directory level, so */*/* descends through sets/ptp/1681296898546_70 down to the directory that actually contains the parquet files.
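
If you prefer not to rely on globbing, spark.read.parquet also accepts several explicit paths (its signature, parquet(self, *paths, **options), is visible in the traceback above). A minimal sketch, assuming you know the leaf directories:

# Alternative sketch: list the leaf directories explicitly instead of globbing
paths = [
    "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/",
    # add further run directories here if more appear later
]
new_DF = spark.read.parquet(*paths)
new_DF.show()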

For more info about mass reading files with spark.read, check out this question: Regex for date between start- and end-date

SamJ

Normally you should never see this error when reading a parquet file, because the schema is stored in the parquet file itself. In my experience, these are some reasons why you might see it:

  1. The parquet file you are trying to read is empty. For some strange reason, Spark throws this error when the parquet file is empty; however, this doesn't seem to be the problem here, since you do have a parquet file under "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/"

  2. You may have parquet files with different schemas under the path you are trying to read, for example:

    /path/
         partition=value1/
             part-000...snappy.parquet
         partition=value2/
             part-000...snappy.parquet 
    

    If the two parquet files don't have the same schema, you will get this error.

  3. Under the path "v3io://projects/risk/FeatureStore/ptp/parquet/", some of the subdirectories may contain files that are not parquet files.

Even though the cause may be one of the points listed above, it may also be another problem, because there isn't enough information about the files under "v3io://projects/risk/FeatureStore/ptp/parquet/". I suggest you add to your post the whole tree under that path, along with the schema of every parquet file.
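
To gather that information, here is a minimal sketch (assuming a running SparkSession named spark and using PySpark's internal _jvm/_jsc handles; the base path is taken from your question) that lists every file under the path through the Hadoop FileSystem API and prints the schema of each parquet file:

base = "v3io://projects/risk/FeatureStore/ptp/parquet/"

# Access the Hadoop FileSystem bound to this path via the JVM gateway
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI.create(base), conf)

# Recursively list all files; print each path and the schema of every parquet file
files = fs.listFiles(jvm.org.apache.hadoop.fs.Path(base), True)
while files.hasNext():
    path = files.next().getPath().toString()
    print(path)
    if path.endswith(".parquet"):
        spark.read.parquet(path).printSchema()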

Update

Looking more closely at your path, I can see the problem: this path

sets/ptp/1681296898546_70

should look like this:

Partition1=sets/Partition2=ptp/Partition3=1681296898546_70

In Spark you either point directly to your parquet data or to a partitioned directory; you cannot point to a path that contains neither parquet content nor partition directories.
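
For illustration, a minimal sketch (the column names and the /tmp output path are hypothetical) of how such a partitioned layout is produced and read back; writing with partitionBy creates the key=value directories that partition discovery expects:

# Hypothetical columns, just to illustrate the key=value directory layout
df = spark.createDataFrame(
    [("sets", "ptp", 1), ("sets", "ptp", 2)],
    ["partition1", "partition2", "value"],
)
df.write.mode("overwrite").partitionBy("partition1", "partition2").parquet("/tmp/partitioned_example")

# Reading the base path now works because the subdirectories follow the
# partition1=.../partition2=... naming convention
spark.read.parquet("/tmp/partitioned_example").show()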

Abdennacer Lachiheb
  • There is only one parquet file (with valid content) in the whole directory and its sub-directories (it is a strange situation). – JIST Apr 15 '23 at 14:32

This piece of code (with recursiveFileLookup=true) also helps to solve the issue:

new_DF=spark.read.option("recursiveFileLookup","true").parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()
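
One caveat: as far as I know, recursiveFileLookup disables partition discovery, so this is suitable here only because the intermediate directory names (sets/ptp/1681296898546_70) are not meant to be read back as partition columns.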
JIST