
I'm executing the following code to convert a PySpark DataFrame into a pandas DataFrame:

dt = '2022-03-22'
sample_df = spark.sql(f'''select * from orders where order_date = '{dt}' limit 10''')
sample_df.toPandas()

but it throws the following error:

File ~/conda/envs/custom_env/lib/python3.9/site-packages/pandas/_libs/tslibs/timezones.pyx:134, in pandas._libs.tslibs.timezones.maybe_get_tz()

File ~/conda/envs/custom_env/lib/python3.9/site-packages/pytz/__init__.py:188, in timezone(zone)
    186             fp.close()
    187     else:
--> 188         raise UnknownTimeZoneError(zone)
    190 return _tzinfo_cache[zone]

UnknownTimeZoneError: 'IST'

Can anyone please explain what is going on here or suggest a resolution?

The query itself works when I don't convert the DataFrame to pandas; for example, sample_df.show() displays the result fine.

mazaneicha
Kunal Sawant

1 Answer


I think it's because order_date is of type datetime64, which isn't supported by PySpark DataFrames. The only supported date types in PySpark are Timestamp and Date. Convert that column to one of them, and then you can convert the DataFrame to pandas. (Here is the link for Timestamp conversion in Spark: pyspark.pandas.to_datetime)
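As a rough sketch of that conversion: pyspark.pandas.to_datetime mirrors the plain pandas.to_datetime API, so the pandas version below shows the same idea. The sample rows are hypothetical, only the column name order_date is taken from the question.

```python
import pandas as pd

# Hypothetical rows standing in for the question's `orders` table
orders = pd.DataFrame({"order_date": ["2022-03-22", "2022-03-23"]})

# Convert the string column to a proper datetime column
orders["order_date"] = pd.to_datetime(orders["order_date"])
print(orders["order_date"].dtype)  # datetime64[ns]
```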

There may be other workarounds to this issue, such as:

  1. Use the latest Spark (3.4.1)
  2. Use Pandas version 1.5.3

You can read more in this related Q&A here.

Habib Karbasian