
I'm trying to find missing and null values in my dataframe, but I'm getting an exception. I've included only the first few fields of the schema below:

root
|-- created_at: string (nullable = true)
|-- id: long (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- display_text_range: string (nullable = true)
|-- source: string (nullable = true)
|-- truncated: boolean (nullable = true)
|-- in_reply_to_status_id: double (nullable = true)
|-- in_reply_to_status_id_str: string (nullable = true)
|-- in_reply_to_user_id: double (nullable = true)
|-- in_reply_to_user_id_str: string (nullable = true)
|-- in_reply_to_screen_name: string (nullable = true)
|-- geo: double (nullable = true)
|-- coordinates: double (nullable = true)
|-- place: double (nullable = true)
|-- contributors: string (nullable = true)

Here is the code that throws the exception while counting the missing and null values:

from pyspark.sql.functions import col, count, isnan, when

df_mis = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
df_mis.show()

Here are the AnalysisException details:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-20-6ccaacbbcc7f> in <module>()
----> 1 df_mis = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
      2 df_mis.show()

/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/dataframe.py in select(self, *cols)
   1683         [Row(name='Alice', age=12), Row(name='Bob', age=15)]
   1684         """
-> 1685         jdf = self._jdf.select(self._jcols(*cols))
   1686         return DataFrame(jdf, self.sql_ctx)
   1687 

/content/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1308         answer = self.gateway_client.send_command(command)
   1309         return_value = get_return_value(
-> 1310             answer, self.gateway_client, self.target_id, self.name)
   1311 
   1312         for temp_arg in temp_args:

/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Can't extract value from place#14: need struct type but got double
  • Could you provide a small sample of data that allows us to reproduce the problem? Finding such a sample might even help you solve the problem by yourself. – Oli Nov 07 '21 at 18:00
  • I'm working with a Professor and he collected the data, so I'm not sure if I will be able to share a sample or not. – Subhransu Nanda Nov 07 '21 at 18:58
  • No need for the sample to be real; one fabricated row that reproduces the problem would probably be enough. You could also reduce the number of columns. – Oli Nov 07 '21 at 19:26
  • I see what you're saying. Should be able to come up with it within the hour. – Subhransu Nanda Nov 07 '21 at 19:38
  • Hi @SubhransuNanda, you could take a look at [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) – blackbishop Nov 07 '21 at 21:21

1 Answer


I solved this issue by replacing the dots (".") in the column names with underscores. I found the following Stack Overflow post very helpful. To quote from the post: "The error is there because (.)dot is used to access a struct field".

Extracting value from data frame thorws error because of the . in the column name in spark
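
For reference, a minimal sketch of that rename, assuming df is the DataFrame from the question (df_clean and floating are names introduced here for illustration; note also that isnan() is only defined for numeric columns, so the sketch falls back to isNull() alone for the other types):

from pyspark.sql.functions import col, count, isnan, when
from pyspark.sql.types import DoubleType, FloatType

# Replace dots in column names with underscores so Spark stops
# parsing names like "place.name" as struct-field access.
df_clean = df.toDF(*[c.replace(".", "_") for c in df.columns])

# isnan() only applies to float/double columns; check the rest with isNull() alone.
floating = {f.name for f in df_clean.schema.fields
            if isinstance(f.dataType, (DoubleType, FloatType))}

df_mis = df_clean.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c) if c in floating
    else count(when(col(c).isNull(), c)).alias(c)
    for c in df_clean.columns
])
df_mis.show()

Using toDF(*names) rebuilds the DataFrame with all the new column names in a single pass, which avoids chaining many withColumnRenamed calls.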
