Error encountered in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"

Data structure:

root
 |-- result_set: struct (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- dates: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- trackers: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- countries: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- country: string (nullable = true)
 |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
 |    |    |    |    |    |-- token: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- token: string (nullable = true)

I am trying to create a view to show currency, date, and token:

df.select('result_set.currency', 'result_set.dates.date', 'result_set.dates.trackers.token').show()

Sample of data:

"result_set": {
        "token": "abcdef",
        "name": "Facebook",
        "currency": "EUR",
        "dates": [
            {
                "date": "2020-03-11",
                "trackers": [
                    {
                        "token": "12345",
                        "countries": [
                            {
                                "country": "am",
                                "os_names": [
                                    {
                                        "os_name": "android",
                                        "kpi_values": [
                                            0,
                                            0,
                                            0,
                                            0,
                                            0,
                                            0,
                                            1,
                                            0,
                                            0
                                        ]
                                    }
                                ]
                            },

I am trying to create a view based on a few of these levels within the JSON data.

Update:

Duplicating token

df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
selectExpr("currency","date","token", "explode(countries)").\
select("currency","date","token", "col.*").\
selectExpr("currency","date","token", "country", "explode(os_names)").\
select("currency","date","token", "country", "col.*").\
selectExpr("currency","date","token", "country", "os_name", "explode(kpi_values)").\
show(20)

After these explodes, the token now repeats 8 times.

– dataviews
1 Answer

As you are accessing an array of structs, you need to specify which element of the array to access, i.e. 0, 1, 2, etc.

  • If you need to select all elements of the array, use explode().

Example:

df.printSchema()
#root
# |-- result_set: struct (nullable = true)
# |    |-- currency: string (nullable = true)
# |    |-- dates: array (nullable = true)
# |    |    |-- element: struct (containsNull = true)
# |    |    |    |-- date: string (nullable = true)
# |    |    |    |-- trackers: array (nullable = true)
# |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |-- countries: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- country: string (nullable = true)
# |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
# |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
# |    |    |    |    |    |    |    |    |    |    |-- element: long (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
# |    |    |    |    |    |-- token: string (nullable = true)
# |    |-- name: string (nullable = true)
# |    |-- token: string (nullable = true)

#accessing token, date from the arrays (note: the results are still arrays)
df.selectExpr('result_set.dates.trackers[0].token','result_set.currency', 'result_set.dates.date').show()
#+--------------------------------------------------+--------+------------+
#|result_set.dates.trackers AS trackers#194[0].token|currency|        date|
#+--------------------------------------------------+--------+------------+
#|                                           [12345]|     EUR|[2020-03-11]|
#+--------------------------------------------------+--------+------------+

#accessing first elements from dates, trackers array and extracting date,token values
df.selectExpr('result_set.dates[0].trackers[0].token as token','result_set.currency', 'result_set.dates[0].date as date').show()
#+-----+--------+----------+
#|token|currency|      date|
#+-----+--------+----------+
#|12345|     EUR|2020-03-11|
#+-----+--------+----------+
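
The same indexed access can also be written with the DataFrame Column API. A minimal sketch, assuming the schema above (getItem indexes into the array, getField drills into the struct):

from pyspark.sql.functions import col

#getItem(0) takes the first array element; getField() pulls a field out of the struct
df.select(col("result_set.dates").getItem(0).getField("trackers").getItem(0).getField("token").alias("token"),
          col("result_set.currency").alias("currency"),
          col("result_set.dates").getItem(0).getField("date").alias("date")).show()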

#to select all elements of the array, explode each array level and then select the data
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
select("currency","date","token").\
show()

#+--------+----------+-----+
#|currency|      date|token|
#+--------+----------+-----+
#|     EUR|2020-03-11|12345|
#+--------+----------+-----+
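
A minimal sketch carrying the same explode pattern through every array level, assuming the schema above; posexplode (available in Spark 2.x) additionally returns each KPI value's position:

#one explode per array level; posexplode outputs two columns, pos and col
df.selectExpr("result_set.currency","explode(result_set.dates) as d").\
selectExpr("currency","d.date","explode(d.trackers) as t").\
selectExpr("currency","date","t.token","explode(t.countries) as c").\
selectExpr("currency","date","token","c.country","explode(c.os_names) as o").\
selectExpr("currency","date","token","country","o.os_name","posexplode(o.kpi_values)").\
withColumnRenamed("pos","kpi_index").\
withColumnRenamed("col","kpi_value").\
show()

Here token repeats once per KPI value, which is the duplication discussed in the comments below.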
– notNull
  • @dataviews, try this link: https://spark.apache.org/docs/2.4.4/api/sql/index.html, which documents all the Spark functions! – notNull Mar 12 '20 at 02:08
  • How could I access all of the nested arrays? Is there a way to use a loop for the arrays that I'm indexing? @Shu – dataviews Mar 12 '20 at 02:32
  • @dataviews, built-in Spark doesn't have that function yet! Refer to these links for dynamic flattening: https://stackoverflow.com/questions/37471346/automatically-and-elegantly-flatten-dataframe-in-spark-sql https://stackoverflow.com/questions/34271398/flatten-nested-spark-dataframe https://stackoverflow.com/questions/46178325/flatten-nested-struct-in-pyspark-array – notNull Mar 12 '20 at 02:51
  • this is one of the data structures I'm trying to unpack: [[[zz, [[[1347.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 7.0E-4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], android]]]], – dataviews Mar 12 '20 at 03:05
  • @dataviews, if it is an array type then try explode, or post it as a new question so that it will be visible to all members of the community and you can get help quickly! – notNull Mar 12 '20 at 03:24
  • Explode worked, thank you....but if you call explode, the token repeats 8 times to account for all trackers – dataviews Mar 12 '20 at 04:27
  • @dataviews, if the trackers array has 8 objects in it then yes.. to also flatten `countries` in trackers, we need to do explode again for `countries` – notNull Mar 12 '20 at 04:43
  • i updated my code at the bottom of my question if you want to take a look – dataviews Mar 12 '20 at 04:48
  • Yes, that is expected behavior from `spark`, as `token` is two levels up from `kpi_values`; when we explode, we get the duplicated `token` associated with all the nested fields. http://sqlandhadoop.com/spark-dataframe-explode/ https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView – notNull Mar 12 '20 at 05:02
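
If one row per tracker is wanted after the full flatten, a minimal sketch (assuming the fully exploded frame from the update above is saved as flat, with the KPI column under explode's default name col) is to aggregate the exploded values back:

from pyspark.sql import functions as F

#hypothetical: flat is the fully exploded DataFrame from the update above
#collect_list folds the exploded KPI rows back into one array per tracker (element order is not guaranteed)
flat.groupBy("currency","date","token","country","os_name").\
agg(F.collect_list("col").alias("kpi_values")).\
show()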