
I'm trying to filter GA sessions in PySpark based on their customDimensions. The data looks like this:

+--------------------+--------------------+                                     
|       fullVisitorId|                  cd|
+--------------------+--------------------+
| 5823179578207509663|[[1, app_tv], [36...|
| 5220700153870728639|[[107, live], [10...|
|16421406313456036559|[[1, app_tv], [36...|
|18135892068782985696|[[1, app_tv], [36...|
| 5865612025708664451|[[1, app_tv], [36...|
| 8103574485485735385|[[1, web], [36, d...|
| 6603732532553270294|[[1, web], [36, m...|
|   70498423600813735|[[1, web], [36, d...|
| 5017675391641460547|[[1, web], [36, d...|
+--------------------+--------------------+

Following the GA schema, the cd (customDimensions) column holds an array of (index, value) pairs.

How can I efficiently select the fullVisitorIds that have, for example, an entry with index = 107 and value = 'live', as in the second row of the example?
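To make the goal concrete, here is a rough sketch of the kind of filter I have in mind, not something verified against the real data. It assumes each element of cd is a struct with fields named `index` and `value` (as in the GA export schema) and that Spark is 2.4 or later, so the SQL higher-order function `exists` is available. The helper names `cd_predicate` and `visitors_with_cd` are made up for illustration.

```python
def cd_predicate(index, value):
    """Build a Spark SQL predicate that is true when `cd` contains (index, value)."""
    # Assumption: every element of `cd` is a struct with fields `index` and `value`.
    # Backslash-escape backslashes and quotes so the value stays a valid SQL
    # string literal (this assumes Spark's default string-literal parsing).
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"exists(cd, x -> x.index = {int(index)} AND x.value = '{escaped}')"


def visitors_with_cd(df, index, value):
    """Keep rows whose `cd` array has a matching entry; return only fullVisitorId."""
    # DataFrame.filter accepts a SQL expression string, so no extra imports are needed.
    return df.filter(cd_predicate(index, value)).select("fullVisitorId")


# Usage, given a DataFrame `df` shaped like the one above:
#   visitors_with_cd(df, 107, "live").show()
```

Since the predicate is evaluated per row inside Spark, this should avoid exploding the array or going through a Python UDF, but I'm not sure it is the most efficient option.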

waaat
