Spark array of arrays: how to extract first element of each subarray (structtype)?

Asked Apr 23 '19 at 16:23

Active Apr 23 '19 at 16:36


I have a Spark Structured Streaming DataFrame with a column that contains an array of arrays, like:

[[ts1, url1], ... , [tsN, urlN]]

where each subarray is a struct:

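from pyspark.sql.types import StructType, StructField, LongType, StringType

# Schema of one element of the array: a (timestamp, url) pair
schema_visits = StructType(
    fields=[
        StructField("timestamp", LongType(), True),
        StructField("url", StringType(), True),
    ]
)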

and N is variable from row to row.

What I'd like is to get a column with urls only:

[url1, ... , urlN]

I think I could use explode(), then getItem("url"), then groupBy + agg + collect_list to achieve that, but I wonder if there is a simpler way?

Since this is Structured Streaming, UDFs are probably not going to work, but if they do, a UDF-based solution would be interesting as well.

edited Apr 23 '19 at 16:36


Artem Trunov

0 Answers