
I have a Spark Structured Streaming DataFrame column that contains an array of structs, like:

[[ts1, url1], ... , [tsN, urlN]]

where each element is a struct with this schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType

schema_visits = StructType(
    fields=[
        StructField("timestamp", LongType(), True),
        StructField("url", StringType(), True),
    ]
)

and N is variable from row to row.

What I'd like is to get a column with urls only:

[url1, ... , urlN]
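To make the target concrete, here is the per-row transformation I mean, sketched in plain Python rather than Spark. The sample rows and the helper name `urls_only` are made up for illustration; only the `timestamp`/`url` field names come from the schema above.

```python
# Made-up sample data: one list of visits per row, each visit shaped like
# schema_visits (timestamp, url). N varies from row to row, including N = 0.
rows = [
    [{"timestamp": 1, "url": "a.example"}, {"timestamp": 2, "url": "b.example"}],
    [{"timestamp": 3, "url": "c.example"}],
    [],
]


def urls_only(visits):
    """Keep only the url field of each visit, preserving the original order."""
    return [visit["url"] for visit in visits]


# One list of urls per input row.
result = [urls_only(visits) for visits in rows]
print(result)
```

So the output column should have the same number of rows as the input, with each array reduced to its urls.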

I think I could use explode(), getItem("url"), and then groupBy + agg + collect_list to achieve that, but I wonder if there is a simpler way.

Since this is Structured Streaming, UDFs are probably not going to work, but if they do, that would be interesting as well.

Artem Trunov