I have a Spark Structured Streaming DataFrame with a column that contains an array of arrays, like:
[[ts1, url1], ... , [tsN, urlN]]
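where each inner element is actually a struct with this schema:

    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema_visits = StructType(
        fields=[
            StructField("timestamp", LongType(), True),
            StructField("url", StringType(), True),
        ]
    )
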
and N varies from row to row.
What I'd like is a column containing only the URLs:
[url1, ... , urlN]
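To make the desired result concrete, here is a minimal plain-Python sketch of the per-row transformation I'm after. This is not Spark code: `Visit`, `urls_only`, and the sample values are made up purely for illustration.

```python
from typing import List, NamedTuple


class Visit(NamedTuple):
    """Hypothetical stand-in for one struct in the array (timestamp, url)."""
    timestamp: int
    url: str


def urls_only(visits: List[Visit]) -> List[str]:
    """Keep only the url field of each visit, preserving order."""
    return [visit.url for visit in visits]


# One row's worth of data; the length of the list differs from row to row.
row = [
    Visit(timestamp=1, url="http://example.com/a"),
    Visit(timestamp=2, url="http://example.com/b"),
]
print(urls_only(row))
```

I'm looking for the Spark equivalent of `urls_only`, applied to every row of the streaming DataFrame.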
I think I could combine explode(), getItem("url"), and groupby + agg + collect_list to achieve that, but I wonder if there is a simpler way?
Since it's Structured Streaming, UDFs are probably not going to work, but if they do, that would be interesting as well.