
I have this data as output when I run timeStamp_df.head() in PySpark:

Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')

My expected output is:

+----------------------------+
|timeStamp                   |
+----------------------------+
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-05-03T11:30:16.900+0000|
|2020-04-03T11:30:16.900+0000|
+----------------------------+

I first tried to use the .collect() method so I could iterate over the result:

rows_list = timeStamp_df.collect()
print(rows_list)

Its output is:

[Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')]

Just to see the values, I am using a print statement:

def print_row(row):
    print(row.timeStamp)


for row in rows_list:
    print_row(row)

But I am getting a single line of output, because the list contains only one Row, so the loop iterates only once:

ISODate(2020-06-03T11:30:16.900+0000)

How can I iterate over the data of a Row in PySpark?

AB21

1 Answer

  1. You cannot repeat keyword arguments when creating a Row.
  2. A valid Row is iterable:
from pyspark.sql import Row

row = Row(a=10, b=20, c=30)
print([column for column in row])

[10, 20, 30]
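
Applying this to the Row in the question: because all five fields share the name `timeStamp`, `row.timeStamp` can only resolve one of them, but iterating over the Row itself yields every value. Below is a minimal sketch of both iterating and stripping the `ISODate(...)` wrapper, assuming the values are plain strings exactly as shown; the stand-in DataFrame and the regular expression are illustrative, not taken from the original post:

import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for timeStamp_df: one string column of ISODate(...) values.
timeStamp_df = spark.createDataFrame(
    [("ISODate(2020-06-03T11:30:16.900+0000)",),
     ("ISODate(2020-05-03T11:30:16.900+0000)",)],
    ["timeStamp"],
)

# Iterate over every value of every collected Row, not just row.timeStamp.
for row in timeStamp_df.collect():
    for value in row:  # a Row is a tuple subclass, so it iterates over its values
        # Strip the ISODate(...) wrapper to get the bare timestamp string.
        print(re.sub(r"^ISODate\((.*)\)$", r"\1", value))

# The same cleanup as a DataFrame transformation, without collecting,
# which reproduces the tabular output shown in the question:
timeStamp_df.select(
    regexp_extract("timeStamp", r"ISODate\((.*)\)", 1).alias("timeStamp")
).show(truncate=False)

If the goal is only the cleaned-up table rather than Python-side iteration, the regexp_extract/select version keeps the work inside Spark and avoids collecting the data to the driver.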
boyangeor