
I have the PySpark DataFrame below:

Job_name start_time status
A        09:00:00   Not started
B        09:30:00   Completed
C        09:30:00   Running

I am trying to create a list from the above DataFrame, like below.

Expected output:

lst = ["job A Not started at 09:00:00", "job B Completed at 09:30:00", "job C Running at 09:30:00"]

Is there a way to convert a PySpark DataFrame to a list like the one above?

Padfoot123
  • Does [this topic](https://stackoverflow.com/questions/38610559/convert-spark-dataframe-column-to-python-list) help you? – Christophe Sep 21 '21 at 11:47
  • It's similar, but not exactly the same. I need all the columns of a row combined into a single item of the output list – Padfoot123 Sep 21 '21 at 11:49
  • You could build an additional column with the concatenation of the existing ones, then use the method above – Christophe Sep 21 '21 at 11:51

4 Answers


Using your DF, rearranged slightly:

df = df.select('Job_name','status','start_time')

>>> df.show()
+--------+-----------+----------+
|Job_name|     status|start_time|
+--------+-----------+----------+
|       A|Not started|  09:00:00|
|       B|  Completed|  09:30:00|
|       C|    Running|  09:30:00|
+--------+-----------+----------+

Try the below, building the list with a list comprehension:

li = [' '.join(('Job', row[0], row[1], 'at', row[2])) for row in df.collect()]

>>> li
['Job A Not started at 09:00:00',
 'Job B Completed at 09:30:00',
 'Job C Running at 09:30:00']
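
If you'd rather not depend on the column order, a small variant (sketch) accesses the Row fields by name, so the select() rearrangement isn't needed:

# Sketch: field access by name, so column order doesn't matter
li = ['Job {} {} at {}'.format(row['Job_name'], row['status'], row['start_time'])
      for row in df.collect()]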
sophocles

You can try with df.rdd.collect():

print(['job {} {} at {}'.format(row['Job_name'], row['status'], row['start_time'])
       for row in df.rdd.collect()])

This should do what you're looking for.
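
If the DataFrame is too large to collect() onto the driver in one go, a possible alternative (sketch) is toLocalIterator(), which streams rows one partition at a time:

# Sketch: builds the same list without materialising all rows at once
lst = ['job {} {} at {}'.format(row['Job_name'], row['status'], row['start_time'])
       for row in df.rdd.toLocalIterator()]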

Lue Mar

You should combine the required fields together using Spark beforehand, then finally extract the results.

Data Preparation

input_str = """
A|09:00:00   |Not started|
B|09:30:00   |Completed|
C|09:30:00   |Running
""".split("|")

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "Job_name start_time status".split()))
            
n = len(input_values)
n_cols = 3

input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF.show()

+--------+----------+-----------+
|Job_name|start_time|     status|
+--------+----------+-----------+
|       A|  09:00:00|Not started|
|       B|  09:30:00|  Completed|
|       C|  09:30:00|    Running|
+--------+----------+-----------+

Collect

[x['output'] for x in
    sparkDF.select(F.concat(
                       F.lit("job "), F.col('Job_name'), F.lit(" "), F.col("status"), F.lit(" at "), F.col('start_time')
                   ).alias("output")
             ).collect()]

['job A Not started at 09:00:00',
 'job B Completed at 09:30:00',
 'job C Running at 09:30:00']
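
A more compact variant (sketch, same output) builds the string with F.format_string instead of chaining concat and lit:

# Sketch: printf-style template, columns referenced by name
[r['output'] for r in
    sparkDF.select(
        F.format_string("job %s %s at %s", "Job_name", "status", "start_time").alias("output")
    ).collect()]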
Vaebhav

You can achieve this using the concat_ws and collect_list functions:

from pyspark.sql import functions as f

df = df.withColumn('ncol',f.concat_ws(" ",f.lit('Job'),f.col('Job_name'),f.col('status'),f.lit("at"),f.col('start_time')))
df.show(truncate=False)

+--------+----------+-----------+-----------------------------+
|Job_name|start_time|status     |ncol                         |
+--------+----------+-----------+-----------------------------+
|A       |09:00:00  |Not started|Job A Not started at 09:00:00|
|B       |09:30:00  |Completed  |Job B Completed at 09:30:00  |
|C       |09:30:00  |Running    |Job C Running at 09:30:00    |
+--------+----------+-----------+-----------------------------+

df.select(f.collect_list('ncol')).head()[0]

['Job B Completed at 09:30:00','Job C Running at 09:30:00','Job A Not started at 09:00:00']
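
Note that collect_list makes no ordering guarantee, which is why the rows above came back shuffled. If order matters, one option (sketch) is to sort the collected array on the Spark side:

# Sketch: returns the collected strings in ascending (alphabetical) order
df.select(f.sort_array(f.collect_list('ncol'))).head()[0]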
Suresh