I am struggling to come up with a reasonable way to format my data into an appropriate structure to feed into a PySpark DataFrame. I am new to PySpark, so perhaps I am missing something relatively straightforward. I have a large text file (~500 MB) in the following format:

1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
2:
1076258,3,2004-06-28
1663216,2,2004-12-28
549526,3,2005-05-11
1850680,1,2005-09-17
3:
1307418,4,2005-10-15
253326,5,2005-04-15
486798,5,2005-05-27

I want to load it into a PySpark DataFrame. I started by reading it into an RDD like so:

dirPath = 'hdfs://data/movie-data/file.txt'

movieratings_RDD = sc.textFile(dirPath)
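
At this point each element of the RDD is just one raw line of text, so the movie-id header lines are mixed in with the rating lines:

movieratings_RDD.take(4)
# ['1:', '1488844,3,2005-09-06', '822109,5,2005-05-13', '885013,4,2005-10-19']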

I was wondering if there is a more succinct way of extracting the "number colon" header values (1:, 2:, ...) and putting them into a separate column of their own, like so:

1,1488844,3,2005-09-06
1,822109,5,2005-05-13
1,885013,4,2005-10-19
1,30878,4,2005-12-26
1,823519,3,2004-05-03
2,1076258,3,2004-06-28
2,1663216,2,2004-12-28
2,549526,3,2005-05-11
2,1850680,1,2005-09-17
etc.

I know I could just loop through each row and regenerate the columns, but I assume there is a more efficient way of performing this task. I did look at the explode function, but that applies when you want to take a grouped set of values in a column and give each value its own row, which is not quite this situation.
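
For reference, the brute-force version I am trying to avoid looks roughly like this. It collects everything back to the driver, which is workable for a ~500 MB file but throws away Spark's parallelism, and it assumes a SparkSession named spark (sqlContext.createDataFrame would be the older equivalent):

current_movie_id = None
rows = []
for line in movieratings_RDD.collect():
    line = line.strip()
    if not line:
        continue
    if line.endswith(':'):
        # header line such as '1:' -- remember which movie the
        # following rating lines belong to
        current_movie_id = int(line[:-1])
    else:
        user_id, rating, date = line.split(',')
        rows.append((current_movie_id, int(user_id), int(rating), date))

movieratings_df = spark.createDataFrame(
    rows, ['movie_id', 'user_id', 'rating', 'date'])

What I am hoping exists is something that produces the same result without leaving the cluster.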
