I am trying to duplicate a dataset that has 30 rows until it reaches around 600 million rows. I am currently using a for loop that repeatedly performs a union, but it is taking a lot of time (a sketch of what I am doing is below). Is there a better way to create duplicate rows at this volume in PySpark?
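For reference, this is roughly what the current approach looks like; the DataFrame contents and the copy count are placeholders, not the real data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate-rows").getOrCreate()

# Stand-in for the real 30-row dataset (the actual source is not shown here).
df = spark.range(30).withColumnRenamed("id", "value")

# Current approach: union the same small DataFrame in a loop.
# Reaching ~600 million rows from 30 needs roughly 20 million iterations,
# and every union grows the logical plan, which is why it is so slow.
num_copies = 20_000_000  # 30 rows * 20M copies = 600M rows
result = df
for _ in range(num_copies - 1):
    result = result.union(df)
```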
Thank you.