
I am new to PySpark. My task is to copy data from a source folder to a destination folder using PySpark, so that the copy is parallelized. In plain Python I can copy the data using:

from shutil import copytree

# recursively copy the whole source tree (files and sub-folders) to the destination
copytree(source, destination)

With this I am able to recursively copy all the data, including the folder structure, using standard Python. I want to do the same task using PySpark on a cluster. How should I proceed? I am using YARN as the resource manager.


2 Answers


Spark allows you to manipulate data, not files. Therefore, I can offer you two solutions:

1 - You read your data with Spark and write it where you need to:

# Spark reads the partitions of the input in parallel on the executors
# and writes them back out in parallel to the destination
(
    spark.read.format("my_format")
    .load("in_path")
    .write.format("my_format")
    .save("out_path")
)
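As a concrete instance of the first approach, here is a minimal sketch; the format ("csv" with a header row), the application name and the paths are placeholders chosen for illustration:

from pyspark.sql import SparkSession

# On a YARN cluster this session is usually created for you by
# spark-submit --master yarn; getOrCreate() reuses it if it exists.
spark = SparkSession.builder.appName("copy-folder").getOrCreate()

# Each executor reads its own partitions of the input and writes them
# out in parallel, so the copy is distributed across the cluster.
(
    spark.read.format("csv")
    .option("header", "true")
    .load("/data/source_folder")          # placeholder input path
    .write.format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("/data/destination_folder")     # placeholder output path
)

Note that because each executor writes its own partitions, the destination ends up as a folder of part files rather than a byte-for-byte copy of the original files.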

2 - The other solution is to use the Hadoop tools:

from subprocess import call

# copy the folder within HDFS; "hdfs dfs -mv" would move it instead of copying
call(["hdfs", "dfs", "-cp", "origin_path", "target_path"])
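If the goal is a parallel, format-agnostic copy within HDFS, Hadoop also ships DistCp, which runs the copy as a distributed job on YARN; a minimal sketch along the same lines, with placeholder paths:

from subprocess import check_call

# DistCp submits a distributed copy job to YARN, so the data is copied in
# parallel across the cluster instead of through a single machine.
# Both paths below are placeholder HDFS paths.
check_call(["hadoop", "distcp", "/user/me/source_folder", "/user/me/destination_folder"])

The same command can of course be run directly from the shell as hadoop distcp <src> <dst>.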
– Steven
Thanks. I like the second solution using the Hadoop tools (as I don't need to worry about the file format). I am copying the data within the same volume on the cluster and have to read it from there. I am also looking at other ways to do this: currently I am working with the Unix file system, using hard links and soft links, which saves space and data copy time. If you can help me with this, I would appreciate it. Thanks. – sahasrara62 Dec 03 '18 at 11:38
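For reference, the link-based approach mentioned above can be sketched as follows, assuming a local POSIX filesystem where source and destination sit on the same volume (hard links require this); the paths are placeholders:

import os
from shutil import copytree

source = "/data/source_folder"            # placeholder paths on a local filesystem
destination = "/data/destination_folder"

# Recreate the directory tree, but hard-link each file instead of copying its
# bytes: no extra space is used and the "copy" finishes almost instantly.
copytree(source, destination, copy_function=os.link)

# A symbolic link to the whole tree is even cheaper, but it is only a pointer:
# deleting the source breaks it.
# os.symlink(source, destination + "_link")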

You can load and write it as a DataFrame (example for Parquet):

# reading a folder pulls in every Parquet file under it, in parallel
df = spark.read.parquet("<your_input_path>")
df.write.parquet("<your_destination_path>")

Here 'your_input_path' can be a folder, and Spark will read and copy all of the files in it.
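A slightly expanded sketch of the same idea, with placeholder paths; the repartition call is an optional addition to show how the number of parallel write tasks (and output files) can be controlled:

# Spark reads and writes one partition per task, so the copy runs in
# parallel across the executors that YARN allocates.
df = spark.read.parquet("/data/source_folder")       # placeholder input path

(
    df.repartition(64)                                # optional: controls tasks / output files
      .write.mode("overwrite")
      .parquet("/data/destination_folder")            # placeholder output path
)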

– pedvaljim