
I am new to PySpark. My task is to copy data from a source folder to a destination folder using PySpark, so that the copy is parallelized. In plain Python I can copy the data using:

from shutil import copytree

# recursively copy the whole source tree (files and sub-folders) to the destination
copytree(source, destination)

With this I am able to recursively copy all the data, including the folder structure, using standard Python. I want to do the same task using PySpark on a cluster. How should I proceed? I am using YARN as the resource manager.


2 Answers


Spark allows you to manipulate data, not files. Therefore, I can offer you two solutions:

1 - You read your data with Spark and write it where you need to:

# Spark reads the partitions of the input in parallel on the executors
# and writes them back out in parallel to the destination
(
    spark.read.format("my_format")
    .load("in_path")
    .write.format("my_format")
    .save("out_path")
)
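As a concrete instance of the first approach, here is a minimal sketch; the format ("csv" with a header row), the application name and the paths are placeholders chosen for illustration:

from pyspark.sql import SparkSession

# On a YARN cluster this session is usually created for you by
# spark-submit --master yarn; getOrCreate() reuses it if it exists.
spark = SparkSession.builder.appName("copy-folder").getOrCreate()

# Each executor reads its own partitions of the input and writes them
# out in parallel, so the copy is distributed across the cluster.
(
    spark.read.format("csv")
    .option("header", "true")
    .load("/data/source_folder")          # placeholder input path
    .write.format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("/data/destination_folder")     # placeholder output path
)

Note that because each executor writes its own partitions, the destination ends up as a folder of part files rather than a byte-for-byte copy of the original files.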

2 - The other solution is to use the Hadoop tools:

from subprocess import call

# copy the folder within HDFS; "hdfs dfs -mv" would move it instead of copying
call(["hdfs", "dfs", "-cp", "origin_path", "target_path"])
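If the goal is a parallel, format-agnostic copy within HDFS, Hadoop also ships DistCp, which runs the copy as a distributed job on YARN; a minimal sketch along the same lines, with placeholder paths:

from subprocess import check_call

# DistCp submits a distributed copy job to YARN, so the data is copied in
# parallel across the cluster instead of through a single machine.
# Both paths below are placeholder HDFS paths.
check_call(["hadoop", "distcp", "/user/me/source_folder", "/user/me/destination_folder"])

The same command can of course be run directly from the shell as hadoop distcp <src> <dst>.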
– Steven
Thanks. I like the second solution using the Hadoop tools (as I don't need to worry about the file format). I am copying the data within the same volume on the cluster and have to read it from there. I am also looking at other ways to do this: currently I am working with the Unix file system, using hard links and soft links, which saves space and data copy time. If you can help me with this, I would appreciate it. Thanks. – sahasrara62 Dec 03 '18 at 11:38
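For reference, the link-based approach mentioned above can be sketched as follows, assuming a local POSIX filesystem where source and destination sit on the same volume (hard links require this); the paths are placeholders:

import os
from shutil import copytree

source = "/data/source_folder"            # placeholder paths on a local filesystem
destination = "/data/destination_folder"

# Recreate the directory tree, but hard-link each file instead of copying its
# bytes: no extra space is used and the "copy" finishes almost instantly.
copytree(source, destination, copy_function=os.link)

# A symbolic link to the whole tree is even cheaper, but it is only a pointer:
# deleting the source breaks it.
# os.symlink(source, destination + "_link")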

You can load and write it as a DataFrame (example for Parquet):

# reading a folder pulls in every Parquet file under it, in parallel
df = spark.read.parquet("<your_input_path>")
df.write.parquet("<your_destination_path>")

Here 'your_input_path' can be a folder, and Spark will read and copy all of the files in it.
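A slightly expanded sketch of the same idea, with placeholder paths; the repartition call is an optional addition to show how the number of parallel write tasks (and output files) can be controlled:

# Spark reads and writes one partition per task, so the copy runs in
# parallel across the executors that YARN allocates.
df = spark.read.parquet("/data/source_folder")       # placeholder input path

(
    df.repartition(64)                                # optional: controls tasks / output files
      .write.mode("overwrite")
      .parquet("/data/destination_folder")            # placeholder output path
)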

– pedvaljim