Source: CSV files located on a shared drive (on-prem server). Access to this drive and folder is controlled by a security group.
Expectation: load the CSV data into Google BigQuery tables.
- Is it possible to mount the network drive on the Dataproc cluster and let the Spark application read from the mount?
- Alternatively, if I add the GCP service account as a member of the security group and SSH into the server hosting the network drive, it will still prompt for a password, which would break the automated data pipeline.
What is the best approach to load this data into BigQuery tables?