1

I'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss, version hadoop2-2.1.8. For development I'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with Spark's .csv(...).load(...) functionality. Some files load successfully, but some do not. In the Spark UI I can then see that the load job runs forever until a timeout fires.
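For context, a minimal sketch of the kind of read involved (bucket path, file name, and the local-mode config are placeholders/assumptions, not the asker's actual code):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reading CSV from GCS with the gcs-connector on the classpath.
// Running in local mode from the IDE on the Compute Engine VM.
val spark = SparkSession.builder()
  .appName("gcs-csv-read")
  .master("local[*]")
  // Register the GCS filesystem implementations provided by the connector.
  .config("spark.hadoop.fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .getOrCreate()

// Hypothetical bucket and path.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("gs://my-bucket/path/to/file.csv")

df.show()
```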

But the weird thing is that when I run the same application packaged as a fat JAR on a Dataproc cluster, all of the same files are consumed successfully.

What am I doing wrong?

JanOels
  • Are you getting any error messages? Does this [stack link](https://stackoverflow.com/questions/61197811/can-i-read-csv-files-from-google-storage-using-spark-in-more-than-one-executor/61209050#61209050) help? Can you provide sample code/command you are using? – Prajna Rai T Dec 04 '22 at 14:10
  • Hi @JanOels, Is your issue resolved? – Prajna Rai T Dec 12 '22 at 07:31
  • Hi, yes, with gcs-connector in version hadoop2-2.2.8 the files can be written in the IDEA too. Strange, but yeah, it's resolved. – JanOels Dec 13 '22 at 14:25
  • Hi @JanOels, I have posted the answer as Community wiki. So If my answer addressed your question, please consider upvoting and accepting it. If not, let me know so that the answer can be improved. Accepting an answer will help the community members with their research as well. – Prajna Rai T Dec 15 '22 at 18:28

1 Answer

2

@JanOels, as you mentioned in the comment, using the gcs-connector in version hadoop2-2.2.8 resolves this issue. The latest hadoop2 version of the connector is hadoop2-2.2.10.

For more information about all the hadoop2 versions of the gcs-connector from com.google.cloud.bigdataoss, refer to this document.
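Assuming a Maven build, the upgrade amounts to bumping the connector dependency, along these lines (the `shaded` classifier is the commonly used variant for bundling into a fat JAR; adjust to your build setup):

```xml
<!-- gcs-connector from com.google.cloud.bigdataoss; version per the fix above. -->
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>hadoop2-2.2.10</version>
  <classifier>shaded</classifier>
</dependency>
```

Note that Dataproc images ship their own gcs-connector version on the cluster classpath, which would explain why the fat JAR ran fine on Dataproc while the IDE run with the older hadoop2-2.1.8 connector hung.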

Posting the answer as community wiki for the benefit of community members who might encounter this use case in the future.

Feel free to edit this answer for additional information.

Prajna Rai T