I followed this guide to read data stored in a Google Cloud Storage bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector. It worked fine; the following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results. But when I try to read data with PySpark using

rdd = sc.textFile("gs://crawl_tld_bucket/")
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can I get this to work?
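For context, here is a minimal sketch of the kind of setup I would expect to need, in case it helps pinpoint what is missing. The jar path is an assumption on my part (it depends on where the connector was installed), and I am registering the gs:// scheme via the fs.gs.impl / fs.AbstractFileSystem.gs.impl Hadoop configuration keys that the connector documentation mentions:

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("gcs-read-test")
    # Assumed location of the GCS connector jar; adjust to where the
    # linked guide installed it on your machine or cluster.
    .set("spark.jars", "/usr/lib/hadoop/lib/gcs-connector-hadoop2-latest.jar")
)
sc = SparkContext(conf=conf)

# Register the connector's FileSystem implementations for the gs:// scheme
# so Spark's Hadoop layer can resolve it.
sc._jsc.hadoopConfiguration().set(
    "fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
)
sc._jsc.hadoopConfiguration().set(
    "fs.AbstractFileSystem.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
)

rdd = sc.textFile("gs://crawl_tld_bucket/")
print(rdd.take(5))

Is this the right approach, or is there a different place the connector needs to be configured for PySpark to pick it up?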