
I'm trying to use Jupyter Notebook to query files in S3. PySpark and Java are installed on my computer. I tried this code:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
df = spark.read.parquet("s3://path/to/parquet/file.parquet")

But when I run this code, the following error appears. I've tried other code as well, but I always get the same error:

Py4JJavaError: An error occurred while calling o30.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
    at scala.collection.immutable.List.map(List.scala:293)
    at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
    at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:562)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Unknown Source)

What can I do?

daniel____
  • this may help you `https://stackoverflow.com/questions/46740670/no-filesystem-for-scheme-s3-with-pyspark` – iambdot Oct 28 '22 at 20:32

1 Answer


You need to ensure that the additional dependent libraries (the hadoop-aws connector and its AWS SDK dependency) are present before you attempt to read data sources from S3.

You can refer to this answer as a reference to set up your environment accordingly: java.io.IOException: No FileSystem for scheme: s3
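For example, here is a minimal sketch of the question's session with the S3 connector wired in. The hadoop-aws version used here (3.3.4) is an assumption; it must match the Hadoop version bundled with your Spark distribution (check the hadoop-*.jar files under pyspark/jars):

import pyspark
from pyspark.sql import SparkSession

# spark.jars.packages downloads hadoop-aws (and its aws-java-sdk-bundle
# dependency) at session startup; hadoop-aws provides S3AFileSystem.
# NOTE: 3.3.4 is an assumed version -- match it to your Spark's Hadoop build.
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config('spark.cores.max', '6') \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4') \
    .getOrCreate()

# The connector registers the "s3a" scheme, not "s3", so use an s3a:// path.
df = spark.read.parquet("s3a://path/to/parquet/file.parquet")

Credentials are picked up from the usual AWS provider chain (environment variables, ~/.aws/credentials, or an instance profile); you can also set them explicitly through the spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key configs.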

Vaebhav