I am trying to unit-test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service responds that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it is not resolved yet.
Has anyone ever successfully tested mocked S3 reads/writes in PySpark and can share some insights?
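For what it's worth, the boto-only part of the round trip (no Spark involved) can be exercised in isolation to confirm that moto itself is wired up correctly. A minimal sketch along these lines (the sample contents and the assertion are mine, not from the failing test):

import boto
from boto.s3.key import Key
from moto import mock_s3

@mock_s3
def test_boto_only_roundtrip():
    # Everything here goes through boto's Python API, which moto patches,
    # so no real AWS credentials or network access should be needed.
    conn = boto.connect_s3()
    bucket = conn.create_bucket('test-bucket')
    k = Key(bucket)
    k.key = 'data.csv'
    k.set_contents_from_string('a,b\n1,2')
    # Note: on Python 3, boto 2's get_contents_as_string() returns bytes.
    assert k.get_contents_as_string() == 'a,b\n1,2'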
[1]
import boto
import pytest
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    # Dummy credentials for the Hadoop s3n filesystem.
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper that silences py4j logging (definition omitted)
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)

    # Create the bucket and an (empty) key through boto, which moto mocks.
    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')

    # Reading the key back through Spark is where the 403 is raised.
    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    df = (spark
          .read
          .csv(s3_uri))
[2]
(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)
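A possible explanation for the 403: @mock_s3 patches boto's HTTP layer inside the Python process, but spark.read.csv goes through the JVM (the Hadoop s3n filesystem backed by jets3t), whose requests the in-process mock never sees, so they go out to real AWS with the dummy credentials. If that is right, moto's standalone server mode might be a way around it: run moto as a real HTTP endpoint and point jets3t at it. A rough, untested sketch (the port, the moto_server invocation, and the jets3t property names are my assumptions based on the moto and jets3t docs):

# Untested sketch: run moto as a real HTTP server so the JVM's requests
# reach it too (the in-process @mock_s3 cannot intercept JVM traffic).
#
# 1) Start the standalone server (requires moto's server extra):
#      $ moto_server s3 -p 5000
#
# 2) Put a jets3t.properties on the Spark driver classpath so the s3n
#    filesystem talks to localhost instead of amazonaws.com:
#      s3service.s3-endpoint=localhost
#      s3service.s3-endpoint-http-port=5000
#      s3service.https-only=false
#      s3service.disable-dns-buckets=true
#
# 3) Create the bucket against the local endpoint with boto instead of
#    relying on the @mock_s3 decorator:
import boto
from boto.s3.connection import OrdinaryCallingFormat

conn = boto.connect_s3(
    'test-access-key-id', 'test-secret-access-key',
    host='localhost', port=5000, is_secure=False,
    calling_format=OrdinaryCallingFormat())
conn.create_bucket('test-bucket')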