I am trying to unit-test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service responds that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it is not resolved yet.
Has anyone ever successfully tested mocked S3 reads/writes in PySpark and can share some insights?
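For what it's worth, the boto-only part of the round trip (no Spark involved) can be exercised in isolation to confirm that moto itself is wired up correctly. A minimal sketch along these lines (the sample contents and the assertion are mine, not from the failing test):

import boto
from boto.s3.key import Key
from moto import mock_s3

@mock_s3
def test_boto_only_roundtrip():
    # Everything here goes through boto's Python API, which moto patches,
    # so no real AWS credentials or network access should be needed.
    conn = boto.connect_s3()
    bucket = conn.create_bucket('test-bucket')
    k = Key(bucket)
    k.key = 'data.csv'
    k.set_contents_from_string('a,b\n1,2')
    # Note: on Python 3, boto 2's get_contents_as_string() returns bytes.
    assert k.get_contents_as_string() == 'a,b\n1,2'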
[1]
import boto
import pytest
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    # Dummy credentials for the Hadoop s3n filesystem.
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper that silences py4j logging (definition omitted)
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)

    # Create the bucket and an (empty) key through boto, which moto mocks.
    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')

    # Reading the key back through Spark is where the 403 is raised.
    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    df = (spark
          .read
          .csv(s3_uri))
[2]
(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)
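A possible explanation for the 403: @mock_s3 patches boto's HTTP layer inside the Python process, but spark.read.csv goes through the JVM (the Hadoop s3n filesystem backed by jets3t), whose requests the in-process mock never sees, so they go out to real AWS with the dummy credentials. If that is right, moto's standalone server mode might be a way around it: run moto as a real HTTP endpoint and point jets3t at it. A rough, untested sketch (the port, the moto_server invocation, and the jets3t property names are my assumptions based on the moto and jets3t docs):

# Untested sketch: run moto as a real HTTP server so the JVM's requests
# reach it too (the in-process @mock_s3 cannot intercept JVM traffic).
#
# 1) Start the standalone server (requires moto's server extra):
#      $ moto_server s3 -p 5000
#
# 2) Put a jets3t.properties on the Spark driver classpath so the s3n
#    filesystem talks to localhost instead of amazonaws.com:
#      s3service.s3-endpoint=localhost
#      s3service.s3-endpoint-http-port=5000
#      s3service.https-only=false
#      s3service.disable-dns-buckets=true
#
# 3) Create the bucket against the local endpoint with boto instead of
#    relying on the @mock_s3 decorator:
import boto
from boto.s3.connection import OrdinaryCallingFormat

conn = boto.connect_s3(
    'test-access-key-id', 'test-secret-access-key',
    host='localhost', port=5000, is_secure=False,
    calling_format=OrdinaryCallingFormat())
conn.create_bucket('test-bucket')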