
Related to the question below, but I'm still struggling:

Load S3 Data into AWS SageMaker Notebook

I'm trying to load a parquet file from an S3 bucket in my account (the bucket name contains "sagemaker").

I'm trying to access the file using both conventions (the Object URL of the file and the usual s3:// URI):

pf1 = ParquetFile("https://s3.amazonaws.com/sagemaker-us-east-1-716296694085/data/t_spp_vcen_cons_sales_fact-part-1.parquet")
pf1 = ParquetFile("s3://sagemaker-us-east-1-716296694085/data/t_spp_vcen_cons_sales_fact-part-1.parquet")
df1 = pf1.to_pandas()

It fails with FileNotFoundError even though the file is there. The funny thing is that when I create a model and use boto3 I actually am able to write to the same bucket:

import io
import os
import boto3
import numpy as np
import sagemaker.amazon.common as smac

buf = io.BytesIO()
# serialize the training arrays into the RecordIO-protobuf format SageMaker expects
smac.write_numpy_to_dense_tensor(buf, np.array(train_X).astype('float32'), np.array(train_y).astype('float32'))
buf.seek(0)
key = 'linear_train.data'
prefix = "Sales_867_ts"
boto3.resource('s3').Bucket(bucket_write).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket_write, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

So a couple of newbie questions:

  • do I also need boto3 to read the file, and if so, how do I do that? (See the sketch after this list.)

  • do I somehow need to amend my IAM role so that I can do this without the boto3 calls?

  • when I copy the data into the Jupyter notebook instance I actually have no issues reading it directly. Where exactly is that data stored then?

pf1 = ParquetFile("./Sales_867_ts/inputData/t_spp_vcen_cons_sales_fact-part-1.parquet")

Bullzeye
  • Hi Bullzeye, could you post a little bit more of your code, especially your imports? Specifically, I'd like to understand which python ParquetFile implementation you're using. (There are a few out there.) – Kevin McCormick Apr 16 '19 at 19:50

1 Answer


Just import s3fs and then use df = pd.read_csv (or, for the parquet file here, pd.read_parquet) directly on the s3:// path; pandas delegates S3 access to s3fs. You have to do a conda install of the s3fs library first, though.
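A minimal sketch of that suggestion, assuming the bucket and key from the question (pandas hands s3:// paths to s3fs once the library is installed, and read_parquet also needs a parquet engine such as pyarrow or fastparquet):

import pandas as pd
import s3fs  # install first, then import so pandas can resolve s3:// paths

# per the answer: conda install -c conda-forge s3fs

# s3fs picks up the notebook execution role's credentials automatically
df = pd.read_parquet(
    's3://sagemaker-us-east-1-716296694085/data/t_spp_vcen_cons_sales_fact-part-1.parquet')

If the ParquetFile in the question is fastparquet's, the same library also works there, e.g. ParquetFile('sagemaker-us-east-1-716296694085/data/t_spp_vcen_cons_sales_fact-part-1.parquet', open_with=s3fs.S3FileSystem().open).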

vanetoj