1

From Amazon's EMR FAQ:

Q: Can I load my data from the internet or somewhere other than Amazon S3?

Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.

What are the specifications for loading data from external (non-S3) sources? There seems to be a dearth of resources on this option, and it doesn't appear to be documented anywhere.

  • Not sure I understand the question. Are you asking, "How do I load data from the internet into an EMR-based Hadoop instance?" – Chris White Jun 06 '12 at 20:19
  • @ChrisWhite Yeah, that's exactly what I'm asking. EMR allows data to come from places other than S3, but there's no explanation of how to accomplish that. – Sandeep Parikh Jun 06 '12 at 23:42

3 Answers

2

If you want to do it "the Hadoop way", you should either implement a DFS over your data source, or put references to your source URLs into a file that serves as the input for the MR job.
At the same time, Hadoop is about moving code to the data. Even EMR over S3 is not ideal from this perspective, since EC2 and S3 are different clusters. So it is hard to imagine efficient MR processing when the data source is physically outside the data center.
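
The URL-list approach could look roughly like the sketch below, written as a Hadoop Streaming mapper in Python. It assumes the job's input file holds one URL per line; the function names (`fetch`, `map_urls`, `main`) and the choice to emit a line count per URL are illustrative only, not anything EMR prescribes.

```python
import sys
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return its body as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def map_urls(lines, fetcher=fetch):
    """Yield (url, line_count) for every URL listed in `lines`.

    Blank lines and lines starting with '#' are ignored. A URL that
    fails to download is reported on stderr and skipped, so one bad
    source does not fail the whole map task.
    """
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        try:
            body = fetcher(url)
        except Exception as exc:
            sys.stderr.write(f"skipping {url}: {exc}\n")
            continue
        yield url, len(body.splitlines())


def main(stdin=sys.stdin, stdout=sys.stdout, fetcher=fetch):
    """Streaming entry point: URLs arrive on stdin, key<TAB>value goes to stdout."""
    for url, count in map_urls(stdin, fetcher):
        stdout.write(f"{url}\t{count}\n")
```

To use this as a streaming mapper, the script would end with a call to `main()`. Each map task then pulls its share of the URLs over the network, which is exactly where the EC2 bandwidth charges and the data-locality cost mentioned above come in.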

David Gruzman
0

Basically, what Amazon is saying is that you can programmatically access any content from the internet or any other source from within your code. For example, you can access a CouchDB instance through any HTTP-based client API.
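
As a rough illustration of the CouchDB case, here is a minimal Python sketch that pages through a database over plain HTTP using CouchDB's `_all_docs` endpoint with `include_docs`, `limit`, and `skip`. The host, database name, and helper names (`all_docs_url`, `http_get_json`, `iter_docs`) are hypothetical, and `skip`-based paging is chosen for brevity, not efficiency.

```python
import json
import urllib.parse
import urllib.request


def all_docs_url(base_url, database, limit, skip=0):
    """Build an `_all_docs` URL that returns one page of full documents."""
    query = urllib.parse.urlencode(
        {"include_docs": "true", "limit": limit, "skip": skip}
    )
    db = urllib.parse.quote(database, safe="")
    return f"{base_url.rstrip('/')}/{db}/_all_docs?{query}"


def http_get_json(url, timeout=30):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def iter_docs(base_url, database, page_size=100, get_json=http_get_json):
    """Yield every document in the database, one page per HTTP request."""
    skip = 0
    while True:
        page = get_json(all_docs_url(base_url, database, page_size, skip))
        rows = page.get("rows", [])
        for row in rows:
            yield row["doc"]
        # A short page means there is nothing left to fetch.
        if len(rows) < page_size:
            return
        skip += page_size
```

A mapper or a job-setup step could call something like `iter_docs("http://couch.example:5984", "events")` (a placeholder address) and emit each document as a record.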

Amar
0

I know that the Cassandra package for Java has a source package named org.apache.cassandra.hadoop, and it contains two classes that you need for reading data from Cassandra when running on AWS Elastic MapReduce.

Essential classes: ColumnFamilyInputFormat.java and ConfigHelper.java

Go to this link to see an example of what I'm talking about.

Walter Jr.