I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in s3 and all map/reduce applications use s3 file paths to access the data. Now I've been looking at how Amazons EMR is setup and it appears that for each jobflow, a namenode and datanodes are setup. Now I'm wondering if I really need to do it that way or if I could just use s3(n) as the DFS? If doing so, are there any drawbacks?
Thanks!