Why are direct writes to Amazon S3 eliminated in EMR 5.x versions?

Question

After reading this page:

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.

I wonder - does this mean that writing to S3 from Hive in EMR 4.x versions will be faster than 5.x versions?
If so, isn't it kind of regression? why would AWS want to eliminate this optimization?
Writing to a Hive table which is located in S3 is a very common scenario.

Can someone clear up that issue?

score 0 · Answer 1 · edited May 10 '23 at 20:16

This optimization originally was developed by Qubole and pushed to Apache Hive. See here.

This feature is rather dangerous because it is a bypass of Hive fault-tolerance mechanism and also forces developers to use normally unnecessary intermediate tables which, in its turn leads to performance degradation and increases cost.

Very common use-case is when we need to merge increment data into partitioned target table, described here The query is an insert overwrite table from itself, without intermediate table (in a single query) it is rather efficient. The query can be much more complex, with many tables joined. This is what happens with Direct Writes enabled in this use-case:

Partition folder is being deleted before query finished, this causing FileNotFound exception in Mapper reading the same table which is being written fails because partition folder deleted before mapper executed.
If the target table is initially empty, first run succeeds because Hive knows there is no any partition and does not read folder. Second run fails because see (1) folder deleted before mapper finishes.
Known workaround has performance impact. Loading data incrementally is quite often use-case. Direct writing to S3 feature forces developers to use temporary table in this case to eliminate FileNotFoundException and table corruption. As a result we are doing this task much slower and much more costly than if this feature is disabled and we are writing target table from itself.
After the first failure, successful restart is impossible, table is not selectable and not writable because Hive partition exists in metadata but folder does not exists and this causing FileNotFoundException in other queries from this table, which are not overwriting it.

The same described with less details on Amazon page which you are refering: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html

Another possible issue is described on Qubole page, some existing fix with using prefixes is mentioned, though this does not work with use-case above because writing new files into folder which is being read will anyway create a problem.

Also mappers, reducers may fail and restart, whole session may fail and restart, writing files directly even with postponed deleting the old ones seems not so good idea because it increases the chance of unrecoverable failure or data corruption.

To disable direct writes, set this configuration property:

set hive.allow.move.on.s3=true; --this disables direct write

You can use this feature for small tasks and when not reading the same table which is being written, though for small tasks it will not give you much. This optimizetion is most efficient when you are rewriting many partitions in a very big table and move task at the end is extremely slow, then you may want to enable it at risk of data corruption.

Why are direct writes to Amazon S3 eliminated in EMR 5.x versions?

1 Answers1

Linked