Hadoop InputSplit for large text-based files

Question

In hadoop I'd like to split a file (almost) equally to each mapper. The file is large and I want to use a specific number of mappers at which are defined at job start. Now I've customized the input split but I want to be sure that if I split the file in two (or more splits) I won't cut a line in half as I want each mapper to have complete lines and not broken ones.

So the question is this, how can I get the approximate size of a filesplit during each creation or if that is not possible how I can estimate the number of (almost) equal filesplits for a large file given the constraint that I don't want to have any broken lines in any mapper instance.

Look this answer, I think it will help you. http://stackoverflow.com/a/14540272/2436237 — gasparms, May 11 '14 at 12:32

score 1 · Answer 1 · edited May 23 '17 at 11:50

Everything that you are asking for is the default behavior in Map Reduce. Like mappers always process complete lines. By default Map Reduce strives to spread out the load among st mappers evenly.

You can get more details about it here you can check out the InputSplits para.

Also this answer here as linked by @Shaw, talks about how exactly the case of lines spread across blocks splits is handled.

A think a through reading of the hadoop bible should clear out most of your doubts in thsi regard

Hadoop InputSplit for large text-based files

1 Answers1