How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory; each mapper may read a different file, and I need to know which file it has read.
11 Answers
First you need to get the input split. With the newer mapreduce API this is done as follows:
context.getInputSplit();
But in order to get the file path and the file name, you will first need to cast the result to FileSplit.
So, to get the input file path you may do the following:
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
Similarly, to get the file name, you may just call upon getName(), like this:
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
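For context, here is a minimal sketch of how this might look inside a complete Mapper (the class and field names are illustrative, not from the original answer); the cast assumes a plain FileInputFormat rather than MultipleInputs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Resolve the source file name once per split rather than once per record.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tag every record with the name of the file it came from.
        context.write(new Text(fileName), value);
    }
}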

- make sure you choose the right class to include (mapred vs mapreduce) – Gavriel Apr 10 '14 at 12:20
- Out of curiosity, how did you figure this out? The documentation of getInputSplit doesn't suggest that this is possible (at least to me...). – Mzzzzzz May 18 '15 at 15:25
- This solution doesn't work anymore for multiple inputs, as the input split class returned is `TaggedInputSplit`, not `FileSplit`. – Hans Brende Mar 26 '18 at 06:38
- See: https://stackoverflow.com/a/49502905/2599133 for a solution that works for `TaggedInputSplit` as well. – Hans Brende Mar 27 '18 at 01:39
Use this inside your mapper:
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
Edit:
Try this if you want to do it inside configure() through the old API:

String fileName;

public void configure(JobConf job)
{
    fileName = job.get("map.input.file");
}
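For completeness, a minimal sketch of what a full old-API (org.apache.hadoop.mapred) mapper using configure() might look like; the class name and output types are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiFileNameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    public void configure(JobConf job) {
        // map.input.file holds the path of the file this map task is reading;
        // on newer Hadoop versions it may come back null (see the later answer using Reporter).
        fileName = job.get("map.input.file");
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        output.collect(new Text(fileName), value);
    }
}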

- I tried to use `context` but it does not have a method called `getInputSplit`. Am I using the old API? Besides, can I do these things in the configure function instead of the mapper? – HHH Sep 25 '13 at 19:42
- With the latest Hadoop 2.6.0 this does not work in mapreduce; can you suggest something for this? – Raghuveer Dec 18 '14 at 07:34
- In the end, I needed to resort to some fiendish reflection hackery, and it works! http://stackoverflow.com/questions/11130145/hadoop-multipleinputs-fails-with-classcastexception/11130420#11130420 – ruhong Apr 07 '15 at 13:46
If you are using Hadoop Streaming, you can use the JobConf variables in a streaming job's mapper/reducer.
As for the input file name of the mapper, see the Configured Parameters section; the map.input.file variable (the filename that the map is reading from) is the one that gets the job done. But note that:
Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.
For example, if you are using Python, you can put these lines in your mapper file:

import os
file_name = os.getenv('map_input_file')
print(file_name)
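A slightly fuller sketch of a streaming mapper that tags each input line with its source file; the fallback to mapreduce_map_input_file is an assumption based on the comment below, for newer YARN releases:

#!/usr/bin/env python
import os
import sys

# map_input_file is set by classic MapReduce; newer YARN releases may expose
# mapreduce_map_input_file instead (see the comment below).
file_name = os.getenv('map_input_file') or os.getenv('mapreduce_map_input_file', 'unknown')

for line in sys.stdin:
    # Emit the file name as the key, tab-separated from the original line.
    print('%s\t%s' % (file_name, line.rstrip('\n')))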

- This worked locally, but on EMR using YARN, I needed to use the suggestion in http://stackoverflow.com/questions/20915569/how-can-to-get-the-filename-from-a-streaming-mapreduce-job-in-r Specifically: `os.getenv('mapreduce_map_input_file')` – Bob Baxley Jan 28 '16 at 20:48
If you're using the regular InputFormat, use this in your Mapper:
InputSplit is = context.getInputSplit();
// note: these reflective calls throw checked exceptions, so the enclosing
// method must declare them or wrap this block in a try/catch
Method method = is.getClass().getMethod("getInputSplit");
method.setAccessible(true);
FileSplit fileSplit = (FileSplit) method.invoke(is);
String currentFileName = fileSplit.getPath().getName();
If you're using CombineFileInputFormat, it's a different approach, because it combines several small files into one relatively big split (depending on your configuration). Since both the Mapper and the RecordReader run in the same JVM, you can pass data between them at runtime. You need to implement your own CombineFileRecordReaderWrapper and do as follows:
public class MyCombineFileRecordReaderWrapper<K, V> extends RecordReader<K, V> {
    ...
    private static String mCurrentFilePath;
    ...
    public void initialize(InputSplit combineSplit, TaskAttemptContext context) throws IOException, InterruptedException {
        assert this.fileSplitIsValid(context);
        mCurrentFilePath = mFileSplit.getPath().toString();
        this.mDelegate.initialize(this.mFileSplit, context);
    }
    ...
    public static String getCurrentFilePath() {
        return mCurrentFilePath;
    }
    ...
Then, in your Mapper, use this:
String currentFileName = MyCombineFileRecordReaderWrapper.getCurrentFilePath();
Hope I helped :-)

I noticed that on Hadoop 2.4 and greater, using the old API, this method produces a null value:

String fileName = new String();

public void configure(JobConf job)
{
    fileName = job.get("map.input.file");
}
Alternatively, you can utilize the Reporter object passed to your map function to get the InputSplit and cast it to a FileSplit to retrieve the filename:
public void map(LongWritable offset, Text record,
                OutputCollector<NullWritable, Text> out, Reporter rptr)
        throws IOException {

    FileSplit fsplit = (FileSplit) rptr.getInputSplit();
    String inputFileName = fsplit.getPath().getName();
    ....
}

context.getInputSplit() gives you an InputSplit, which you then need to cast to FileSplit.
Example:

InputSplit inputSplit = context.getInputSplit();
Path filePath = ((FileSplit) inputSplit).getPath();
String filePathString = ((FileSplit) inputSplit).getPath().toString();

This helped me:
String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit()).getPath().getName();

The answers which advocate casting to FileSplit will no longer work, as FileSplit instances are no longer returned for multiple inputs (so you will get a ClassCastException). Instead, org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit instances are returned. Unfortunately, the TaggedInputSplit class is not accessible without using reflection. So here's a utility class I wrote for this. Just do:

Path path = MapperUtils.getPath(context.getInputSplit());

in your Mapper.setup(Context context) method.
Here is the source code for my MapperUtils class:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;
import java.util.Optional;

public class MapperUtils {

    public static Path getPath(InputSplit split) {
        return getFileSplit(split).map(FileSplit::getPath).orElseThrow(() ->
            new AssertionError("cannot find path from split " + split.getClass()));
    }

    public static Optional<FileSplit> getFileSplit(InputSplit split) {
        if (split instanceof FileSplit) {
            return Optional.of((FileSplit) split);
        } else if (TaggedInputSplit.clazz.isInstance(split)) {
            return getFileSplit(TaggedInputSplit.getInputSplit(split));
        } else {
            return Optional.empty();
        }
    }

    private static final class TaggedInputSplit {
        private static final Class<?> clazz;
        private static final MethodHandle method;

        static {
            try {
                clazz = Class.forName("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit");
                Method m = clazz.getDeclaredMethod("getInputSplit");
                m.setAccessible(true);
                method = MethodHandles.lookup().unreflect(m).asType(
                    MethodType.methodType(InputSplit.class, InputSplit.class));
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }

        static InputSplit getInputSplit(InputSplit o) {
            try {
                return (InputSplit) method.invokeExact(o);
            } catch (Throwable e) {
                throw new AssertionError(e);
            }
        }
    }

    private MapperUtils() { }
}
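Purely as an illustration of how this utility might be wired in (the class names below are assumptions, not from the original answer), a sketch of a Mapper whose setup() uses MapperUtils when the job is configured with MultipleInputs (which is what produces TaggedInputSplit):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggedInputMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // With MultipleInputs, getInputSplit() returns a TaggedInputSplit,
        // which MapperUtils.getPath unwraps via reflection.
        Path path = MapperUtils.getPath(context.getInputSplit());
        fileName = path.getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(fileName), value);
    }
}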

For the org.apache.hadoop.mapred package, the map function signature should be:

map(Object, Object, OutputCollector, Reporter)

So, to get the file name inside the map function, you can use the Reporter object like this:
String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();

package com.foo.bar;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

public class MapperUtils {

    public static Path getPath(InputSplit split) {
        FileSplit fileSplit = getFileSplit(split);
        if (fileSplit == null) {
            throw new AssertionError("cannot find path from split " + split.getClass());
        } else {
            return fileSplit.getPath();
        }
    }

    public static FileSplit getFileSplit(InputSplit split) {
        if (split instanceof FileSplit) {
            return (FileSplit) split;
        } else if (TaggedInputSplit.clazz.isInstance(split)) {
            return getFileSplit(TaggedInputSplit.getInputSplit(split));
        } else {
            return null;
        }
    }

    private static final class TaggedInputSplit {
        private static final Class<?> clazz;
        private static final MethodHandle method;

        static {
            try {
                clazz = Class.forName("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit");
                Method m = clazz.getDeclaredMethod("getInputSplit");
                m.setAccessible(true);
                method = MethodHandles.lookup().unreflect(m).asType(
                    MethodType.methodType(InputSplit.class, InputSplit.class));
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }

        static InputSplit getInputSplit(InputSplit o) {
            try {
                return (InputSplit) method.invokeExact(o);
            } catch (Throwable e) {
                throw new AssertionError(e);
            }
        }
    }

    private MapperUtils() { }
}
I rewrote the code hans-brende provided in Java 7, and it worked. But there is a problem: the File Input Format Counter "Bytes Read" is zero when using MultipleInputs.

With multiple inputs like this:
-Dwordcount.case.sensitive=false
hdfs://192.168.178.22:9000/user/hduser/inWiki
hdfs://192.168.178.22:9000/user/hduser/outWiki1
hdfs://192.168.178.22:9000/user/joe/wordcount/dict/dictionary.txt
-skip hdfs://192.168.178.22:9000/user/joe/wordcount/patterns.txt
For the file dictionary.txt I've written a procedure inside the Map code.
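The procedure itself isn't included above. Purely as an illustration (not the author's actual code), here is one way to branch on the input file name in setup() so that dictionary.txt is treated differently from the other inputs, reusing the FileSplit cast from the earlier answers:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DictionaryAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private boolean isDictionary;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // Branch on the source file: dictionary entries are handled differently
        // from the regular input records.
        isDictionary = "dictionary.txt".equals(fileName);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (isDictionary) {
            context.write(new Text("dict"), value);
        } else {
            context.write(new Text("data"), value);
        }
    }
}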
