I am trying to get some logging out of my mapper jobs, running on Dataproc.

Following the advice here, I defined a log4j logger and logged to it at INFO level:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class SampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Logger logger = Logger.getLogger(SampleMapper.class);

    @Override
    protected void setup(Context context) {
        // This is the message I expect to see in the container logs
        logger.info("Initializing NoSQL Connection.");
        try {
            // logic for connecting to NoSQL - omitted
        } catch (Exception ex) {
            logger.error(ex.getMessage());
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // mapper code omitted
    }
}

However, I can't find the logs anywhere: not through the Dataproc user interface, not by calling yarn logs on the master, and not even by logging in to the worker instances and searching in various sensible places.

Is there any configuration I am missing that should make it work?

Where is the default log4j configuration read from and how can I aggregate it?

daphshez

2 Answers

I'm surprised this isn't documented, but logs from all YARN containers are available in Stackdriver Logging. In the Cloud Console, go to Stackdriver -> Logging -> Logs and look for your cluster under Cloud Dataproc Cluster -> cluster name -> cluster uuid. Then select yarn-userlogs, which includes logs from all containers. You can filter by the application or container IDs, which are fields in the JSON payload.
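To make the filtering step concrete, here is a minimal sketch that assembles an advanced filter string you could paste into the Logs viewer. The field names `resource.labels.cluster_name` and `jsonPayload.application` are my assumptions about how these entries are labeled, and the class name, cluster name, and application ID are made up for illustration, so check them against an actual log entry from your cluster first.

```java
public class YarnUserLogFilter {
    // Builds a Stackdriver Logging filter for one YARN application's container logs.
    // Assumed field names: resource.labels.cluster_name and jsonPayload.application.
    public static String build(String clusterName, String applicationId) {
        return String.join(" AND ",
            "resource.type=\"cloud_dataproc_cluster\"",
            "resource.labels.cluster_name=\"" + clusterName + "\"",
            // ":" is the substring ("has") operator, so the project prefix is not needed
            "logName:\"yarn-userlogs\"",
            "jsonPayload.application=\"" + applicationId + "\"");
    }

    public static void main(String[] args) {
        // Hypothetical cluster name and application ID, for illustration only.
        System.out.println(build("my-cluster", "application_1531325498516_0001"));
    }
}
```

If the filter returns nothing, try dropping the last clause first and inspecting the payload of any yarn-userlogs entry to see which fields are really there.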

If you want YARN to collect logs for you on the cluster, consider setting up YARN log aggregation (instructions).

Karthik Palaniappan

This thread explains that logs are placed in /tmp on each worker, and it recommends configuring some YARN properties to use a GCS bucket. Although you can collect them that way, they won't be shown in Stackdriver. To get your custom messages into Stackdriver, you may want to use the google-cloud-logging library, for example:

import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;

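// Picks up the default project and credentials from the environment
LoggingOptions options = LoggingOptions.getDefaultInstance();
// Logging is AutoCloseable, so try-with-resources closes the client when done
try (Logging logging = options.getService()) {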
  // use logging here
}

You can find more information about the Stackdriver approach here.

rsantiago