Using Java to read files in HDFS and match multi-line blocks by regex

Question

I am working with a log analysis tool.

I am using the YARN log aggregation feature with Hadoop. The resulting Hadoop log files are so large that some API methods could not read their entire content into memory.

I want to match multi-line blocks within the files where the first line contains the string [map] and the last line contains [\map] - I think I can do this based on a regex. The commonly used BufferedReader could not satisfy my requirements.

My question is: Is there another way to go through the file line by line, checking for those matching my Regex?

P.S. I do not really want to split the file into multiple smaller files to process, as I am concerned that this will lead to some matching content not being found as I might split the file in the middle of a matching block.

The following is a fragment of the log file - I want the section between [MAP] and [\MAP]:

2015-04-16 20:30:09,240 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: dump TS struct
2015-04-16 20:30:09,240 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: 

    [MAP]Id =4
      [Children]
        [TS]Id =2
          [Children]
            [RS]Id =3
              [Parent]Id = 2 null[\Parent]
            [\RS]
         [\Children]
         [Parent]Id = 4 null[\Parent]
       [\TS]
      [\Children]
    [\MAP]

2015-04-16 20:30:09,241 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Initializing Self 4 MAP
2015-04-16 20:30:09,242 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 2 TS
2015-04-16 20:30:09,242 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 2 TS initialized

Hi - I've edited your question to help out with more readable English. You can improve it further by 1) Confirming it was `BufferedReader` you meant - not `BufferReader` 2) showing the code you've tried so far - and how it fails due to the large file size 3) Explaining your concern about splitting the file - are you worried that the split might occur in the middle of matching content? If so - there might be ways to deal with that. Adding that detail will make you more likely to get good help and less likely to see your question unanswered / closed. Good luck! — J Richard Snape, Apr 28 '15 at 08:57
@stribizhev As you said in your third point, I am worried the split might occur in the middle of matching content. You seem to have a solution? Could you share it with me? Thanks once again! :) — sol, Apr 28 '15 at 09:16
Thanks @stribizhev. Well - if you're bothered about the content being split, but you know it will be on one line - you could always use a command line terminal to export only full lines (e.g. on *nix http://stackoverflow.com/questions/7764755/). But you need to address 1) and 2) first. If you can show what you've tried (edit it into question) and how it failed - I might be able to answer. Currently - I can't. You might also mention which version of Java you're using (Java 8 might give you more options than 7) — J Richard Snape, Apr 28 '15 at 10:23

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

N.B. EDITED following clarification in comments

It may be possible to find your multi-line blocks using Regex - you can certainly write a Regex that would match them e.g. .*\[MAP\]((?s).*)\[\\MAP\] - noting that in Java you would also have to escape all the \ characters and that (?s) allows the . character to match newlines i.e.

String mapBlockRegex = ".*\\[MAP\\]((?s).*)\\[\\\\MAP\\]";
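
As a minimal sketch of how that pattern behaves, the class below applies it to a small in-memory string. The class name `MapBlockRegexDemo`, the helper `extractBody` and the sample text are hypothetical names made up for illustration, and the sketch assumes the whole text fits in memory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapBlockRegexDemo {
    // Same pattern as above: (?s) lets '.' match newlines inside the capturing group.
    static final Pattern MAP_BLOCK =
            Pattern.compile(".*\\[MAP\\]((?s).*)\\[\\\\MAP\\]");

    /** Returns the text between [MAP] and [\MAP], or null if no block is found. */
    static String extractBody(String text) {
        Matcher m = MAP_BLOCK.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample, shaped like the log fragment in the question.
        String sample = "log line before\n"
                + "    [MAP]Id =4\n"
                + "      [Children]\n"
                + "      [\\Children]\n"
                + "    [\\MAP]\n"
                + "log line after\n";
        System.out.println(extractBody(sample));
    }
}
```

Because the inner `.*` is greedy, text containing several blocks would be captured as one span from the first [MAP] to the last [\MAP]; a reluctant `.*?` would be one way to stop at the first end marker.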

However - as you have highlighted - this produces difficulties if the file won't fit in memory and splitting it also has some difficulty.

I'll propose a different idea - scan the file line by line and use a state variable to indicate whether you are in a block or not. Basic algorithm as follows

1. When you match the start of the block, set the state variable to true.
2. While the state is true, append the text to a StringBuilder.
3. When you match the end of the block, set the state variable to false and use the String you have built up, e.g. output it to a file, to the console or use it programmatically.

Java solution

I'll suggest one way to implement the above - using a Scanner - which goes through a stream line by line, discarding each line as it goes, thus avoiding OutOfMemoryError. Note this code can throw exceptions - I have declared them with throws, but you could handle them in a try..catch..finally block. Also note that Scanner swallows IOException but, as the docs say, if this is important to you:

The most recent IOException thrown by the underlying readable can be retrieved via the ioException() method.

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;


public class LogScanner
{

    public static void main(String[] args) throws FileNotFoundException
    {
        FileInputStream inputStream = null;
        Scanner sc = null;

        String path = "D:\\hadoopTest.log";
        String blockStart= ".*\\[MAP\\].*";
        String blockEnd = ".*\\[\\\\MAP\\].*";
        boolean inBlock = false;
        StringBuilder block = null;

        inputStream = new FileInputStream(path);
        sc = new Scanner(inputStream, "UTF-8");
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            if (line.matches(blockStart)) {
                inBlock = true;
                block = new StringBuilder();
            }

            if (inBlock) {
                block.append(line);
                block.append("\n");
            }

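            if (inBlock && line.matches(blockEnd)) {
                // Only close a block that was actually opened; a stray end
                // marker would otherwise dereference a null StringBuilder.
                inBlock = false;
                String completeBlock = block.toString();
                System.out.println(completeBlock);
                // I'm outputting the block to stdout, you could append to a file or elsewhere.
            }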
        }

        sc.close();
    }
}
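
If the swallowed IOException matters to you, one option is to check `ioException()` once the loop has finished. The following is a sketch under that assumption; the class `ScannerIoCheck` and the method `countLines` are hypothetical names, and the line counting merely stands in for the block-matching logic.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Scanner;

class ScannerIoCheck {
    /** Reads all lines and rethrows any IOException the Scanner swallowed. */
    static int countLines(Readable source) throws IOException {
        int lines = 0;
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextLine()) {
                sc.nextLine();
                lines++;
            }
            // A read failure makes hasNextLine() return false silently,
            // so ask the Scanner whether the input really ended cleanly.
            IOException swallowed = sc.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(new StringReader("first\nsecond\n")));
    }
}
```

The try-with-resources block also closes the Scanner (and its underlying source) if an exception is thrown part-way through.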

Caveat: Your file may have characteristics where this won't work without some adaptations. If you can have nested [MAP] blocks, then inBlock will need to be an int, which you increment when you match the block start and decrement when you match the end - appending while inBlock > 0 and only outputting the complete string when inBlock returns to zero.
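
The counter idea for nested blocks could look roughly like this. It is a sketch rather than tested production code: `NestedBlockScanner` and `collectBlocks` are hypothetical names, and it assumes each line opens or closes at most one block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NestedBlockScanner {
    static final String BLOCK_START = ".*\\[MAP\\].*";
    static final String BLOCK_END = ".*\\[\\\\MAP\\].*";

    /** Collects each outermost [MAP]..[\MAP] block, nested blocks included. */
    static List<String> collectBlocks(Scanner sc) {
        List<String> blocks = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        int depth = 0; // number of [MAP] blocks currently open
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            if (line.matches(BLOCK_START)) {
                depth++;
            }
            if (depth > 0) {
                block.append(line).append('\n');
                if (line.matches(BLOCK_END)) {
                    depth--;
                    if (depth == 0) {
                        // Outermost block closed: emit it and start afresh.
                        blocks.add(block.toString());
                        block.setLength(0);
                    }
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "[MAP]outer\n[MAP]inner\n[\\MAP]\n[\\MAP]\n";
        System.out.println(collectBlocks(new Scanner(sample)));
    }
}
```

End markers seen while no block is open are ignored, so the counter never goes negative.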

Command line split where seeking matches on a single line

If you were searching on a per-line basis and matches were guaranteed to be on a single line, then splitting would be OK as long as the splits only happened at the end of complete lines.

In that case, you could use the command line to split down the file. If you're on Linux (or, I think, any *nix) you can use the split command, e.g.

split --lines=75000

There is more detail in this question and answer

On Windows there isn't an equivalent command that I'm aware of, but you can install things that will do similar - e.g. GNU CoreUtils for Windows or 7-Zip. Caveat: I've never used these for splitting.

@J Richard Snape Hi, I have appended a fragment of the log file, which is large. I want to match the content between a pair of labels ([map]..[/map]); if the split occurs in the middle of a label pair, the matched content will be missed. — sol, Apr 29 '15 at 02:24
Yes - thanks - the example makes it much clearer what you are trying to do. That means the answer above won't work on its own. I will edit it when I have time (I'll also rephrase the question slightly so that the language makes it clear that you're actually looking for a multi-line block) — J Richard Snape, Apr 29 '15 at 08:35
@sol solution edited and tested with your sample. For each `[map]` to `[\map]` block you'll get a string in `completeBlock` that you can use. — J Richard Snape, Apr 29 '15 at 11:34
@J Richard Snape Thank you. I appreciate your help! I now have a similar solution working. You are a warmhearted man! :) — sol, Apr 30 '15 at 17:21
@sol Thanks, I'm glad it helped. I'd be grateful if you accept the answer if it helped ([click the tick next to the answer](http://meta.stackexchange.com/a/5235/261369)) and vote on it. Good luck with your programming. — J Richard Snape, Apr 30 '15 at 20:48

score 0 · Answer 2 · answered Apr 28 '15 at 10:23

Instead of BufferedReader you can use the Java NIO package, which is very fast compared to BufferedReader.
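
The answer does not say which NIO API it has in mind. One possible reading, sketched here, is `java.nio.file.Files.lines`, which streams a file lazily instead of loading it whole. The class `NioBlockReader` and the method `readBlocks` are hypothetical names, the sketch reads from the local file system rather than HDFS, and it makes no claim about relative speed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

class NioBlockReader {
    /** Streams the file line by line and collects each [MAP]..[\MAP] block. */
    static List<String> readBlocks(Path path) throws IOException {
        List<String> blocks = new ArrayList<>();
        StringBuilder block = null; // non-null only while inside a block
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            Iterator<String> it = lines.iterator();
            while (it.hasNext()) {
                String line = it.next();
                if (block == null && line.contains("[MAP]")) {
                    block = new StringBuilder();
                }
                if (block != null) {
                    block.append(line).append('\n');
                    if (line.contains("[\\MAP]")) {
                        blocks.add(block.toString());
                        block = null;
                    }
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a local log file as the first argument.
        for (String b : readBlocks(Paths.get(args[0]))) {
            System.out.println(b);
        }
    }
}
```

Only the current line and the block being built are held in memory at once, which is the same property the Scanner approach relies on.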

Shaik Mujahid Ali