
I am writing some Java code that takes correctly written .java files as input, and I want to extract the text between braces using a regular expression. I want to use the Pattern and Matcher classes, not for loops.

I believe it's best to create one regex that captures the text of the whole class, and then a second regex, applied to that output, that captures the text inside each method.

I got close to getting the class text using the following regex on online regex testers:

\w\sclass.*\{((.*\s*)*)\}

but I'm pretty sure I am doing it wrong by using two groups instead of just one. Furthermore, when I use this expression in Java I actually get nothing.
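A minimal sketch of how such an expression might look as a Java string literal with a single capturing group. It rests on assumptions: `Pattern.DOTALL` is enabled so that `.` also matches line breaks, and `find()` is used instead of `matches()`. Whether either of these explains the empty result is unknown, since the calling code is not shown. The names `ClassBodyDemo` and `classBody` are made up for illustration, and the pattern is a simplified variant, not the exact expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassBodyDemo {
    // Hypothetical helper: returns the text between the first '{' after the
    // class name and the last '}' in the source, or "" if nothing matches.
    static String classBody(String source) {
        // In a Java string literal every regex backslash is doubled: \s becomes "\\s".
        // With DOTALL, '.' matches line breaks, so a single group (.*) is enough.
        Pattern p = Pattern.compile("\\bclass\\s+\\w+[^{]*\\{(.*)\\}", Pattern.DOTALL);
        Matcher m = p.matcher(source);
        // find() looks for the pattern anywhere in the input;
        // matches() would require the whole input to match.
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String source = "package foo.bar;\n\npublic class Handy {\n    int one() { return 1; }\n}\n";
        System.out.println(classBody(source).trim());
    }
}
```

Because `(.*)` is greedy, the group runs up to the last closing brace in the file, which is only the class's own brace when the file contains a single top-level class.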

Here is an example file that I am using for debugging:

package foo.bar;

import java.io.File;

public class Handy {
    {
    // static block, dont care!
    }

    /**
     * Check if a string is Null and Empty
     * @param str
     * @return
     */
    public static boolean isNullOrEmpty(String str) {
        Boolean result = (str == null || str.isEmpty());
        return result;
    }

    /**
     * Mimics the String.format method with a smaller name
     * @param format
     * @param args
     * @return
     */
    public static String f(String format, Object... args)
    {
        return String.format(format, args);
    }
}

With the example code above, I expect to get:

  • entire class text
{
// static block, dont care!
}

/**
 * Check if a string is Null and Empty
 * @param str
 * @return
 */
public static boolean isNullOrEmpty(String str) {
    Boolean result = (str == null || str.isEmpty());
    return result;
}

/**
 * Mimics the String.format method with a smaller name
 * @param format
 * @param args
 * @return
 */
public static String f(String format, Object... args)
{
    return String.format(format, args);
}
  • individual method text
Boolean result = (str == null || str.isEmpty());
return result;
return String.format(format, args);

I know how to use the Pattern and Matcher classes already; I just need the right regexes.

Luis Ferreira
  • You sort of forgot to tell us exactly what you want to match here, but that doesn't really matter, because regex is not a suitable tool for parsing nested source code, which is what Java is. – Tim Biegeleisen May 06 '19 at 14:00
  • Your match is in the first capturing group https://regex101.com/r/M47iI9/2 – The fourth bird May 06 '19 at 14:01
  • Unless this is an educative project I strongly suggest that you use an existing parser, you'll find many exist for Java. – Aaron May 06 '19 at 14:02
  • Note that irregular problems such as parsing Java source code generally aren't a good fit for regular expressions. You'll very likely run into situations that cause false positives or negatives, or that would require massively complex expressions to handle (if you can even get them right). Also note that the expression `(.*\s*)*` is already very vulnerable to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html) (and `.*` matches anything, even whitespace, so just `.*` would be mostly equivalent). – Thomas May 06 '19 at 14:04
  • @TimBiegeleisen I did say, but maybe not so clearly. Basically I want the text that makes up the individual methods, and the name of the corresponding method. I don't care about logic. Do you have another idea instead of regex? – Luis Ferreira May 06 '19 at 14:11
  • Here's an example of why your regex could match the wrong thing even if you get it right for your test cases: imagine you want to parse a file that's in a package named `classification`. The file would start with `package classification`, and that would be matched by `\w\sclass.*`. And that's just _one_ simple example. I admit it's a little contrived, but unless you really have control over what the code looks like, chances are that something like this will happen eventually. – Thomas May 06 '19 at 14:11
  • @Aaron it is an experimental project. I don't care at all about the logic of the code or whether there are errors; I just want the characters that make up a certain method, all of them. I understand it might seem strange to be parsing code as if it were just simple text, but that's what I am after, not parsing proper Java code with the intention of running it. – Luis Ferreira May 06 '19 at 14:11
  • @Thefourthbird maybe I am making some mistake with backslashes when converting this to proper Java code, but it should work, right? And is there a cleaner way to do this without using two groups? – Luis Ferreira May 06 '19 at 14:13
  • Well, a parser doesn't necessarily have to only accept compilable or runnable code. And it's easier to find (potential) errors in the code when using a parser as opposed to a regex (which for example couldn't distinguish between actual code and commented code or strings). – Thomas May 06 '19 at 14:15
  • One problem with your expression is that `.*` matches anything, even an empty string of length 0, and it's a greedy quantifier, so the match will be as long as possible. Thus `.*\{((.*\s*)*)\}` should result in the following group, provided that `.` matches line breaks as well (that can be set via a flag and we don't know whether you do that or not): `"return String.format(format, args); }"` (it starts at the last opening brace and matches until the last closing brace). – Thomas May 06 '19 at 14:20
  • Regex is not for code parsing. That should be done with lexers/parsers. – duffymo May 06 '19 at 14:20
  • @duffymo OK, I'll rephrase, again. I don't care about the code itself or its logic; I just want the plain text that makes up a certain method and class. Of the options I know, regex seemed to work, but I am open to others. I am avoiding real code parsers because that seems like a much more complex task than a regex. – Luis Ferreira May 06 '19 at 14:25
  • Nope. You shouldn't avoid them. There's a quip that says "You've got a problem. You think regex is a good solution. Now you have two problems." True for your case. – duffymo May 06 '19 at 14:27
  • +1 for lexer/parser such as [ANTLR](https://fr.wikipedia.org/wiki/ANTLR), but if you don't want to bother learning to use them (they're not trivial to use) then go for a character by character parser based on `Scanner` in which you'll implement what to do when you parse a `{` or `}` – Aaron May 06 '19 at 14:27
  • @Aaron just "parsing" character by character could get complex too, though: you'd need to distinguish between initializer blocks, classes (outer and inner), braces in comments such as JavaDoc's `{@code}`, braces in string literals (not that uncommon), methods, blocks inside methods etc. – Thomas May 06 '19 at 14:45
  • Well, if the problem is complex (and I don't think OP cares about half the features you mentioned) the solution will be complex, and the more complex it gets, the more time will be gained by learning to use a proper lexer/parser. That said, a character-by-character parser will solve regex's glaring problem (it can't represent infinitely nested paired brackets) and would be a good introduction to parsing. – Aaron May 06 '19 at 14:50
  • I think the real answer is: get a good IDE. IntelliJ from JetBrains is the best IDE on the market. Use it. They are much smarter than all of us. – duffymo May 06 '19 at 15:04
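
The character-by-character approach suggested in the comments can be sketched roughly as follows. This is a minimal illustration, not anyone's posted solution: the class name `BraceScanner` and the method `blocksAtDepth` are hypothetical, and the sketch deliberately ignores the complications raised above (braces inside comments, string literals, and inner classes are all counted as if they were ordinary blocks).

```java
import java.util.ArrayList;
import java.util.List;

public class BraceScanner {
    // Hypothetical helper: collects the trimmed text inside every block that is
    // opened at the given nesting depth (1 = class body, 2 = methods and
    // initializer blocks of a top-level class).
    static List<String> blocksAtDepth(String source, int targetDepth) {
        List<String> blocks = new ArrayList<>();
        int depth = 0;   // number of currently open braces
        int start = -1;  // index just after the opening brace of the current block
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
                if (depth == targetDepth) {
                    start = i + 1;
                }
            } else if (c == '}') {
                if (depth == targetDepth && start >= 0) {
                    blocks.add(source.substring(start, i).trim());
                    start = -1;
                }
                depth--;
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String source = "class Handy {\n"
                + "    { /* initializer */ }\n"
                + "    static boolean isNullOrEmpty(String s) { return s == null || s.isEmpty(); }\n"
                + "    static String f(String fmt, Object... a)\n"
                + "    {\n"
                + "        return String.format(fmt, a);\n"
                + "    }\n"
                + "}\n";
        for (String block : blocksAtDepth(source, 2)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

Keeping a depth counter is what lets this handle nested braces, which a single regular expression cannot do in general. Telling methods apart from initializer blocks would still need extra logic on top of it.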

1 Answer

After some confusion in the comments section, I would like to share my solution to what I asked, even if the question was not very clear.

This is not thoroughly tested code, but it works for my purpose. Some adjustments or improvements are very likely possible. I took some inspiration from the comments I read in this post, and others like this.

I feed each of the following methods the entire plain text found in a .java file, and from there I use Pattern and Matcher to extract what I want.

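import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the requested group of the first match, or "" when nothing matches.
private static String patternMatcher(String content, String patternText, int groupIndex) {
    Pattern pattern = Pattern.compile(patternText);
    Matcher matcher = pattern.matcher(content);

    if (matcher.find()) {
        return matcher.group(groupIndex);
    } else {
        return "";
    }
}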

public static String getPackageName(String content) {
    return patternMatcher(content, ".*package\\s+(.*)\\s*\\;", 1);
}

public static String getClassName(String content) {
    // [\\w\\s]* rather than + so that "class Handy{" yields "Handy", not "Hand"
    return patternMatcher(content, ".*class\\s+(\\w+)[\\w\\s]*\\{", 1);
}

public static String getClassCode(String content) {
    return patternMatcher(content, ".*class.*\\{((.*\\s*)*)\\}", 1);
}

public static String getMethodName(String code) {
    String uncommentedCode = removeComments(code).trim();

    return patternMatcher(uncommentedCode,
            "(public|private|static|protected|abstract|native|synchronized) *([\\w<>.?, \\[\\]]*)\\s+(\\w+)\\s*\\([\\w<>\\[\\]._?, \\n]*\\)\\s*([\\w ,\\n]*)\\s*\\{",
            3);
}

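public static String removeComments(String content) {
    // (?m) makes ^ and $ match at every line, so // comments are stripped on all lines, not only the last one
    return content.replaceAll("(?m)\\/\\*[\\s\\S]*?\\*\\/|([^:]|^)\\/\\/.*$", "$1 ").trim();
}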
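
A small usage sketch of feeding a file's text to one of these helpers. It makes some assumptions: Java 11 or later for `Files.readString` and `Files.writeString`, a temporary file standing in for a real .java file, and a made-up wrapper class `HelperUsageDemo`. The two helper methods are repeated only so that the example compiles by itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelperUsageDemo {
    // Same logic as patternMatcher above, repeated so the example is self-contained.
    static String patternMatcher(String content, String patternText, int groupIndex) {
        Matcher matcher = Pattern.compile(patternText).matcher(content);
        return matcher.find() ? matcher.group(groupIndex) : "";
    }

    static String getPackageName(String content) {
        return patternMatcher(content, ".*package\\s+(.*)\\s*\\;", 1);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: a temporary file stands in for a real .java source file.
        Path file = Files.createTempFile("Handy", ".java");
        Files.writeString(file, "package foo.bar;\n\npublic class Handy {\n}\n");

        // Read the whole file into one String and hand it to the helper.
        String content = Files.readString(file);
        System.out.println(getPackageName(content));

        Files.delete(file);
    }
}
```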

I double-checked, but I hope I didn't forget any escape character; be careful with those.

Lots of people recommended that I use an actual code parsing library, like ANTLR, but I assumed it would take me much longer to learn how to work with it than it would take to do this with regex. Furthermore, I wanted to improve my regex skills, and this exercise definitely taught me some things.

Luis Ferreira