
I need to find the first complete pair of parentheses in a Java String and, if it is non-nested, return its content. The current issue is that parentheses may be represented by different characters in different locales/languages.

My first idea was, of course, to use regular expressions. But besides the fact that it seems quite difficult (at least to me) to make sure that a match contains no nested parentheses when something like "\((.*)\)" is used, there seems to be no class of parenthesis-like characters available in Java's Matcher.

Thus, I tried to solve the problem more imperatively, but stumbled across the issue that the data I need to process is in different languages, and the parenthesis characters differ by locale. Western: (), Chinese (Locale "zh"): ()

package main;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class FindParentheses {

    static public Set<String> searchNames(final String string) throws IOException {
        final Set<String> foundName = new HashSet<>();
        final BufferedReader stringReader = new BufferedReader(new StringReader(string));
        for (String line = stringReader.readLine(); line != null; line = stringReader.readLine()) {
            final int indexOfFirstOpeningBrace = line.indexOf('(');
            if (indexOfFirstOpeningBrace > -1) {
                final String afterFirstOpeningParenthesis = line.substring(indexOfFirstOpeningBrace + 1);
                final int indexOfNextOpeningParenthesis = afterFirstOpeningParenthesis.indexOf('(');
                final int indexOfNextClosingParenthesis = afterFirstOpeningParenthesis.indexOf(')');
                /*
                 * If the following condition is fulfilled, there is a simple braced expression
                 * after the found product's short name. Otherwise, there may be an additional
                 * nested pair of braces, or the closing brace may be missing, in which cases the
                 * expression is rejected as a product's long name.
                 */
                if (indexOfNextClosingParenthesis > 0
                    && (indexOfNextClosingParenthesis < indexOfNextOpeningParenthesis
                        || indexOfNextOpeningParenthesis < 0)) {
                    final String content = afterFirstOpeningParenthesis.substring(0, indexOfNextClosingParenthesis);
                    foundName.add(content);
                }
            }
        }
        return foundName;
    }

    public static void main(final String args[]) throws IOException {
        for (final String foundName : searchNames(
            "Something meaningful: shortName1 (LongName 1).\n" +
                "Localization issue here: shortName2 (保险丝2). This one should be found, too.\n" +
                "Easy again: shortName3 (LongName 3).\n" +
            "Yet more random text...")) {
            System.out.println(foundName);
        }
    }

}

The second name, which is enclosed in Chinese (full-width) parentheses, is not found, but should be. Of course I could match those characters as an additional special case, but as my project uses 23 languages, including Korean and Japanese, I would prefer a solution that finds any pair of parentheses.

  • Well, you might try using `String par_rx = "\\p{Ps}[^\\p{Ps}\\p{Pe}]*\\p{Pe}";`, but these are not quite consistent (as is the case with ORNATE PARENTHESIS). I think you need to enumerate the parentheses you want to support and use a regex like `(?<=\()[^()]+(?=\))|(?<=()[^()]+(?=))|etc.` – Wiktor Stribiżew Aug 01 '19 at 15:27
  • I have to sleep sometimes, so I have not had the opportunity to completely implement and test one yet. I'm about to do so, though. – Gunnar Arndt Aug 02 '19 at 08:23
  • Same problems with sleep here, too :) Take your time, I just wanted to make sure the question does not end up without a confirmed answer. – Wiktor Stribiżew Aug 02 '19 at 08:25

3 Answers


I'm guessing that you might want to design an expression, maybe similar to:

[((]\s*([^))]*)\s*[))]

where your desired parentheses would go in these char classes:

[((]

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class re{
    public static void main(String[] args){
        final String regex = "[((]\\s*([^))]*)\\s*[))]";
        final String string = "Something meaningful: shortName1 (LongName 1) Localization issue here: shortName2 (保险丝2). This one should be found, too. Easy again: shortName3 (LongName 3). Yet more random text... Something meaningful: shortName1 (LongName 1) Localization issue here: shortName2 (保险丝2). This one should be found, too. Easy again: shortName3 (LongName 3). Yet more random text...";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Output

Full match: (LongName 1)
Group 1: LongName 1
Full match: (保险丝2)
Group 1: 保险丝2
Full match: (LongName 3)
Group 1: LongName 3
Full match: (LongName 1)
Group 1: LongName 1
Full match: (保险丝2)
Group 1: 保险丝2
Full match: (LongName 3)
Group 1: LongName 3

Another option would be:

(?<=[((])[^))]*(?=[))])

which would output:

Full match: LongName 1
Full match: 保险丝2
Full match: LongName 3
Full match: LongName 1
Full match: 保险丝2
Full match: LongName 3

Demo

The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.

Reference

List of all unicode's open/close brackets?

Emma

You may make use of the Unicode category classes \p{Ps} (Punctuation, Open) and \p{Pe} (Punctuation, Close).

String par_paired_punct = "\\p{Ps}([^\\p{Ps}\\p{Pe}]*)\\p{Pe}";

They match a bit more than parentheses, but you may exclude the chars you do not want "manually".
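As a rough illustration of the category-based pattern above, the following sketch extracts the content of the first match. The class name `PairedPunctuationDemo` and the helper `firstContent` are made up for this example, and the non-ASCII sample text is written with Unicode escapes so that it does not depend on the source file encoding. Note that the pattern accepts any opener followed by any closer, so it does not check that the two belong together.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PairedPunctuationDemo {
    // One "Punctuation, Open" char, a run containing no open/close punctuation,
    // then one "Punctuation, Close" char. Group 1 is the enclosed content.
    private static final Pattern PAIRED =
        Pattern.compile("\\p{Ps}([^\\p{Ps}\\p{Pe}]*)\\p{Pe}");

    // Hypothetical helper: content of the first match, if there is one.
    static Optional<String> firstContent(String text) {
        Matcher m = PAIRED.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // ASCII parentheses
        System.out.println(firstContent("shortName1 (LongName 1)."));
        // Full-width parentheses U+FF08 / U+FF09 around Chinese text
        System.out.println(firstContent("shortName2 \uFF08\u4FDD\u9669\u4E1D2\uFF09."));
        // Mismatched opener/closer is still accepted by this pattern
        System.out.println(firstContent("mixed (abc]"));
    }
}
```

The last call shows the limitation: `(abc]` is treated as a pair because `]` is also in the Punctuation, Close category.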

In Punctuation, Open class, the following chars are not left brackets or parentheses:

U+0F3A  TIBETAN MARK GUG RTAGS GYON ༺   
U+0F3C  TIBETAN MARK ANG KHANG GYON ༼   
U+169B  OGHAM FEATHER MARK  ᚛   
U+201A  SINGLE LOW-9 QUOTATION MARK ‚   
U+201E  DOUBLE LOW-9 QUOTATION MARK „   
U+27C5  LEFT S-SHAPED BAG DELIMITER ⟅   
U+29D8  LEFT WIGGLY FENCE   ⧘   
U+29DA  LEFT DOUBLE WIGGLY FENCE    ⧚   
U+2E42  DOUBLE LOW-REVERSED-9 QUOTATION MARK    ⹂   
U+301D  REVERSED DOUBLE PRIME QUOTATION MARK    〝   
U+FD3F  ORNATE RIGHT PARENTHESIS    ﴿   

In the Punctuation, Close class, the following are not paired bracket chars:

U+0F3B  TIBETAN MARK GUG RTAGS GYAS ༻   
U+0F3D  TIBETAN MARK ANG KHANG GYAS ༽   
U+169C  OGHAM REVERSED FEATHER MARK ᚜   
U+27C6  RIGHT S-SHAPED BAG DELIMITER    ⟆   
U+29D9  RIGHT WIGGLY FENCE  ⧙   
U+29DB  RIGHT DOUBLE WIGGLY FENCE   ⧛
U+301E  DOUBLE PRIME QUOTATION MARK 〞
U+301F  LOW DOUBLE PRIME QUOTATION MARK 〟   
U+FD3E  ORNATE LEFT PARENTHESIS ﴾   

And the regex will look like

String par_rx = "[\\p{Ps}&&[^\\u0F3A\\u0F3C\\u169B\\u201A\\u201E\\u27C5\\u29D8\\u29DA\\u2E42\\u301D\\uFD3F]]" +
                 "((?:[^\\p{Ps}\\p{Pe}]|[\\u0F3A\\u0F3C\\u169B\\u201A\\u201E\\u27C5\\u29D8\\u29DA\\u2E42\\u301D\\uFD3F\\u0F3B\\u0F3D\\u169C\\u27C6\\u29D9\\u29DB\\u301E\\u301F\\uFD3E])*)" +
                 "[\\p{Pe}&&[^\\u0F3B\\u0F3D\\u169C\\u27C6\\u29D9\\u29DB\\u301E\\u301F\\uFD3E]]";
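To see what the `&&[^...]` part of such a class does, here is a small sketch with a deliberately shortened exclusion list (only the two low-9 quotation marks). The names `OpenerClassDemo` and `isOpener` are invented for the illustration and are not part of the regex above.

```java
import java.util.regex.Pattern;

public class OpenerClassDemo {
    // Character class intersection: everything in "Punctuation, Open"
    // except U+201A and U+201E (a shortened exclusion list for illustration).
    static final Pattern OPENER = Pattern.compile("[\\p{Ps}&&[^\\u201A\\u201E]]");

    // Hypothetical helper: true if the whole string is one accepted opener.
    static boolean isOpener(String s) {
        return OPENER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOpener("("));      // ASCII left parenthesis
        System.out.println(isOpener("\uFF08")); // FULLWIDTH LEFT PARENTHESIS
        System.out.println(isOpener("\u201E")); // DOUBLE LOW-9 QUOTATION MARK, excluded
    }
}
```

The same intersection syntax works for the closing side with `\p{Pe}`.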
Wiktor Stribiżew
  • Thanks for pointing out the categories. Is there any definite list of uses of certain characters (of such a class) in different languages? But I think I'll just collect all parentheses chars from them, and just ignore locales/languages. – Gunnar Arndt Aug 01 '19 at 16:27
  • @GunnarArndt Regex and those categories are more script oriented than language oriented. – Wiktor Stribiżew Aug 01 '19 at 16:46

Emma's answer links to Brian Campbell's list of all Unicode brackets. I used it to enumerate all relevant characters, as Wiktor Stribiżew suggested; in my case, all parentheses are of interest.

In addition, I preferred to make sure that only matching parentheses are considered, which led me to this ugly regular expression in Java:

public static final String ANY_PARENTHESES = "\\([^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+\\)|⁽[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+⁾|₍[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+₎|❨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+❩|❪[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+❫|⟮[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+⟯|⦅[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+⦆|⸨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+⸩|﴾[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+﴿|︵[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+︶|﹙[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+﹚|([^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+)|⦅[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙(⦅\\)⁾₎❩❫⟯⦆⸩﴿︶﹚)⦆]+⦆";

which I actually constructed with the following code:

    public static final char LEFT_PARENTHESIS = '\u0028', // (
        SUPERSCRIPT_LEFT_PARENTHESIS = '\u207D', // ⁽
        SUBSCRIPT_LEFT_PARENTHESIS = '\u208D', // ₍
        MEDIUM_LEFT_PARENTHESIS_ORNAMENT = '\u2768', // ❨
        MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT = '\u276A', // ❪
        MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS = '\u27EE', // ⟮
        LEFT_WHITE_PARENTHESIS = '\u2985', // ⦅
        LEFT_DOUBLE_PARENTHESIS = '\u2E28', // ⸨
        ORNATE_LEFT_PARENTHESIS = '\uFD3E', // ﴾
        PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS = '\uFE35', // ︵
        SMALL_LEFT_PARENTHESIS = '\uFE59', // ﹙
        FULLWIDTH_LEFT_PARENTHESIS = '\uFF08', // (
        FULLWIDTH_LEFT_WHITE_PARENTHESIS = '\uFF5F'; // ⦅

    public static final char RIGHT_PARENTHESIS = '\u0029', // )
        SUPERSCRIPT_RIGHT_PARENTHESIS = '\u207E', // ⁾
        SUBSCRIPT_RIGHT_PARENTHESIS = '\u208E', // ₎
        MEDIUM_RIGHT_PARENTHESIS_ORNAMENT = '\u2769', // ❩
        MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT = '\u276B', // ❫
        MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS = '\u27EF', // ⟯
        RIGHT_WHITE_PARENTHESIS = '\u2986', // ⦆
        RIGHT_DOUBLE_PARENTHESIS = '\u2E29', // ⸩
        ORNATE_RIGHT_PARENTHESIS = '\uFD3F', // ﴿
        PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS = '\uFE36', // ︶
        SMALL_RIGHT_PARENTHESIS = '\uFE5A', // ﹚
        FULLWIDTH_RIGHT_PARENTHESIS = '\uFF09', // )
        FULLWIDTH_RIGHT_WHITE_PARENTHESIS = '\uFF60'; // ⦆

    public static final String NO_PARENTHESES = "[^\\" + LEFT_PARENTHESIS + SUPERSCRIPT_LEFT_PARENTHESIS
        + SUBSCRIPT_LEFT_PARENTHESIS + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT
        + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + LEFT_WHITE_PARENTHESIS + LEFT_DOUBLE_PARENTHESIS
        + ORNATE_LEFT_PARENTHESIS + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + SMALL_LEFT_PARENTHESIS
        + FULLWIDTH_LEFT_PARENTHESIS + FULLWIDTH_LEFT_WHITE_PARENTHESIS + "\\" + RIGHT_PARENTHESIS
        + SUPERSCRIPT_RIGHT_PARENTHESIS + SUBSCRIPT_RIGHT_PARENTHESIS + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT
        + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS
        + RIGHT_WHITE_PARENTHESIS + RIGHT_DOUBLE_PARENTHESIS + ORNATE_RIGHT_PARENTHESIS
        + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS + SMALL_RIGHT_PARENTHESIS + FULLWIDTH_RIGHT_PARENTHESIS
        + FULLWIDTH_RIGHT_WHITE_PARENTHESIS + "]+";

    public static final String PARENTHESES = "\\" + LEFT_PARENTHESIS + NO_PARENTHESES + "\\" + RIGHT_PARENTHESIS;

    public static final String SUPERSCRIPT_PARENTHESES =
        "" + SUPERSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUPERSCRIPT_RIGHT_PARENTHESIS;

    public static final String SUBSCRIPT_PARENTHESES =
        "" + SUBSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUBSCRIPT_RIGHT_PARENTHESIS;

    public static final String MEDIUM_PARENTHESES_ORNAMENT =
        "" + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT;

    public static final String MEDIUM_FLATTENED_PARENTHESES_ORNAMENT =
        "" + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT;

    public static final String MATHEMATICAL_FLATTENED_PARENTHESES =
        "" + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + NO_PARENTHESES + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS;

    public static final String WHITE_PARENTHESES =
        "" + LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + RIGHT_WHITE_PARENTHESIS;

    public static final String DOUBLE_PARENTHESES =
        "" + LEFT_DOUBLE_PARENTHESIS + NO_PARENTHESES + RIGHT_DOUBLE_PARENTHESIS;

    public static final String ORNATE_PARENTHESES =
        "" + ORNATE_LEFT_PARENTHESIS + NO_PARENTHESES + ORNATE_RIGHT_PARENTHESIS;

    public static final String PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES =
        "" + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + NO_PARENTHESES
        + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS;

    public static final String SMALL_PARENTHESES =
        "" + SMALL_LEFT_PARENTHESIS + NO_PARENTHESES + SMALL_RIGHT_PARENTHESIS;

    public static final String FULLWIDTH_PARENTHESES =
        "" + FULLWIDTH_LEFT_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_PARENTHESIS;

    public static final String FULLWIDTH_WHITE_PARENTHESES =
        "" + FULLWIDTH_LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_WHITE_PARENTHESIS;

    public static final char XOR = '|'; // regex alternation operator

    public static final String ANY_PARENTHESES = PARENTHESES
        + XOR + SUPERSCRIPT_PARENTHESES
        + XOR + SUBSCRIPT_PARENTHESES
        + XOR + MEDIUM_PARENTHESES_ORNAMENT
        + XOR + MEDIUM_FLATTENED_PARENTHESES_ORNAMENT
        + XOR + MATHEMATICAL_FLATTENED_PARENTHESES
        + XOR + WHITE_PARENTHESES
        + XOR + DOUBLE_PARENTHESES
        + XOR + ORNATE_PARENTHESES
        + XOR + PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES
        + XOR + SMALL_PARENTHESES
        + XOR + FULLWIDTH_PARENTHESES
        + XOR + FULLWIDTH_WHITE_PARENTHESES;

Note however that it does not reject nested parentheses.
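If nested parentheses have to be rejected as well, a plain scan over the line may be simpler than a regex. The following is only a sketch under my own assumptions: the class name `FirstParenthesesSketch` and the method `firstSimpleContent` are hypothetical, the two strings list the same 13 pairs as the constants above (position i in one pairs with position i in the other), and an empty pair is rejected just as in the code from the question.

```java
import java.util.Optional;

public class FirstParenthesesSketch {
    // Same 13 pairs as the constants above; index i in OPENERS pairs with index i in CLOSERS.
    private static final String OPENERS =
        "(\u207D\u208D\u2768\u276A\u27EE\u2985\u2E28\uFD3E\uFE35\uFE59\uFF08\uFF5F";
    private static final String CLOSERS =
        ")\u207E\u208E\u2769\u276B\u27EF\u2986\u2E29\uFD3F\uFE36\uFE5A\uFF09\uFF60";

    // Returns the content of the first pair in the line, or empty if that pair
    // is nested, mismatched, unclosed, or has no content.
    static Optional<String> firstSimpleContent(String line) {
        int open = -1; // index of the first opener, -1 while none has been seen
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (open < 0) {
                if (OPENERS.indexOf(c) >= 0) {
                    open = i;
                }
            } else if (OPENERS.indexOf(c) >= 0) {
                return Optional.empty(); // a second opener before the closer: nested
            } else if (CLOSERS.indexOf(c) >= 0) {
                boolean samePair = OPENERS.indexOf(line.charAt(open)) == CLOSERS.indexOf(c);
                boolean nonEmpty = i > open + 1;
                return samePair && nonEmpty
                    ? Optional.of(line.substring(open + 1, i))
                    : Optional.empty();
            }
        }
        return Optional.empty(); // no opener, or the opener was never closed
    }

    public static void main(String[] args) {
        System.out.println(firstSimpleContent("shortName1 (LongName 1)."));
        System.out.println(firstSimpleContent("shortName2 \uFF08\u4FDD\u9669\u4E1D2\uFF09."));
        System.out.println(firstSimpleContent("nested (outer (inner) rest)"));
    }
}
```

Because openers and closers are compared by position, a pair such as an ASCII opener with a full-width closer is rejected instead of being silently accepted.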