I have these strings:

wordsExpanded="test |  is |  [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] |  test |  [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] |  [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]"

interpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}"

What I need as output is a string like this:

finalOutput="test |  is | thirty four | test | 3 | 1 "

Basically, the interpretation string has the information needed to determine which group has been used. For the first one, we used <number_type_0 words>, and therefore the proper string is "(thirty four)" and not "( 3 4 )". The second one would be "( 3 )" and the third "( 1 )".

Here is my code so far:

package com.test.prova;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Prova {

    public static void main(String[] args) {
        String nlInterpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        String inputText="this is 34 test 3 1";
        String grammar="test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("[^\\s\"'\\[]+|\\[([^\\]]*)\\]|'([^']*)'");
        Matcher regexMatcher = regex.matcher(grammar);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            } else if (regexMatcher.group(2) != null) {
                matchList.add(regexMatcher.group(2));
            } else {
                matchList.add(regexMatcher.group());
            }
        } 

        String[] xx = matchList.toArray(new String[0]);
        String[] yy = inputText.split(" ");

        matchList = new ArrayList<String>();
        regex = Pattern.compile("[^<]+|<([^>]*)>");
        regexMatcher = regex.matcher(nlInterpretation);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            }
        } 
        String[] zz = matchList.toArray(new String[0]);
        System.out.println(String.join(" | ",zz));

        for (int i=0; i<xx.length; i++) {
            if (xx[i].contains("number_type_")) {
                matchList = new ArrayList<String>();
                regex = Pattern.compile("[^\\(]+|<([^\\)]*)>.*[^<]+|<([^>]*)>");
                regexMatcher = regex.matcher(xx[i]);
                while (regexMatcher.find()) {
                    if (regexMatcher.group(1) != null) {
                        matchList.add(regexMatcher.group(1));
                    } else if (regexMatcher.group(2) != null) {
                        matchList.add(regexMatcher.group(2));
                    } else {
                        matchList.add(regexMatcher.group());
                    }
                } 
                System.out.println(String.join(" | ",matchList.toArray(new String[0])));
            }
            System.out.printf("%02d\t%s\t->%s\n", i, yy[i], xx[i]);
        }
    }
}

The output generated is as follows:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is  ->is
thirty four) {<number_type_0 words>} |  3  4 ) {<number_type_0 digits>}
02  34  ->(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}
03  test    ->test
three) {<number_type_1 words>} |  3 ) {<number_type_1 digits>}
04  3   ->(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}
one) {<number_type_2 words>} |  1 ) {<number_type_2 digits>}
05  1   ->(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}

What I would like is more like this:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is      ->is
02  34      ->thirty four
03  test    ->test
04  3       ->3
05  1       ->1
ekad
  • Could you show a third example? – alayor Feb 13 '17 at 04:19
  • it is very unclear. Give an example – Sagar V Feb 13 '17 at 04:20
  • I have NO IDEA what this question means. Are the grey parts the actual strings, or are they metasyntactic variables? – Dawood ibn Kareem Feb 13 '17 at 04:21
  • I hope this is more clear. – Andre Couture Feb 13 '17 at 04:21
  • No, I'm afraid it really isn't. – Dawood ibn Kareem Feb 13 '17 at 04:23
  • I need to replace the groups delimited by square brackets in wordsExpanded with the proper matching sequences, based on the interpretation string. Note that the interpretation string has 3 blocks in this example: "<number_type_2 digits>", "<number_type_1 digits>" and "<number_type_0 words>". In the first string we find 2 options for number_type_0, one with "words" and the other with "digits". I need to match the right one and return the associated string found in ( ) just before it. – Andre Couture Feb 13 '17 at 04:24
  • In this group, "[(thirty four) {<number_type_0 words>}( 3 4 ) {<number_type_0 digits>}]", there are 2 parts: "(thirty four) {<number_type_0 words>}" and "( 3 4 ) {<number_type_0 digits>}". From the interpretation string we have 3 cases: for number_type_0 we used "words", and for the other 2 we used "digits". So the goal is to return the string in () that matches the interpretation for that number_type. – Andre Couture Feb 13 '17 at 04:28
  • It seems this involves a lot of parsing and string processing. Have you written some code so far? If you have, please add it to the question. – alayor Feb 13 '17 at 04:31
  • I would post the main portions which you are least sure of, to begin with. Then we can go from there. It helps to have a general idea where you're headed with this as the question and description is rather confusing. – Darkphoton Feb 13 '17 at 04:36
  • Hope this has enough info and clarity to help me. I'm using Java 8, if that makes a difference. – Andre Couture Feb 13 '17 at 04:42
  • I almost understand your requirement here. I just have a quick question. Does your `interoperation` String remain the same, or is it bound to change? I mean, will it always be `{ }` or will it change? – Shyam Baitmangalkar Feb 13 '17 at 07:20
  • The interpretation string is bound to change each time. For example we could have If position 0 used digits. We could also have variable length items but always the same format. The only thing that will be constant is the format. ( ... ) where TYPE could be words or digits. – Andre Couture Feb 13 '17 at 11:57

2 Answers

I'm writing a solution based on the assumption that the format of your interpretation String stays the same, i.e. {<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}.

I'll describe both Java 7 and Java 8 approaches. To be clear, this is a straightforward, naive approach: for every word it loops over every interpretation, so the work grows with (number of words × number of interpretations). I couldn't think of anything faster in a short time.

Let's start walking through the code:

Java-7 style

/*
     * STEP 1: Create a method that accepts wordsExpanded and
     * interpretation Strings
     */
    public static void parseString(String wordsExpanded, String interoperation) {
        /*
         * STEP 2: Remove leading and trailing curly braces from
         * interoperation String
         */
        interoperation= interoperation.replaceAll("\\{", "");
        interoperation = interoperation.replaceAll("\\}", "");

        /*
         * STEP 3: Split your interoperation String at '>'
         * because we need individual interoperations  like
         * "<number_type_2 words" to compare. 
         */
        String[] allInterpretations = interoperation.split(">");

        /*
         * STEP 4: Split your wordsExpanded String at '|'
         * to get each word.
         */
        String[] allWordsExpanded = wordsExpanded.split("\\|");

        /*
         * STEP 5: Create a resultant StringBuilder
         */
        StringBuilder resultBuilder = new StringBuilder();

        /*
         * STEP 6: Iterate over each word from wordsExpanded
         * after splitting.
         */
        for(String eachWordExpanded : allWordsExpanded){
            /*
             * STEP 7: Remove leading and trailing spaces
             */
            eachWordExpanded = eachWordExpanded.trim();
            /*
             * STEP 8: Remove leading and trailing curly braces
             */
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            /*
             * STEP 9: Now, iterate over each interoperation.
             */
            for(String eachInteroperation : allInterpretations){
                /*
                 * STEP 10: Remove the leading and trailing spaces
                 * from each interoperation.
                 */
                eachInteroperation = eachInteroperation.trim();

                /*
                 * STEP 11: Now append '>' to the end of each interoperation
                 * because we split them at '>' previously.
                 */
                eachInteroperation = eachInteroperation + ">";

                /*
                 * STEP 12: Check if each wordExpanded contains any of the
                 * interoperations.
                 */
                if(eachWordExpanded.contains(eachInteroperation)){

                    /*
                     * STEP 13: If the interoperation contains
                     * 'words', goto STEP 14.
                     * ELSE goto STEP 18.
                     */
                    if(eachInteroperation.contains("words")){
                        /*
                         * STEP 14: Remove that interoperation from the
                         * each wordExpanded String.
                         * 
                         * Ex: if the interoperation is <number_type_2 words>
                         * and it is found in the wordExpanded, remove it.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 15: Now change the interoperation to digits.
                         * Ex: IF the interoperation is <number_type_2 words>,
                         * change that to <number_type_2 digits> and also remove them.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("words", "digits");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 16: Remove leading and trailing square braces
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");

                        /*
                         * STEP 17: Remove any numbers in the form ( 3 ),
                         * since we are dealing with words.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                    }else{
                        /*
                         * STEP 18: Remove the interoperation just like STEP 14.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 19: Now, change interoperations to words just like STEP 15,
                         * since we are dealing with digits here and then, remove it from the
                         * each wordExpanded String.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("digits", "words");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 20: Remove the leading and trailing square braces.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                        /*
                         * STEP 21: Remove the words in the form '(thirty four)'
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                    }
                }else{
                    continue;
                }
            }
            /*
             * STEP 22: Build your result object
             */
            resultBuilder.append(eachWordExpanded + "|");
        }
        /*
         * FINAL RESULT
         */
        System.out.println(resultBuilder.toString());
}

The equivalent Java-8 style is as below:

// Requires java.util.Arrays, java.util.Set and java.util.stream.Collectors.
public static void parseString(String wordsExpanded, String interoperation) {
    // Strip the curly braces from the interoperation String.
    interoperation = interoperation.replaceAll("\\{", "");
    interoperation = interoperation.replaceAll("\\}", "");

    // Split at '>' and re-append it, giving entries like "<number_type_2 digits>".
    Set<String> allInterOperations = Arrays.stream(interoperation.split(">"))
        .map(eachInterOperation -> eachInterOperation.trim() + ">")
        .collect(Collectors.toSet());

    String result = Arrays.stream(wordsExpanded.split("\\|"))
        .map(eachWordExpanded -> {
            eachWordExpanded = eachWordExpanded.trim();
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            for (String eachInterOperation : allInterOperations) {
                if (!eachWordExpanded.contains(eachInterOperation)) {
                    continue;
                }
                if (eachInterOperation.contains("words")) {
                    // Drop both tags and the brackets, then every digit and parenthesis.
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachInterOperation = eachInterOperation.replaceAll("words", "digits");
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                } else {
                    // Drop both tags and the brackets, then every letter and parenthesis.
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachInterOperation = eachInterOperation.replaceAll("digits", "words");
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                }
            }
            // trim() drops the trailing space left behind by the words branch.
            return eachWordExpanded.trim();
        })
        .collect(Collectors.joining("|"));

    System.out.println(result);
}

Running the above method with different interoperation Strings such as:

{<number_type_2 words> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 digits> <number_type_0 digits>}
{<number_type_2 words> <number_type_1 digits> <number_type_0 digits>}

produces the following results (Java-7 result):

test|is|thirty four |test|three |one |
test|is|thirty four |test|three |1|
test|is|34|test|3|1|
test|is|34|test|3|one |

(Java-8 Result)

test|is|thirty four|test|three|one
test|is|thirty four|test|three|1
test|is|34|test|3|1
test|is|34|test|3|one

I hope this is what you were trying to achieve.
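
Steps 2, 3, 10 and 11 above recover the individual tags by splitting at '>' and re-appending it. As a hedged alternative, a small regex can read each tag straight into a map from slot name to chosen form. This is only a sketch: the class name `InterpretationParser` and the method `parse` are illustrative and not part of the code above, and it assumes every tag has the shape `<name form>`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InterpretationParser {

    // Matches one tag such as "<number_type_2 digits>" and captures
    // the slot name ("number_type_2") and the chosen form ("digits").
    private static final Pattern TAG = Pattern.compile("<(\\S+)\\s+(\\w+)>");

    // Returns slot name -> chosen form, in the order the tags appear.
    public static Map<String, String> parse(String interpretation) {
        Map<String, String> chosen = new LinkedHashMap<>();
        Matcher m = TAG.matcher(interpretation);
        while (m.find()) {
            chosen.put(m.group(1), m.group(2));
        }
        return chosen;
    }

    public static void main(String[] args) {
        // prints {number_type_2=digits, number_type_1=digits, number_type_0=words}
        System.out.println(parse(
            "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}"));
    }
}
```

With the map in hand, the lookup for a given slot is a single `get` call, and the curly braces never need to be stripped.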

Shyam Baitmangalkar
  • That is great! Thanks. That said, some of the grammar strings could have "one" "two" "three" ... as digits, as opposed to "1" "2" "3". The string would look like this: "test is [(thirty four) {<number_type_0 words>}( three four ) {<number_type_0 digits>}]" – Andre Couture Feb 13 '17 at 17:39
  • Well, in that case, in the digits branch (i.e. the `else` part), remove the line `eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");` and pass `eachWordExpanded` to a word-to-digit converter/parser, which will convert your words to digits. More info on creating a word-to-digit parser can be found here: http://stackoverflow.com/questions/4062022/how-to-convert-words-to-a-number – Shyam Baitmangalkar Feb 14 '17 at 06:33
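
The word-to-digit conversion suggested in the last comment could look roughly like the following when the digits group spells one digit per word, as in "( three four )". This is a minimal sketch under that assumption only. The class `SpelledDigits` and the method `toDigits` are hypothetical names, and full number phrases such as "thirty four" would need a real parser like the ones discussed in the linked question.

```java
import java.util.HashMap;
import java.util.Map;

public class SpelledDigits {

    // Maps "zero".."nine" to the characters '0'..'9'.
    private static final Map<String, Character> DIGITS = new HashMap<>();
    static {
        String[] names = {"zero", "one", "two", "three", "four",
                          "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < names.length; i++) {
            DIGITS.put(names[i], (char) ('0' + i));
        }
    }

    // Converts e.g. " three  four " to "34". Tokens that are already
    // numeric pass through unchanged; anything else is rejected.
    public static String toDigits(String group) {
        StringBuilder out = new StringBuilder();
        for (String token : group.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for a blank input
            }
            Character digit = DIGITS.get(token);
            if (digit != null) {
                out.append(digit.charValue());
            } else if (token.matches("\\d+")) {
                out.append(token);
            } else {
                throw new IllegalArgumentException("Not a digit word: " + token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDigits(" three  four ")); // prints 34
        System.out.println(toDigits(" 3  4 "));        // prints 34
    }
}
```

Rejecting unknown tokens rather than skipping them is a deliberate choice here, so that a group like "( thirty four )" fails loudly instead of silently producing a wrong number.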

Thanks guys. Based on the code from Shyam, I have made a few changes to make it return exactly what I needed.

Here is my new code:

    public static String parseString(String grammar, String interoperation) {
        if (grammar==null || interoperation == null || interoperation.equals("{}"))
            return null;

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("[^\\s\"'\\[]+|\\[([^\\]]*)\\]|'([^']*)'");
        Matcher regexMatcher = regex.matcher(grammar);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            } else if (regexMatcher.group(2) != null) {
                matchList.add(regexMatcher.group(2));
            } else {
                matchList.add(regexMatcher.group());
            }
        } 

        String[] xx = matchList.toArray(new String[0]);
        String wordsExpanded = String.join(" | ",xx);

        interoperation= interoperation.replaceAll("\\{", "")
                                        .replaceAll("\\}", "");

        Set<String> allInterOperations = Arrays.asList(interoperation.split(">"))
            .stream()
            .map(eachInterOperation -> {
            eachInterOperation = eachInterOperation.trim();
            eachInterOperation = eachInterOperation + ">";
            return eachInterOperation;
        }).collect(Collectors.toSet());

        String result = Arrays.asList(wordsExpanded.split("\\|"))
            .stream()
            .map(eachWordExpanded -> {
                eachWordExpanded = eachWordExpanded.trim();
                eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
                eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

                for(String eachInterOperation : allInterOperations){
                    if(eachWordExpanded.contains(eachInterOperation)){
                        Pattern pattern = Pattern.compile("(\\(.*?\\))\\s*(<.*?>)");
                        Matcher matcher = pattern.matcher(eachWordExpanded);
                        while (matcher.find()) {
                            if (matcher.group(2).equals(eachInterOperation)) 
                                eachWordExpanded = matcher.group(1).replaceAll("[\\(\\)]", "").trim();
                        }
                    }else{
                        continue;
                    }
                }
                return eachWordExpanded;
            }).collect(Collectors.joining("|"));

        return result;
    }   

}

The output is as follows:

Input:

interoperation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";

grammar="test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|is|thirty four|test|3|1

Input:

grammar="test is [(thirty four) {<number_type_0 words>}( three  four ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|is|thirty four|test|3|1

Input:

interoperation="{<number_type_4 digits> <number_type_3 digits> <number_type_2 words> <number_type_1 words> <number_type_0 words>}";
grammar="test [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|thirty four|test|three|one

Input:

grammar = "this is my test [(three hundred forty one) {<number_type_0 words>}( 3  4  1 ) {<number_type_0 digits>}] for [(twenty one) {<number_type_1 words>}( 2  1 ) {<number_type_1 digits>}] issues";
interoperation= "{<number_type_1 digits> <number_type_0 words>}";

this|is|my|test|three hundred forty one|for|2 1|issues
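
For comparison, the same selection can be sketched as a single pass over the grammar string: read the interpretation into a map, then replace each bracketed group with the parenthesized text whose tag matches. This is a hedged variant, not the code above. The class `GrammarResolver` and the method `resolve` are illustrative names, it assumes groups are never nested, and it keeps the grammar's original spaces instead of joining with "|".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrammarResolver {

    // One interpretation tag, e.g. "<number_type_0 words>".
    private static final Pattern TAG = Pattern.compile("<(\\S+)\\s+(\\w+)>");
    // One bracketed group "[ ... ]" (assumed not to be nested).
    private static final Pattern GROUP = Pattern.compile("\\[([^\\]]*)\\]");
    // One alternative inside a group: "(text) {<slot form>}".
    private static final Pattern OPTION =
            Pattern.compile("\\(([^)]*)\\)\\s*\\{<(\\S+)\\s+(\\w+)>\\}");

    public static String resolve(String grammar, String interpretation) {
        // slot name -> chosen form, e.g. "number_type_0" -> "words"
        Map<String, String> chosen = new HashMap<>();
        Matcher tag = TAG.matcher(interpretation);
        while (tag.find()) {
            chosen.put(tag.group(1), tag.group(2));
        }

        StringBuffer out = new StringBuffer();
        Matcher group = GROUP.matcher(grammar);
        while (group.find()) {
            // Leave the group untouched if no alternative matches.
            String replacement = group.group();
            Matcher option = OPTION.matcher(group.group(1));
            while (option.find()) {
                if (option.group(3).equals(chosen.get(option.group(2)))) {
                    replacement = option.group(1).trim().replaceAll("\\s+", " ");
                    break;
                }
            }
            group.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        group.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String grammar = "test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}]"
                + " test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}]"
                + " [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";
        String interpretation = "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        System.out.println(resolve(grammar, interpretation)); // prints: test is thirty four test 3 1
    }
}
```

Because the group is only replaced when a tag matches, an interpretation that mentions extra slots (such as number_type_3 in the third input above) is simply ignored.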