
I need help doing something specific with a String in Java. The easiest way for me to explain it is with an example.

I want to extract skip bi-grams from two sentences (user input) and then compare the two results to see how closely the sentences resemble each other.

Sentence #1: "I love green apples."
Sentence #2: "I love red apples."

There is also a variable named "distance" that controls how far apart the two words of a bi-gram may be. (It is not very important at the moment.)

Results

The skip bi-grams extracted from Sentence #1 using a distance of 3 would be:

{I love}, {I green}, {I apples}, {love green}, {love apples}, {green apples}

(Total of 6 bi-grams)

The skip bi-grams extracted from Sentence #2 using a distance of 3 would be:

{I love}, {I red}, {I apples}, {love red}, {love apples}, {red apples}

(Total of 6 bi-grams)

So far I have thought of using a String[] to hold the words of each split sentence.

So my question is: what code would extract those bi-grams from the sentences?

Thanks in advance!

Ant2805

1 Answer


Basically, you want to find every two-word combination in a sentence, keeping the words in their original order.

Here is one solution using an ArrayList:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
    // Strips punctuation, splits on whitespace, and returns every skip bi-gram.
    public static String[][] skipBigrams(String input) {
        String[] tokens = input.replaceAll("[^a-zA-Z\\s]", "").trim().split("\\s+");
        return skipBigrams(tokens);
    }

    // Pairs each token with every token that follows it, preserving word order.
    private static String[][] skipBigrams(String[] tokens) {
        List<String[]> bigrams = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = i + 1; j < tokens.length; j++) {
                bigrams.add(new String[]{tokens[i], tokens[j]});
            }
        }
        // toArray allocates an array of the right size when given an empty one.
        return bigrams.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String s1 = "I love green apples.";
        // Prints [[I, love], [I, green], [I, apples], [love, green], [love, apples], [green, apples]]
        System.out.println(Arrays.deepToString(skipBigrams(s1)));
    }
}
Dan Zheng
  • Great! Now I need to add the distance inside the code. – Ant2805 Nov 24 '16 at 23:34
  • What do you mean by distance? Do you mean the number of bigrams that are different? – Dan Zheng Nov 24 '16 at 23:35
  • The distance is a variable that I must use in the code; it comes from user input. It is an int that determines the distance between the words in the sentence. – Ant2805 Nov 25 '16 at 00:50
  • For example, with a distance of 1, the bi-grams for the first sentence would be {I love}, {love green}, {green apples}, and for the second sentence {I love}, {love red}, {red apples}. – Ant2805 Nov 25 '16 at 00:56
  • You can implement distance by changing the loop conditions in skipBigrams(). Think about what it means to have a distance of 1, 2, or 3 in terms of loop conditions. – Dan Zheng Nov 25 '16 at 01:13
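The loop-condition hint above could be sketched roughly as follows. This is only one possible reading, assuming that "distance" means the maximum number of positions between the two words of a pair, which matches the examples given for distances 1 and 3. The class name SkipBigramDistance and the two-argument skipBigrams helper are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipBigramDistance {
    // Hypothetical helper: pairs tokens[i] with each later token that is
    // at most `distance` positions away (assumed meaning of "distance").
    static List<String[]> skipBigrams(String[] tokens, int distance) {
        List<String[]> bigrams = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // The inner loop stops at the end of the sentence or once the gap exceeds distance.
            for (int j = i + 1; j < tokens.length && j - i <= distance; j++) {
                bigrams.add(new String[]{tokens[i], tokens[j]});
            }
        }
        return bigrams;
    }

    public static void main(String[] args) {
        // Same tokenization as in the answer: drop punctuation, split on whitespace.
        String[] tokens = "I love green apples.".replaceAll("[^a-zA-Z ]", "").split("\\s+");
        for (int d = 1; d <= 3; d++) {
            System.out.println("distance " + d + ": "
                    + Arrays.deepToString(skipBigrams(tokens, d).toArray()));
        }
    }
}
```

For the four-word example sentence, a distance of 1 gives the three adjacent pairs, a distance of 2 gives five pairs, and a distance of 3 gives all six pairs listed in the question. A distance of 0 or less yields no pairs under this reading.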