I am using the Apache OpenNLP toolkit in Java. I wish to display only the named entities in a given text (geographical, person, etc.). The following code snippet gives me spans rather than strings:

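import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

// Inside a method:
System.out.println("Input : Pierre Vinken is 61 years old");
// try-with-resources closes the model stream once the model is loaded
try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(model);
    String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};

    // Each span holds token indices: start inclusive, end exclusive
    Span[] nameSpans = nameFinder.find(sentence);
    for (Span s : nameSpans) {
        System.out.println("Name Entity : " + s);
    }
} catch (IOException e) {
    e.printStackTrace();
}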

Output :

Input : Pierre Vinken is 61 years old
Name Entity : [0..2) person

How can I get the equivalent string rather than the span? Is there an API for that?

Akash
  • You have the tokenized sentence and the list indices where the name appears. Just get the relevant slice of the list of tokens and [join it](http://stackoverflow.com/questions/1751844/java-convert-liststring-to-a-joind-string). – Dan Jan 22 '15 at 17:07
  • Is there a standard API to get a slice? I know I can get the tokens from 0 to 2 using a for loop. – Akash Jan 22 '15 at 17:30
  • [boon](https://github.com/boonproject/boon/wiki/Boon-Slice-Notation-for-List,-Set,-Map,-and-primitive-arrays) seems to be the way to go. – Dan Jan 22 '15 at 23:34

2 Answers

Span does have a getCoveredText(CharSequence text) method, but it reads the span's offsets as character positions in the text, while the spans returned by NameFinderME.find are token indices, so it is not a direct fit here. In any case, you don't really need an API method for this: a span provides start (inclusive) and end (exclusive) integer offsets, so the following suffices:

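// s is a Span from nameFinder.find(sentence); its offsets index the token array
StringBuilder builder = new StringBuilder();
for (int i = s.getStart(); i < s.getEnd(); i++) {
    builder.append(sentence[i]).append(" ");
}
String name = builder.toString().trim(); // drop the trailing space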
Chthonic Project
  • Yes, I know that. But I was expecting a direct API that returns the string for a given span, the way Python lets me get the value directly by slicing. – Akash Jan 27 '15 at 16:25
  • In general, if I need to use a method repeatedly, and for some reason it is not there in the API, I add it to my own utility class to save myself some trouble. E.g., you could write a class called `OpenNLPUtils`, and add methods. Over the years, I've found this to be quite useful, especially for larger NLP projects. – Chthonic Project Jan 27 '15 at 19:02
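
A minimal sketch of what such a utility class might look like, using only the standard library. The class name `OpenNLPUtils` comes from the comment above; the method name `coveredTokens` and its signature are assumptions made for illustration and are not part of OpenNLP. It takes plain `int` offsets so it runs without the OpenNLP jar; with a real span you would pass `s.getStart()` and `s.getEnd()`.

```java
import java.util.Arrays;

/** Hypothetical helper class; not part of OpenNLP. */
final class OpenNLPUtils {

    private OpenNLPUtils() { }

    /**
     * Joins tokens[start..end) with single spaces.
     * start is inclusive and end is exclusive, like Span.getStart()/getEnd().
     */
    static String coveredTokens(String[] tokens, int start, int end) {
        // Arrays.copyOfRange pads with nulls when end > tokens.length,
        // so reject out-of-range offsets explicitly.
        if (start < 0 || end > tokens.length || start > end) {
            throw new IndexOutOfBoundsException("bad range [" + start + ".." + end + ")");
        }
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
        // With a real Span s: coveredTokens(sentence, s.getStart(), s.getEnd())
        System.out.println(coveredTokens(sentence, 0, 2)); // prints "Pierre Vinken"
    }
}
```

This is about as close as plain Java gets to Python-style slicing of a token array: `Arrays.copyOfRange` takes the slice and `String.join` (Java 8+) glues it back together.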
You can use the Span class itself.

The following instance method returns the CharSequence that corresponds to the Span instance within another CharSequence text:

/**
 * Retrieves the string covered by the current span of the specified text.
 *
 * @param text
 *
 * @return the substring covered by the current span
 */
public CharSequence getCoveredText(CharSequence text) { ... }

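Notice that this class also has two static methods that accept an array of Span and, respectively, a CharSequence or an array of tokens (String[]), and return the equivalent array of String. For spans returned by NameFinderME.find, which index tokens rather than characters, the String[] overload is the one that applies.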

/**
 * Converts an array of {@link Span}s to an array of {@link String}s.
 *
 * @param spans
 * @param s
 * @return the strings
 */
public static String[] spansToStrings(Span[] spans, CharSequence s) {
    String[] tokens = new String[spans.length];

    for (int si = 0, sl = spans.length; si < sl; si++) {
        tokens[si] = spans[si].getCoveredText(s).toString();
    }

    return tokens;
}

public static String[] spansToStrings(Span[] spans, String[] tokens) { ... }
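
To see why the choice of overload matters, here is a small stand-alone sketch that applies the span [0..2) from the question in both ways. It does not call OpenNLP: `byCharOffsets` and `byTokenOffsets` are hypothetical stand-ins, written on the assumption that the CharSequence variant takes a plain subsequence by character offsets (as its Javadoc wording suggests) and that the String[] variant joins the covered tokens with spaces.

```java
final class SpanOffsetDemo {

    /** Stand-in for the CharSequence variant: offsets are character positions. */
    static String byCharOffsets(CharSequence text, int start, int end) {
        return text.subSequence(start, end).toString();
    }

    /** Stand-in for the String[] variant: offsets are token indices. */
    static String byTokenOffsets(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (sb.length() > 0) {
                sb.append(' '); // separator only between tokens, no trailing space
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Pierre Vinken is 61 years old .";
        String[] tokens = text.split(" ");

        // The name finder in the question reported the span [0..2)
        System.out.println(byCharOffsets(text, 0, 2));    // prints "Pi"
        System.out.println(byTokenOffsets(tokens, 0, 2)); // prints "Pierre Vinken"
    }
}
```

If those assumptions hold, passing the tokenized sentence (not the raw text) together with the name finder's spans is what yields the entity strings.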

I hope it helps...

Stefano Bragaglia