1

Is there a library that has a routine for truncating a string after n words? I'm looking for something that can turn:

truncateAfterWords(3, "hello, this\nis a long sentence");

into

"hello, this\nis"

I could write it myself, but I thought that something like this might already exist in some open source string manipulation library.


Here is a full list of test cases that I would expect any solution to pass:

import java.util.regex.*;

public class Test {

    private static final TestCase[] TEST_CASES = new TestCase[]{
        new TestCase(5, null, null),
        new TestCase(5, "", ""),
        new TestCase(5, "single", "single"),
        new TestCase(1, "single", "single"),
        new TestCase(0, "single", ""),
        new TestCase(2, "two words", "two words"),
        new TestCase(1, "two words", "two"),
        new TestCase(0, "two words", ""),
        new TestCase(2, "line\nbreak", "line\nbreak"),
        new TestCase(1, "line\nbreak", "line"),
        new TestCase(2, "multiple  spaces", "multiple  spaces"),
        new TestCase(1, "multiple  spaces", "multiple"),
        new TestCase(3, " starts with space", " starts with space"),
        new TestCase(2, " starts with space", " starts with"),
        new TestCase(10, "A full sentence, with puncutation.", "A full sentence, with puncutation."),
        new TestCase(4, "A full sentence, with puncutation.", "A full sentence, with"),
        new TestCase(50, "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input.", "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input."),
    };

    public static void main(String[] args){
        for (TestCase t: TEST_CASES){
            try {
                String r = truncateAfterWords(t.n, t.s);
                if (!t.equals(r)){
                    System.out.println(t.toString(r));
                }
            } catch (Exception x){
                System.out.println(t.toString(x));
            }       
        }   
    }

    public static String truncateAfterWords(int n, String s) {
        // TODO: implementation
        return null;
    }
}


class TestCase {
    public int n;
    public String s;
    public String e;

    public TestCase(int n, String s, String e){
        this.n=n;
        this.s=s;
        this.e=e;
    }

    public String toString(){
        return "truncateAfterWords(" + n + ", " + toJavaString(s) + ")\n  expected: " + toJavaString(e);
    }

    public String toString(String r){
        return this + "\n  actual:   " + toJavaString(r) + "";
    }

    public String toString(Exception x){
        return this + "\n  exception: " + x.getMessage();
    }    

    public boolean equals(String r){
        if (e == null && r == null) return true;
        if (e == null) return false;
        return e.equals(r);
    }   

    public static final String escape(String s){
        if (s == null) return null;
        s = s.replaceAll("\\\\","\\\\\\\\");
        s = s.replaceAll("\n","\\\\n");
        s = s.replaceAll("\r","\\\\r");
        s = s.replaceAll("\"","\\\\\"");
        return s;
    }

    private static String toJavaString(String s){
        if (s == null) return "null";
        return " \"" + escape(s) + "\"";
    }
}

There are solutions for this on this site in other languages:

Stephen Ostermiller
  • I don't think there is functionality like this; it looks like something very particular. – Luiggi Mendoza Apr 11 '13 at 17:44
  • You can use split(), split words at " ", and then count them and when they exceed 3, discard the rest. But no, I have never come across anything like this already made. – Nico Apr 11 '13 at 17:44
  • I thought about split, but it tends to throw away the thing you split on. I want to preserve the spaces and new lines in the string. – Stephen Ostermiller Apr 11 '13 at 17:45
  • Instead of using `String.split()`, I would prefer to use the `Scanner` class's `next()`. Read more at this [link](http://stackoverflow.com/questions/736654/javas-scanner-vs-string-split-vs-stringtokenizer-which-should-i-use) – ajduke Apr 11 '13 at 17:58
  • My answer below will work fine with your edited input string `hello, this\nis a long sentence` as well. – anubhava Apr 11 '13 at 18:24
  • @StephenOstermiller: If it works then don't forget to mark it accepted whenever you can :P – anubhava Apr 11 '13 at 19:14
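The concern about split() raised in the comments above can be illustrated with a short sketch. The class and method names here (SplitDemo, firstWordsViaSplit) are hypothetical and exist only for this illustration. Splitting on whitespace and re-joining loses the original separators, so the line break in the example input comes back as a plain space:

```java
import java.util.Arrays;

public class SplitDemo {

    // Hypothetical helper: keeps the first n tokens produced by split().
    // The original whitespace is discarded by split(), so the tokens can
    // only be re-joined with a separator chosen here (a single space).
    public static String firstWordsViaSplit(int n, String s) {
        String[] parts = s.split("\\s+");
        int count = Math.min(n, parts.length);
        return String.join(" ", Arrays.copyOf(parts, count));
    }

    public static void main(String[] args) {
        // The "\n" between "this" and "is" is replaced by a space
        System.out.println(firstWordsViaSplit(3, "hello, this\nis a long sentence"));
    }
}
```

This is why a solution that works on positions in the original string (and then calls substring) is a better fit for the stated requirement than one that rebuilds the string from tokens.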

4 Answers

4

You can use a simple regex-based solution:

private String truncateAfterWords(int n, String str) {
   return str.replaceAll("^((?:\\W*\\w+){" + n + "}).*$", "$1");    
}

Live Demo: http://ideone.com/Nsojc7

Update, based on your comments about performance issues: use the following method for faster performance when dealing with a large number of words:

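// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

// Zero-width match at the end of each word (a word boundary preceded by a word character)
private static final Pattern WB_PATTERN = Pattern.compile("(?<=\\w)\\b");

private String truncateAfterWords(int n, String s) {
   if (s == null) return null;
   if (n <= 0) return "";
   Matcher m = WB_PATTERN.matcher(s);
   // Advance to the end of the n-th word; if the input has fewer words, keep it whole
   for (int i = 0; i < n; i++) {
      if (!m.find()) return s;
   }
   return s.substring(0, m.end());
}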
anubhava
  • Unfortunately, the performance of this solution is problematic. Here is a test case that appears to go into an infinite loop: `truncateAfterWords(50, "Testing test testing as a test of testing testing more test.")` – Stephen Ostermiller Apr 16 '13 at 18:48
  • That doesn't compile -- start is not defined. I thought you might have meant m.start() instead, but that throws an exception when it terminates because no more matches were found. – Stephen Ostermiller Apr 16 '13 at 23:46
  • I got a version similar to your second solution working and posted it as a solution here: http://stackoverflow.com/a/16049290/1145388 – Stephen Ostermiller Apr 17 '13 at 00:07
  • Oh sorry, it was very late night for me, actually it was supposed to be `m.end()`. Made another edit, pls check it now. – anubhava Apr 17 '13 at 03:03
  • I included a full set of test cases in the question with a test harness. This solution still fails several of them throwing a "No match available" exception. – Stephen Ostermiller Apr 17 '13 at 08:39
  • @StephenOstermiller: I think this practice of un-accepting an answer based on post edits is unfair since my original answer was for your original question. In case you want to expand your problem you can do so by creating a new question with a reference to this question. – anubhava Apr 17 '13 at 08:42
  • I was looking for an answer that I could actually use in an application. Not a solution that worked in a narrow set of cases. It turns out that this is not as easy of a question as it would appear on the surface, and almost all of the proposed solutions have been problematic as I have discovered when I put real world data in. I appreciate your willingness to come back and improve your answer, but I can't accept an answer that doesn't actually work in all cases. – Stephen Ostermiller Apr 17 '13 at 08:48
  • It is of course your prerogative to accept or not accept. I am just objecting to changing the problem itself. If problem is complex then describe it well upfront to get better targeted answer. btw I had tested my latest edited answer with all of your test cases and it appeared to work on all of them. – anubhava Apr 17 '13 at 17:06
2

I found a way to do it using the java.text.BreakIterator class:

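// Requires: import java.text.BreakIterator;

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(s);
    int pos = 0;
    // Step through the word boundaries, counting only segments that start with a letter
    for (int i = 0; i < n && pos != BreakIterator.DONE && pos < s.length();) {
        if (Character.isLetter(s.codePointAt(pos))) i++;
        pos = wb.next();
    }
    // Fewer than n words before the end of the input: return it unchanged
    if (pos == BreakIterator.DONE || pos >= s.length()) return s;
    return s.substring(0, pos);
}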
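To see why the loop above only counts a segment when it starts with a letter, it can help to list the segments that the word BreakIterator produces. The following is a small sketch using the standard first()/next() iteration idiom. The class and method names (WordBreakDemo, segments) are made up for this illustration, and the exact segmentation may depend on the default locale:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordBreakDemo {

    // Hypothetical helper: returns the text between consecutive word boundaries.
    // Words, punctuation, and whitespace all come back as segments.
    public static List<String> segments(String s) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(s);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each segment in brackets so whitespace segments are visible
        for (String segment : segments("hello, this\nis a long sentence")) {
            System.out.println("[" + segment.replace("\n", "\\n") + "]");
        }
    }
}
```

Because punctuation and whitespace are reported as segments of their own, counting every boundary would overcount words, which is what the Character.isLetter check guards against. Note that this check also means a token that starts with a digit is not counted as a word.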
Stephen Ostermiller
0

Here is a version that uses a regular expression to find the next run of whitespace in a loop until it has enough words. It is similar to the BreakIterator solution, but uses a regular expression to iterate over the word breaks.

// Any number of white space or the end of the input
private final static Pattern SPACES_PATTERN = Pattern.compile("\\s+|\\z");

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    Matcher matcher = SPACES_PATTERN.matcher(s);
    int matchStartIndex = 0, matchEndIndex = 0, wordsFound = 0;
    // Keep matching until enough words are found, 
    // reached the end of the string, 
    // or no more matches
    while (wordsFound<n && matchEndIndex<s.length() && matcher.find(matchEndIndex)){
        // Keep track of both the start and end of each match
        matchStartIndex = matcher.start();
        matchEndIndex = matchStartIndex + matcher.group().length();
        // Only increment words found when not at the beginning of the string
        if (matchStartIndex != 0) wordsFound++;
    }
    // From the beginning of the string to the start of the final match
    return s.substring(0, matchStartIndex);
}
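For comparison, the same whitespace-delimited idea can be sketched without a regular expression by scanning the characters once. This is an alternative written for illustration, not the code from the answer above; the class name WordTruncator is hypothetical, and a "word" is assumed to be any run of non-whitespace characters as defined by Character.isWhitespace:

```java
public class WordTruncator {

    // Keeps the first n whitespace-delimited words of s, preserving the
    // original spaces and line breaks between them.
    public static String truncateAfterWords(int n, String s) {
        if (s == null) return null;
        if (n <= 0) return "";
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < s.length(); i++) {
            boolean whitespace = Character.isWhitespace(s.charAt(i));
            if (!whitespace && !inWord) {
                inWord = true;   // a new word starts at index i
            } else if (whitespace && inWord) {
                inWord = false;  // the current word ended just before index i
                if (++words == n) return s.substring(0, i);
            }
        }
        // The n-th word was never followed by whitespace: nothing to cut off
        return s;
    }

    public static void main(String[] args) {
        System.out.println(truncateAfterWords(3, "hello, this\nis a long sentence"));
    }
}
```

The scan is a single pass with no backtracking, so its running time grows linearly with the input length. Unlike the regex version above, it leaves trailing whitespace in place when the input has n or fewer words.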
Stephen Ostermiller
-1

Try using regular expressions in Java. The regex to retrieve only n words is: (.*?\s){n}.

Try using the code:

String inputStr= "hello, this\nis a long sentence";
Pattern pattern = Pattern.compile("(.*?[\\s]){3}", Pattern.DOTALL); 
Matcher matcher = pattern.matcher(inputStr);
matcher.find(); 
String result = matcher.group(); 
System.out.println(result);

To know more about packages:

Stephen Ostermiller
  • Good idea, but that regex doesn't work for me. This produces no output: `Matcher m = Pattern.compile("(.*?\\b){3}").matcher("hello, this is a long sentence");m.find();System.out.println(m.group(0));` – Stephen Ostermiller Apr 11 '13 at 17:59
  • Use this code @StephenOstermiller: It worked .... String inputStr= "hello, this is a long sentence"; Pattern pattern = Pattern.compile("(.*?[\\s\\n]){3}", Pattern.DOTALL); Matcher matcher = pattern.matcher(inputStr); matcher.find(); String result = matcher.group(); System.out.println(result); – Srivatsa Jenni Apr 11 '13 at 18:50
  • I wrote a full set of test cases and added it to the question. This solution fails several of them as well as goes into an infinite loop on long input. – Stephen Ostermiller Apr 17 '13 at 08:36
  • Sorry for the delayed response. Use the regex (.*[\\s\\n]{1,}). The previous version was an example of how to work things out, not a full-fledged regex. Thanks – Srivatsa Jenni Apr 26 '13 at 11:58
  • There was a typo. Use this regex: (.*?[\\s\\n]){1,} – Srivatsa Jenni Apr 30 '13 at 09:15