I am trying to split a paragraph into sentences. The paragraph can contain a word like F.C.B, and it also includes HTML tags such as anchors. I tried the expression below, but it does not split the paragraph cleanly into sentences while leaving the HTML tags as they are.

String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])");  

Can anyone help me with a better regular expression, or suggest another approach?

MWiesner
user3259978
  • I also tried this, but it didn't work: String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])"); – user3259978 Jun 08 '16 at 21:36
  • **Use BreakIterator**. This is probably a duplicate question; [this one explains it best](http://stackoverflow.com/questions/2687012/split-string-into-sentences) – Vivek Vashishta Jun 08 '16 at 21:58
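
The BreakIterator suggestion in the comment above can be sketched as follows. This is a minimal illustration, not code from the thread: the class name `BreakIteratorSentences` and the method `split` are hypothetical, and `Locale.ENGLISH` is an assumption about the input. Note that `java.text.BreakIterator` is not HTML-aware, so tags are treated as ordinary text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits plain text into sentences with java.text.BreakIterator.
class BreakIteratorSentences {

    static List<String> split(String text) {
        // Locale.ENGLISH is an assumption; pick the locale that matches the text.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you? Fine.")) {
            System.out.println(s);
        }
    }
}
```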

2 Answers


You can try this:

String par = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.";
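// Requires java.util.regex.Pattern and java.util.regex.Matcher.
// Each match is intended to be one sentence, ending at a '.', '!' or '?' that is followed by whitespace or end of input.
Pattern pattern = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher matcher = pattern.matcher(par);
while (matcher.find()) {
    System.out.println(matcher.group()); // print each matched sentence on its own line
}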

Let me know if it works.

Seek Addo
  • @RoYoMin are you having a problem with the ")"? You can escape them or ignore them with the special escape – Seek Addo Jun 09 '16 at 06:59
  • @RoYoMi it should work; it is just that your text contains HTML. See the result here: [link here](https://regex101.com/r/oS4sX2/4). Do you need the HTML in your text to format it? – Seek Addo Jun 09 '16 at 07:07

Description

Rather than splitting on delimiter characters, it is easier to match and capture each sentence as a substring.

(?:<(?:(?:[a-z]+\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?|\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\S)))*)+[.?!]

This regular expression will do the following:

  • Match each sentence
  • Allow substrings like F.C.B
  • Skip over HTML tags when looking for the end of a sentence, but include them in the match

Note: In a Java string literal you'll need to escape every \ as \\ and every " as \".
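
As a small illustration of the escaping note above, here is one fragment of the expression, `[.?!](?=\S)`, written as a Java string literal. This is only a sketch: the class name `EscapeDemo`, the constant, and the method `hasInnerTerminator` are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo of how a regex backslash is written inside a Java string literal.
class EscapeDemo {

    // Regex as written in the answer:  [.?!](?=\S)
    // Java source needs \\ for each \, so the literal below holds exactly that regex.
    static final String TERMINATOR_BEFORE_NON_SPACE = "[.?!](?=\\S)";

    // True when the text has a '.', '?' or '!' directly followed by a non-whitespace character.
    static boolean hasInnerTerminator(String text) {
        return Pattern.compile(TERMINATOR_BEFORE_NON_SPACE).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(TERMINATOR_BEFORE_NON_SPACE);      // shows the regex with single backslashes
        System.out.println(hasInnerTerminator("F.C.B"));      // dots followed by letters
        System.out.println(hasInnerTerminator("End. Next"));  // dot followed by a space
    }
}
```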

Example

Live Demo

https://regex101.com/r/fJ9zS0/3

Sample text

I am was trying to split paragraph to sentences. The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags. I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November. He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.

Sample Matches

Java Code Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    String sourcestring = " ----your source string goes here----- ";
    // Each match is one sentence; HTML tags inside the sentence are kept as-is.
    Pattern re = Pattern.compile("(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // The pattern has no capturing groups, so only group 0 (the whole match) is printed.
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

Sample Output

$matches Array:
(
    [0] => Array
        (
            [0] => I am was trying to split paragraph to sentences.
            [1] =>  The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags.
            [2] =>  I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
            [3] => 

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November.
            [4] =>  He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
        )
    )
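
The sample matches above begin with the whitespace that separated them from the previous sentence. A minimal sketch, assuming trimmed sentences in a list are what is wanted: the class name `SentenceCollector`, the method `sentences`, and the sample string in `main` are hypothetical, while the pattern is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects each matched sentence into a list, trimmed.
class SentenceCollector {

    // Same expression as in the answer, escaped for a Java string literal.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)"
        + "|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]",
        Pattern.CASE_INSENSITIVE);

    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            // group() is the whole match; trim() drops the leading separator whitespace.
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "He plays for F.C.B today. "
            + "See <a href=\"http://x.org/a.\">this</a> page. Done!";
        for (String s : sentences(sample)) {
            System.out.println(s);
        }
    }
}
```

Because the pattern nests quantifiers, text with a long stretch that never reaches a terminator may backtrack heavily, so this sketch is probably best suited to short, well-formed paragraphs.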

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
----------------------------------------------------------------------
          [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ='                       '=\''
----------------------------------------------------------------------
          [^']*                    any character except: ''' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          '                        '\''
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ="                       '="'
----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          "                        '"'
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          =                        '='
----------------------------------------------------------------------
          [^'"\s]*                 any character except: ''', '"',
                                   whitespace (\n, \r, \t, \f, and "
                                   ") (0 or more times (matching the
                                   most amount possible))
----------------------------------------------------------------------
        )*                       end of grouping
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        \s?                      whitespace (\n, \r, \t, \f, and " ")
                                 (optional (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        \/                       '/'
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <                        '<'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [^.?!]                   any character except: '.', '?', '!'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
        (?=                      look ahead to see if there is:
----------------------------------------------------------------------
          \S                       non-whitespace (all but \n, \r,
                                   \t, \f, and " ")
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
Ro Yo Mi