I want to split a given sentence of type string into words and I also want punctuation to be added to the list.

For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbor, .]

With string.split(" ") I can split the sentence in words by space, but I want the punctuation also to be in the result list.

    String text = "Sara's dog 'bit' the neighbor.";
    String[] list = text.split(" ");

The printed result is [Sara's, dog, 'bit', the, neighbor.]. I don't know how to combine another regex with the above split method so that the punctuation is separated as well.

Some references I have already tried, but they didn't work out:

1. Splitting strings through regular expressions by punctuation and whitespace etc in java

2. How to split sentence to words and punctuation using split or matcher?

Example input and outputs

String input1="Holy cow! screamed Jane."

String[] output1 = [Holy,cow,!,screamed,Jane,.] 

String input2="Select your 'pizza' topping {pepper and tomato} follow me."

String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]
Aziz.G
Sabarinathan
  • One solution is to write a custom function to do this. – Code-Apprentice Sep 09 '19 at 16:55
  • do you have any reference for a sample like this? – Sabarinathan Sep 09 '19 at 17:01
  • No reference is needed. You have to come up with it yourself. If I were solving this problem, I would start by turning off my computer. Then I would get a notebook and a pen and write down **in words** the steps I need to take to solve the problem. Once I have a clear idea of those steps, then I would translate those words into Java. – Code-Apprentice Sep 09 '19 at 17:03
  • Translating from a language of people to a language of the machine is a large part of the job of a computer programmer. This requires the first step of explaining the solution in natural human language. – Code-Apprentice Sep 09 '19 at 17:05
  • In both examples it becomes three elements. I have edited the question. 'bit' will become [',bit,'] and the word 'pizza' will become [',pizza,']. – Sabarinathan Sep 09 '19 at 17:26
  • @Abra In the first part, 'bit' is also 3 elements. You're incorrectly looking at the code block, which shows the OP's failed attempt, not the desired outcome. – Andreas Sep 09 '19 at 17:26

3 Answers

Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.

Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:

String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);

In Java 8 or earlier, you would write it like this:

List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}

Explanation

\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".

\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.

\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.

That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
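To make the rules above concrete, here is a minimal sketch that applies the same regex to inputs with embedded punctuation, doubled punctuation, and symbols. The class name `CaptureSketch`, the method name `tokenize`, and the sample sentences are assumptions made for illustration; only the regex comes from the answer. It uses the `find()` loop so it also runs on Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class CaptureSketch {
    // Same pattern as in the answer: a "word" with single embedded
    // punctuation characters, or one lone punctuation/symbol character.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]");

    static List<String> tokenize(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // '.' and ''' are embedded inside words, so "3.50" and "don't" stay whole;
        // '$' is a symbol and ',' is not followed by a word character.
        System.out.println(tokenize("It costs $3.50, don't you think?"));
        // prints [It, costs, $, 3.50, ,, don't, you, think, ?]

        // Only a single punctuation character may be embedded, and '+' is a symbol.
        System.out.println(tokenize("a--b x+y"));
        // prints [a, -, -, b, x, +, y]
    }
}
```

Whether "3.50" or "don't" should stay in one piece depends on the use case; with this pattern they do, because the punctuation sits between two word characters.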

Test

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }
    // Prints every token the regex captures from s (Matcher.results() needs Java 9+).
    private static void test(String s) {
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}

Output

[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
Andreas
List<String> parts = Arrays.stream( s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))") )
.filter(ss -> !ss.trim().isEmpty())
.collect(Collectors.toList());
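As a rough, self-contained sketch of how this lookaround split behaves, the snippet below wraps it in a method and runs it on two of the question's sentences. The class name `LookaroundSplitSketch` and the method name `tokenize` are assumptions for illustration. Note that `\p{Punct}` includes the apostrophe, so a word such as "Sara's" is split into three elements, which differs from the output the question asks for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class LookaroundSplitSketch {
    static List<String> tokenize(String s) {
        // Split at every position that is preceded or followed by
        // whitespace or ASCII punctuation, keeping the delimiters,
        // then drop the elements that are only whitespace or empty.
        return Arrays.stream(s.split("(?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}])"))
                .filter(part -> !part.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane."));
        // prints [Holy, cow, !, screamed, Jane, .]

        System.out.println(tokenize("Sara's dog 'bit' the neighbor."));
        // prints [Sara, ', s, dog, ', bit, ', the, neighbor, .]
    }
}
```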

Reference:

How to split a string, but also keep the delimiters?

Regular Expressions on Punctuation

ckedar
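ArrayList<String> chars = new ArrayList<String>();
String str = "Hello my name is bob";
String tempStr = "";
for (char cha : str.toCharArray()) {
  if (cha == ' ') {
    // A space ends the current word.
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
    }
    tempStr = "";
  }
  // List whichever punctuation characters you want as separate elements.
  else if (cha == '!' || cha == '.') {
    // Flush the word before the punctuation, then add the punctuation itself.
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
    }
    tempStr = "";
    chars.add(String.valueOf(cha));
  }
  else {
    tempStr = tempStr + cha;
  }
}
// Add the last word, which has no trailing space.
if (!tempStr.isEmpty()) {
  chars.add(tempStr);
}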

Something like that? It adds every word, assuming the words in the sentence are separated by spaces. For !'s and .'s, you have to add a check as well. Quite simple.
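The character-by-character idea can also be sketched as a reusable method that takes the set of punctuation characters as a parameter. This is only one possible shape under stated assumptions: the class name `CharScanSketch`, the method name `tokenize`, and the `punctuation` parameter are hypothetical and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class CharScanSketch {
    // `punctuation` lists the characters to emit as separate tokens.
    static List<String> tokenize(String text, String punctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char ch : text.toCharArray()) {
            boolean isPunct = punctuation.indexOf(ch) >= 0;
            if (Character.isWhitespace(ch) || isPunct) {
                // Whitespace or punctuation ends the current word, if any.
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (isPunct) {
                    tokens.add(String.valueOf(ch));
                }
            } else {
                word.append(ch);
            }
        }
        // Add the last word when the text does not end in whitespace or punctuation.
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane.", "!."));
        // prints [Holy, cow, !, screamed, Jane, .]
    }
}
```

If the apostrophe is added to `punctuation`, a word such as "Sara's" is split into three tokens; keeping word-internal apostrophes would need an extra check on the neighboring characters.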