
I want to split a sentence into separate tokens, but I don't want to strip punctuation that is part of a token. For example, "didn't" should stay as "didn't". Punctuation at the end of a word that is not followed by a letter should be removed, so "you?" should become "you". The same applies at the beginning of a word: "?you" should become "you".

String str = "..Hello ?don't #$you %know?";
String[] strArray = new String[10];

strArray = str.split("[^A-za-z]+[\\s]|[\\s]");
//strArray[strArray.length-1]

for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}

This should print just: Hello0 don't1 you2 know3

Uluc Ozdenvar
  • You have to explain explicitly all the rules. Probably you have two lists : one containing punctuation to keep in words (will contain quote) and another one containing punctuation to ignore (will contain question mark) – jaudo Jan 23 '19 at 16:11
  • This is something that would take a very long, convoluted regex. It would be better to write a parser, or use a parsing library. – ack Jan 23 '19 at 16:24
  • Possible duplicate of [Regular Expressions on Punctuation](https://stackoverflow.com/questions/11705112/regular-expressions-on-punctuation) – locus2k Jan 23 '19 at 16:39

1 Answer


Rather than splitting, prefer Matcher.find() to collect every token that matches this regex:

[a-zA-Z]+(['][a-zA-Z]+)?

This regex allows a single ' sandwiched between letters. To allow other such characters, add them to the character class [']. As written, the optional group can match only once; to let it repeat, change the ? at the end to a *, which makes the group match zero or more times.
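As a rough illustration of the difference between ? and *, the sketch below runs both variants on a word with two apostrophes. The class name TokenQuantifierDemo, the helper tokens, and the sample word are made up for this example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenQuantifierDemo {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> tokens(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String word = "rock'n'roll";
        // '?' allows the apostrophe group at most once, so the word is cut in two
        System.out.println(tokens("[a-zA-Z]+(['][a-zA-Z]+)?", word)); // [rock'n, roll]
        // '*' lets the group repeat, so the whole word stays one token
        System.out.println(tokens("[a-zA-Z]+(['][a-zA-Z]+)*", word)); // [rock'n'roll]
    }
}
```

For the sentence in the question both variants give the same tokens, since no word there contains more than one apostrophe.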

Here is your code, modified accordingly:

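import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collect every match instead of splitting on separators
List<String> tokenList = new ArrayList<>();
String str = "..Hello ?don't #$you %know?";
Pattern p = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");
Matcher m = p.matcher(str);
while (m.find()) {
    tokenList.add(m.group());
}

String[] strArray = tokenList.toArray(new String[0]);

// Print each token followed by its index
for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}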

Prints:

Hello0
don't1
you2
know3
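If you are on Java 9 or later, the find loop can probably be condensed with Matcher.results(), which streams the successive matches. This is only a sketch of that alternative; the class name MatchStreamDemo and the method name tokenize are invented for illustration.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class MatchStreamDemo {
    // Same token regex as above, compiled once and reused
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");

    // Hypothetical helper: returns the tokens of input in order (requires Java 9+)
    static String[] tokenize(String input) {
        return TOKEN.matcher(input)
                .results()                 // Stream<MatchResult>, one element per match
                .map(MatchResult::group)   // the matched text of each result
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] strArray = tokenize("..Hello ?don't #$you %know?");
        for (int i = 0; i < strArray.length; i++) {
            System.out.println(strArray[i] + i);
        }
    }
}
```

On Java 8 this will not compile, because Matcher.results() was added in Java 9; the explicit while loop works on both.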

However, if you insist on using the split method, you can split on this regex (shown as it appears in a Java string literal, hence the doubled backslash):

[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+

It splits the string either on one or more whitespace characters, optionally surrounded by non-letter characters, or on a run of one or more characters that are neither letters nor single quotes. Here is sample Java code using split:

import java.util.Arrays;

String str = "..  Hello ?don't #$you %know?";
// Split on the separators, then drop any empty tokens that split produces
String[] strArray = Arrays.stream(str.split("[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+"))
        .filter(x -> !x.isEmpty())
        .toArray(String[]::new);

for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}

Prints:

Hello0
don't1
you2
know3

Note that I used the filter method on the stream to drop zero-length tokens, since split may produce an empty token at the start of the array.
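To see why the filter matters, here is a small sketch that calls split without any filtering. The class name SplitLeadingEmptyDemo and the helper rawSplit are made up for this illustration; the behavior shown assumes Java 8 or later.

```java
import java.util.Arrays;

class SplitLeadingEmptyDemo {
    // The same separator regex as in the split example
    static final String SEPARATOR = "[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+";

    // Hypothetical helper: plain split, with no filtering of empty tokens
    static String[] rawSplit(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // The leading ".." is a separator, so split reports an empty token before it
        System.out.println(Arrays.toString(rawSplit("..Hello ?don't"))); // [, Hello, don't]
        // Trailing empty tokens are discarded by split itself
        System.out.println(Arrays.toString(rawSplit("know?")));          // [know]
    }
}
```

So only the leading separator needs special handling; split already removes trailing empty strings when called without a limit argument.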

Pushpesh Kumar Rajwanshi
  • Sorry about not mentioning this in my initial question, but how would one go about adding numbers to this regex? Just add "0-9"? – Uluc Ozdenvar Jan 23 '19 at 21:49
  • @UlucOzdenvar: By "adding numbers", do you mean you want to retain digits in addition to letters? You can include `\d` wherever you have `a-zA-Z`, so the regex for the match-based solution becomes `[a-zA-Z\d]+(['][a-zA-Z\d]+)?` – Pushpesh Kumar Rajwanshi Jan 24 '19 at 05:01
  • @UlucOzdenvar: Did you not find something in my answer you were looking for due to which you unaccepted my answer? May I help you with something that goes unsolved? – Pushpesh Kumar Rajwanshi Jan 24 '19 at 19:34
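A minimal sketch of the digit-aware variant suggested in the comments, using the same find loop as the answer. The class name DigitTokenDemo, the method name tokenize, and the sample sentence with digits are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitTokenDemo {
    // \d added next to a-zA-Z; by default it matches the ASCII digits 0-9 in Java
    private static final Pattern TOKEN =
            Pattern.compile("[a-zA-Z\\d]+(['][a-zA-Z\\d]+)?");

    // Hypothetical helper: returns every letter/digit token of input in order
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("..Hello ?don't #$you2 %know? 42"));
        // [Hello, don't, you2, know, 42]
    }
}
```

Writing `0-9` inside the character class instead of `\d` should behave the same here.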