9

I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.

Example of input:

Hello.World

This.Is.A.Test

The.S.W.A.T.Team

S.W.A.T.

s.w.a.t.

2001.A.Space.Odyssey

Wanted output:

Hello World

This Is A Test

The SWAT Team

SWAT

swat

2001 A Space Odyssey

Essentially, I'd like to create something that's capable of splitting strings by dots, but at the same time handles abbreviations.

My definition of an abbreviation is something that has at least two characters (casing irrelevant) and two dots, i.e. "A.B." or "a.b.". It shouldn't work with digits, i.e. "1.a.".

I've tried all kinds of things with regex, but it isn't exactly my strong suit, so I'm hoping that someone here has any ideas or pointers that I can use.

Michell Bak
  • 5
    What is your logic for determining an abbreviation vs. a word? In other words can you explain your real world criteria for determining this? Specifically your biggest edge case is probably around one-letter words `A` and `I`. – Mike Brant Jun 13 '13 at 23:26
  • Sorry, forgot to add that. Just added. – Michell Bak Jun 13 '13 at 23:28
  • 1
    I see your definition but am wondering if it should really be either start of line-letter-dot-letter-dot `^[A-Z]\.[A-Z]\.` or dot-letter-dot-letter-dot `\.[A-Z]\.[A-Z]\.` Do abbreviations have to be upper case? – Mike Brant Jun 13 '13 at 23:32
  • Preferably, both should be supported, i.e. abbreviations at the start of the string and the middle or end. It'd be great if it works with both lowercase and uppercase, but it isn't that important. – Michell Bak Jun 13 '13 at 23:45

2 Answers

11

How about first using a regex to remove the dots that need to disappear, and then replacing the remaining dots with spaces? The regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).

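String[] data = {
        "Hello.World",
        "This.Is.A.Test",
        "The.S.W.A.T.Team",
        "S.w.a.T.",
        "S.w.a.T.1",
        "2001.A.Space.Odyssey" };

// Matches only the dots inside abbreviations: a dot between two single
// non-space, non-digit characters that are bounded by dots or the string ends.
String abbreviationDot = "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))";

for (String s : data) {
    // First drop the abbreviation dots, then turn every remaining dot into a space.
    System.out.println(s.replaceAll(abbreviationDot, "")
            .replace('.', ' '));
}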

Result:

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey
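
Note that an input ending in a dot still ends in a space here ("S.w.a.T." gives "SwaT "), while the question asks for "SWAT". Below is a minimal sketch of one way to wrap the same regex in a helper and trim the result. The class name `DotTitles` and the method name `clean` are made up for illustration, and calling `trim()` is an assumption about how leading and trailing whitespace should be treated.

```java
import java.util.regex.Pattern;

final class DotTitles {
    // Same pattern as in the answer: a dot between two single non-space,
    // non-digit characters that are bounded by dots or the string ends.
    private static final Pattern ABBREVIATION_DOT = Pattern.compile(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))");

    // Hypothetical helper: removes abbreviation dots, turns the remaining
    // dots into spaces, and trims the space left by a trailing dot.
    static String clean(String input) {
        String joined = ABBREVIATION_DOT.matcher(input).replaceAll("");
        return joined.replace('.', ' ').trim();
    }

    public static void main(String[] args) {
        String[] samples = {
                "Hello.World", "The.S.W.A.T.Team", "S.W.A.T.", "s.w.a.t.", "S.W.A.T.1." };
        for (String s : samples) {
            System.out.println(clean(s));
        }
    }
}
```

With this sketch, "S.W.A.T." becomes "SWAT", "s.w.a.t." becomes "swat", and "S.W.A.T.1." becomes "SWAT 1".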

In the regex I needed to escape the special meaning of the dot character. I could do that with \\. but I prefer [.].

So at the center of the regex we have a dot literal. This dot is surrounded by (?<=...) and (?=...), which are the parts of the look-around mechanism called look-behind and look-ahead.

  • Since a dot that needs to be removed has a dot (or the start of the data, ^) followed by a non-whitespace character \\S that is also a non-digit \\D before it, I can test for that using (?<=(^|[.])[\\S&&\\D])[.].

  • A dot that needs to be removed also has a non-whitespace, non-digit character and then another dot (or the end of the data, $) after it, which can be written as [.](?=[\\S&&\\D]([.]|$))
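
To illustrate why look-around matters here, the following is a small sketch comparing a pattern that consumes the neighbouring letters with one that only inspects them. It uses a simplified `[A-Z]` class rather than the full regex above, and the class and method names (`LookaroundDemo`, `consuming`, `lookaround`) are made up for the example.

```java
final class LookaroundDemo {
    // Consumes the letters on both sides of the dot and puts them back.
    // A letter used as the right side of one match cannot also be the left
    // side of the next match, so every second dot is skipped.
    static String consuming(String input) {
        return input.replaceAll("([A-Z])[.]([A-Z])", "$1$2");
    }

    // Only the dot is part of the match; the letters are merely checked by
    // look-behind and look-ahead, so adjacent dots can share a letter.
    static String lookaround(String input) {
        return input.replaceAll("(?<=[A-Z])[.](?=[A-Z])", "");
    }

    public static void main(String[] args) {
        System.out.println(consuming("S.W.A.T"));  // SW.AT
        System.out.println(lookaround("S.W.A.T")); // SWAT
    }
}
```

Because look-behind and look-ahead are zero-width, the replacement removes just the dot and every dot is tested against its neighbours in the original input.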


Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+..., can be replaced with [a-zA-Z] for English letters only, or with \\p{IsAlphabetic} for all letters in Unicode.
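
As a rough sketch of how the three character classes differ, the helper below builds the same regex around whichever class is passed in. The class name `LetterClassDemo`, the method `clean`, and the sample inputs are assumptions made for illustration only.

```java
final class LetterClassDemo {
    // Hypothetical helper: builds the answer's regex with the given character
    // class in place of [\S&&\D], removes the matching dots, and turns the
    // remaining dots into spaces.
    static String clean(String input, String letterClass) {
        String regex = "(?<=(^|[.])" + letterClass + ")[.](?=" + letterClass + "([.]|$))";
        return input.replaceAll(regex, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        String[] classes = { "[\\S&&\\D]", "[a-zA-Z]", "\\p{IsAlphabetic}" };
        // "\u00C9" is an accented capital E, a letter outside a-zA-Z.
        String[] inputs = { "\u00C9.U.A.", "Wow.!.?.End" };
        for (String input : inputs) {
            for (String letterClass : classes) {
                System.out.println(letterClass + " -> [" + clean(input, letterClass) + "]");
            }
        }
    }
}
```

In this sketch the punctuation in "Wow.!.?.End" is joined into "!?" only by [\\S&&\\D], while the accented letter is treated as part of an abbreviation by [\\S&&\\D] and \\p{IsAlphabetic} but not by [a-zA-Z].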

Pshemo
  • That's perfect! And it's pretty close to what I was able to come up with on my own. Need to work on my regex :-) Thanks! – Michell Bak Jun 13 '13 at 23:49
  • 1
    Would you mind explaining the regex? Can't seem to find the exact functionality of < and =. Might help others, too :-) – Michell Bak Jun 13 '13 at 23:54
  • Thanks! I was pretty sure it was possible to do that, but I couldn't find it documented. Regex is awesome! – Michell Bak Jun 14 '13 at 00:04
  • 1
    @MichellBak `(?=...)` and `(?!...)` are positive and negative lookahead (match this position if `...` starts here), respectively. `(?<=...)` and `(?<!...)` are their lookbehind counterparts (match this position if `...` ends here). And yes, this should be explained in the answer. – John Dvorak Jun 16 '13 at 14:14
  • 1
    @JanDvorak Thanks a lot! He did add an explanation after I posted the comment :-) – Michell Bak Jun 16 '13 at 14:16
  • @MichellBak In your question you mention that `It shouldn't work with digits, i.e. "1.a."` Could you give a few examples of input and expected output, since I am not sure whether my answer supports this condition? – Pshemo Jun 16 '13 at 16:18
  • @Pshemo Sure! Basically something like "1.2." or "1.a." shouldn't be considered an abbreviation. To give you an example using the S.W.A.T. example above, "S.W.A.T.1." should output "SWAT 1". – Michell Bak Jun 16 '13 at 16:46
  • 1
    @MichellBak I updated my answer a little to change `\\S` into `[\\S&&\\D]` which means class of non-space characters that are also non-digits. Could you check if that is working exactly as you need? If there are cases that aren't correct please let me know :) – Pshemo Jun 16 '13 at 17:10
  • 1
    @Pshemo It passes my roughly 500 unit tests for the algorithm, so I'd say it's golden. Thanks again! I've added a bounty that I'll give you when I can. – Michell Bak Jun 16 '13 at 17:56
0

Since every word starts with a capital (uppercase) letter, I would suggest that you first remove all dots (replace each one with the empty string, ""). Then iterate over all characters and put a space between a lowercase letter and a following uppercase letter. Also, if you encounter an uppercase letter followed by a lowercase letter, put the space before the uppercase letter.

It will work for all examples you provided, but I am not sure if there are any exceptions to my observation.
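
One possible reading of these rules, sketched with zero-width regex matches instead of an explicit character loop. The class name `CaseBoundarySplit` and the method `split` are hypothetical, and skipping the very first character (so that no leading space is added) is an assumption the description does not spell out.

```java
final class CaseBoundarySplit {
    // Hypothetical helper implementing the two rules literally:
    //  1. insert a space between a lowercase letter and a following uppercase letter;
    //  2. insert a space before an uppercase letter that is followed by a
    //     lowercase letter, except at the start of the string.
    static String split(String input) {
        String joined = input.replace(".", "");
        return joined.replaceAll("(?<=[a-z])(?=[A-Z])|(?<!^)(?=[A-Z][a-z])", " ");
    }

    public static void main(String[] args) {
        String[] samples = {
                "Hello.World", "This.Is.A.Test", "The.S.W.A.T.Team",
                "S.W.A.T.", "2001.A.Space.Odyssey" };
        for (String s : samples) {
            System.out.println(split(s));
        }
    }
}
```

Read this literally, the sketch handles "The.S.W.A.T.Team" and "This.Is.A.Test" as wanted, but "2001.A.Space.Odyssey" comes out as "2001A Space Odyssey", because neither rule separates a digit from a single uppercase letter that is followed by another uppercase letter. All-lowercase input such as "hello.world" is simply joined into one word.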

darijan
  • Sorry, forgot to add that it should work with both lower and uppercase characters. Added now. – Michell Bak Jun 13 '13 at 23:30
  • Not a problem. Do a preprocessing step: iterate over all characters and convert every character that follows a dot to uppercase. Make every other character lowercase. – darijan Jun 13 '13 at 23:33