
I have a string formatting question that I think is best solved with regular expressions. I was hoping to get advice on putting together the set of regexes, and on the order in which to apply them so that one does not cancel or override another.

Here are the requirements:

1) I need exactly one blank space before and after punctuation marks such as `.`, `,`, `;`, `:`, `!`, `?`, `-`, `_` and `...`.

So that the following sentence

"Instructions: Pay-attention! Will you? Except with respect to the information specifically incorporated by reference in this Form 10-K, the registrant's definitive proxy statement is not deemed to be filed as a part of this Form 10-K."

should become:

"Instructions : Pay - attention ! Will you ? Except with respect to the information specifically incorporated by reference in this Form 10 - K , the registrant's definitive proxy statement is not deemed to be filed as a part of this Form 10 - K ."

2) However, I want to preserve numbers and dollar amounts exactly as they are. For instance:

1,000.00 has to stay 1,000.00, and if it is written as 1.000,00 it has to stay the same as well, without any added spaces.

The same goes for $1,000.00, which has to remain $1,000.00.

What is the easiest way to preserve numbers while making sure that the punctuation marks `.`, `,`, `;`, `:`, `!`, `?`, `-`, `_` and `...` get a space before and after?

3) On top of that, runs of dots have to be normalized: more than three dots (e.g. `.....`) have to be reduced to `...`, and exactly two dots (`..`) have to be reduced to a single dot (`.`).

Pshemo
Carlos Antunes
  • Replace `(\D)([.,;:!?_-])(\D)` with `\1 \2 \3`; it will ignore the symbols that are surrounded by digits. `\.{2}` goes to `.`, and `\.{3,}` goes to `...`. In summary, you'll probably need three separate regexes. – RevanProdigalKnight Jul 25 '14 at 13:17
  • Revan, thanks, but the first rule does not work, although the dot rules are great. `System.out.println("Original: "+sentence); sentence = sentence.replaceAll("\\.{3,}"," ... "); sentence = sentence.replaceAll("\\.{2}"," . "); sentence = sentence.replaceAll("(\\D)([,;:!?_-])(\\D)", "\\1 \\2 \\3"); System.out.println("Filtered: "+sentence);` Maybe I did something wrong, because the Java version of the regex should print the punctuation character. Instead it prints 1, 2 or 3. – Carlos Antunes Jul 25 '14 at 14:55
  • @user3799994 Hmmm, I didn't see your response to my comment until just now (apologies for the delay). Try replacing with `$1 $2 $3` instead. I forgot that Java uses the `$` instead of `\ ` in regex replacement patterns. – RevanProdigalKnight Aug 02 '14 at 02:09

2 Answers


This code is written in C#; I hope it works the same way in Java.

string result = Regex.Replace(input, @"([a-zA-Z0-9])(\p{P})", "$1 $2");
result = Regex.Replace(result, @"(\p{P})([a-zA-Z0-9])", "$1 $2");
//result = Regex.Replace(result, @"\s+", " ");
// Undo the added spaces when the punctuation sits between two digits.
result = Regex.Replace(result, @"(\d)\s(\p{P})\s(\d)", "$1$2$3");
// Three or more dots become "..."; a pair that is not part of a longer run becomes ".".
result = Regex.Replace(result, @"\.{3,}", "...");
result = Regex.Replace(result, @"(?<!\.)\.{2}(?!\.)", ".");

--SJ
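
For Java, here is a rough sketch of a lookaround-based alternative. Instead of adding spaces everywhere and removing them again between digits, it only matches a punctuation mark that is not enclosed by digits on both sides. The names `PunctuationSpacer` and `space` are made up for illustration, the punctuation set is the one from the question, and runs of dots other than an exact `...` are not normalized here. Note that Java uses `$1` (not `\1`) in the replacement string.

```java
import java.util.regex.Pattern;

final class PunctuationSpacer {
    // Group 1 is either a three-dot ellipsis or a single punctuation mark
    // that does NOT have a digit on both sides (so 1,000.00 is left alone).
    // The surrounding \s* swallows existing whitespace so it is not doubled.
    private static final Pattern PUNCT = Pattern.compile(
            "\\s*(\\.{3}|(?<!\\d)[.,;:!?_-]|[.,;:!?_-](?!\\d))\\s*");

    static String space(String text) {
        return PUNCT.matcher(text)
                .replaceAll(" $1 ")        // pad every match with one space per side
                .replaceAll(" {2,}", " ")  // collapse double spaces (e.g. between "!" and "?")
                .trim();
    }
}
```

Because the digit check uses lookarounds, a mark with a digit on only one side (the `-` in `10-K`) is still spaced, while `1,000.00` and `1.000,00` pass through unchanged.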

codeninja.sj
First off, thanks for the help.

We still have a few issues, though. The solution from Pshemo for numbers, the one that removes the added spaces when they fall inside a number, is right on, so thanks for that.

But we need something similar for the other situations I describe below.

Also, the dot rules cancel each other out. Replacing a long run of dots with three dots works fine, but once the other replacements have run, the result ends up as `. . .`

The code I have is as follows:

    original = original.replaceAll("([a-zA-Z0-9])(\\p{P})", "$1 $2");
    original = original.replaceAll("(\\p{P})([a-zA-Z0-9])", "$1 $2");
    original = original.replaceAll("(\\d)\\s(\\p{P})\\s(\\d)", "$1$2$3");
    original = original.replaceAll("\\.{3,}", "..");
    original = original.replaceAll("\\.{2}", ".");
    original = original.replaceAll(" %","%");
    original = original.replaceAll(" - ","-");
    original = original.replaceAll(" ' ","'");

Problems are:

1) Emails, HTTP links and phone numbers get spaces around `@`, `(`, `)`, `:`, `/` etc.

So `\p{P}` is too broad: a colon may only be spaced if it is not part of an HTTP link, and `%`, `-` and `'` cannot be spaced either, hence the last three lines that undo those spaces. We therefore only want spaces around sentence-ending marks such as `!`, `?` and the period (unless it belongs to an abbreviation or a number), around commas (unless they are part of number formatting), and around a colon (unless it is part of an HTTP URL). This is what complicates things.

2) The goal with the period is to put a space before a period that ends a sentence, so "This is the end . " rather than "This is the end." But abbreviations like "U.S.A." must not become "U . S . A ."

3) I want more than three dots (`.....`) to become `...`, and exactly two dots (`..`) to become a single dot (`.`), but the rules above cancel one another.
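
One way to keep the two dot rules from cancelling each other, sketched in Java: collapse runs of three or more dots first, then only touch a pair of dots that is not part of a longer run (the lookarounds check that). `DotNormalizer` and `normalize` are made-up names, and the sketch assumes the target for long runs is `...`, as in the third requirement of the question.

```java
final class DotNormalizer {
    static String normalize(String text) {
        return text
                // Three or more dots collapse to exactly three.
                .replaceAll("\\.{3,}", "...")
                // Exactly two dots (no dot directly before or after) become one,
                // so the "..." produced above is not matched again.
                .replaceAll("(?<!\\.)\\.{2}(?!\\.)", ".");
    }
}
```

Any spacing step that runs afterwards has to treat `...` as a single token; otherwise it will split the ellipsis into `. . .` again.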

So to fix emails (`@` and dots) and URLs (`:`, `/` and dots), it looks like we could have a rule like the one for numbers, `"(\\d)\\s(\\p{P})\\s(\\d)", "$1$2$3"`, so that any added space is removed again.

According to RFC 2822, the rule for a correct email address is: "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"
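
An alternative to removing spaces after the fact is to never add them inside tokens that must stay intact. The Java sketch below does this in a single pass: the first alternatives of the pattern match URLs, emails, numbers and dotted abbreviations and copy them through unchanged, and only punctuation found outside those tokens is padded. Everything here is an assumption for illustration: the class name `ProtectedSpacer`, the deliberately simplified URL and email patterns (not the RFC 2822 expression), and the reduced punctuation set `, ; : ! ? .`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProtectedSpacer {
    private static final Pattern TOKEN = Pattern.compile(
            "(https?://\\S+"                          // group 1: URL (simplified)
            + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"     //   or email (simplified)
            + "|\\$?\\d+(?:[.,]\\d+)*"                //   or number, optional leading $
            + "|\\b(?:\\p{L}\\.){2,})"                //   or abbreviation such as U.S.A.
            + "|\\s*([,;:!?.])\\s*");                 // group 2: punctuation to pad

    static String space(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Protected tokens are copied as-is; punctuation gets one space per side.
            String replacement = m.group(1) != null
                    ? m.group(1)
                    : " " + m.group(2) + " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString().replaceAll(" {2,}", " ").trim();
    }
}
```

The order of the alternatives matters, because the regex engine takes the first one that matches at a given position. One known weakness of this sketch is that the URL pattern runs up to the next whitespace, so punctuation typed directly after a link is treated as part of the link.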

Now for phone numbers, you can have the following formats:

1) ###-###-####
2) #-###-###-####
3) ###-####
4) ##########
5) #######
6) (xxx) xxx-xxxx
7) (xx) xxxx-xxxx

And the list from the conventions here: http://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers

The issue with phone numbers only happens when they contain punctuation such as `-`, `(`, `)` or `+`, because we are adding spaces around it. Other than that they are fine.
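
As a rough Java sketch, the seven formats listed above can be recognised with one pattern, which could then be checked before any spaces are added around `-`, `(` or `)`. The names `PhoneFormats` and `isPhone` are made up, and the pattern is an assumption that covers only those seven shapes loosely (it also accepts a few close variants such as `#-###-####`); it does not attempt the national conventions from the Wikipedia list.

```java
import java.util.regex.Pattern;

final class PhoneFormats {
    private static final Pattern PHONE = Pattern.compile(
            "(?:\\d-)?(?:\\d{3}-)?\\d{3}-\\d{4}"    // formats 1-3: optional "#-" and "###-" prefixes
            + "|\\d{10}|\\d{7}"                      // formats 4-5: bare digits
            + "|\\(\\d{2,3}\\) \\d{3,4}-\\d{4}");    // formats 6-7: area code in parentheses

    // True when the whole string is one of the listed phone formats.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }
}
```

`matches()` requires the whole string to fit, so something like `10-K` or `1,000.00` is not mistaken for a phone number.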

I also found this code on Stack Overflow for phone numbers:

http://stackoverflow.com/questions/3367843/phone-number-regex-for-multiple-patterns-in-java

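    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public int Phone(String num)
    {
        try
        {
            String expression = "^(?=.{7,32}$)(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*((\\s?x\\s?|ext\\s?|extension\\s?)\\d{1,5}){0,1}$";
            Pattern pattern = Pattern.compile(expression);
            Matcher matcher = pattern.matcher(num);
            // x counts '('; y counts ')' that are followed by a digit or '-'.
            // Note: a ')' followed by a space, as in "(555) 123-4567", is not
            // counted, so that format is rejected by the x == y check.
            int x = 0, y = 0;
            char[] value = num.toCharArray();
            for (int i = 0; i < value.length; i++)
            {
                if (value[i] == '(')
                    x++;
                // A ')' as the last character reads past the end of the array;
                // the catch block below then reports the number as invalid.
                if (value[i] == ')' && ((value[i + 1] >= '0' && value[i + 1] <= '9') || value[i + 1] == '-'))
                    y++;
            }
            if (matcher.matches() && x == y)
                return 1; // valid number
            else
                return 0; // invalid number
        }
        catch (Exception ex)
        {
            return 0; // null input or a trailing ')' ends up here
        }
    }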

This here will remove dots in acronyms but not in URIs:

http://stackoverflow.com/questions/1279110/whats-the-regex-for-removing-dots-in-acronyms-but-not-in-domain-names

----

http://stackoverflow.com/questions/17098834/split-string-with-dot-while-handling-abbreviations

How about removing the dots that need to disappear with a regex, and then replacing the rest of the dots with spaces? The regex can look like `(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))`.

String[] data = { 
        "Hello.World", 
        "This.Is.A.Test", 
        "The.S.W.A.T.Team",
        "S.w.a.T.", 
        "S.w.a.T.1", 
        "2001.A.Space.Odyssey" };

for (String s : data) {
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}
Result:

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey
In the regex I needed to escape the special meaning of the dot character. I could do it with `\\.`, but I prefer `[.]`.

So at the center of the regex we have a dot literal. This dot is surrounded by `(?<=...)` and `(?=...)`. These are parts of the look-around mechanism, called look-behind and look-ahead.

A dot that needs to be removed has, before it, a dot (or the start of the data, `^`) followed by a character that is both non-whitespace (`\\S`) and non-digit (`\\D`), so I can test for it with `(?<=(^|[.])[\\S&&\\D])[.]`.

A dot that needs to be removed also has, after it, a non-whitespace, non-digit character followed by another dot (or the end of the data, `$`), which can be written as `[.](?=[\\S&&\\D]([.]|$))`.

Depending on your needs, `[\\S&&\\D]`, which besides letters also matches characters like `!@#$%^&*()-_=+...`, can be replaced with `[a-zA-Z]` for English letters only, or with `\\p{IsAlphabetic}` for all letters in Unicode.
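
To make the difference between the character classes concrete, here is a small Java sketch that runs the same remove-then-replace step with the original `[\\S&&\\D]` class and with the `\\p{IsAlphabetic}` variant. The class name `AcronymDots`, the constant names and the `x.#.y` input are made up for the comparison.

```java
final class AcronymDots {
    // Original class: any character that is neither whitespace nor a digit.
    static final String ANY_SYMBOL =
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))";
    // Stricter variant: only alphabetic characters count as acronym letters.
    static final String LETTERS_ONLY =
            "(?<=(^|[.])\\p{IsAlphabetic})[.](?=\\p{IsAlphabetic}([.]|$))";

    // Removes the dots matched by the regex, then turns the remaining dots into spaces.
    static String collapse(String s, String regex) {
        return s.replaceAll(regex, "").replace('.', ' ');
    }
}
```

With `ANY_SYMBOL`, the input `x.#.y` collapses to `x#y` because `#` is treated like a letter; with `LETTERS_ONLY` the dots around `#` survive and the result is `x # y`. Plain acronyms such as `The.S.W.A.T.Team` give `The SWAT Team` either way.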
Carlos Antunes