4

I want to create a Java regular expression that matches all words starting with a capital letter followed by any number of capital or small letters, where the letters may carry accents.

Examples:

Where

Àdónde

Rápido

Àste

Can you please help me with that?

Kai
Brad

3 Answers

8

Regex:

\b\p{Lu}\p{L}*\b

Java string:

"(?U)\\b\\p{Lu}\\p{L}*\\b"

Explanation:

\b      # Match at a word boundary (start of word)
\p{Lu}  # Match an uppercase letter
\p{L}*  # Match any number of letters (any case)
\b      # Match at a word boundary (end of word)
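As a minimal sketch of how this pattern might be applied to a whole text (the class name `CapitalizedWords` and the helper `find` are made up for illustration), `Matcher.find()` can collect every match. The sample text uses Unicode escapes so the source stays ASCII-only, and it assumes the accented letters are precomposed characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizedWords {
    // (?U) enables UNICODE_CHARACTER_CLASS, so \b treats accented letters as word characters
    private static final Pattern WORD = Pattern.compile("(?U)\\b\\p{Lu}\\p{L}*\\b");

    // Returns every word that starts with an uppercase letter and contains only letters
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // U+00C0 is A with grave, U+00F3 is o with acute, U+00E1 is a with acute
        System.out.println(find("Where fue \u00C0d\u00F3nde, tan R\u00E1pido?"));
    }
}
```

Unlike splitting on whitespace and calling `String.matches`, this also picks up words that are directly followed by punctuation.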

Caveat: This only works correctly in Java 7 or later, where the (?U) flag (UNICODE_CHARACTER_CLASS) makes \b Unicode-aware; in older versions you may need to substitute a longer sub-regex for \b. As @tchrist has shown (kudos to him), you may need to use

(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

for \b, so the Java string would look like this:

"(?:(?<=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])|(?<![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]))\\p{Lu}\\p{L}*(?:(?<=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])|(?<![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]))"
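Since the expanded boundary repeats the same character class four times, it may be easier to read when assembled from named pieces. The following is only a sketch under that idea: the names `ExplicitBoundary`, `WORD_CHAR`, `BOUNDARY` and `find` are invented here, and the character class is copied from the expanded \b shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExplicitBoundary {
    // Characters treated as part of a word (the class used in the expanded \b)
    static final String WORD_CHAR =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // A position that has a word character on exactly one side
    static final String BOUNDARY =
        "(?:(?<=" + WORD_CHAR + ")(?!" + WORD_CHAR + ")"
        + "|(?<!" + WORD_CHAR + ")(?=" + WORD_CHAR + "))";

    // Same shape as \b\p{Lu}\p{L}*\b, with \b replaced by the explicit boundary
    static final Pattern CAPITALIZED =
        Pattern.compile(BOUNDARY + "\\p{Lu}\\p{L}*" + BOUNDARY);

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = CAPITALIZED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // U+00E1 is a with acute
        System.out.println(find("Where es R\u00E1pido"));
    }
}
```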
Tim Pietzcker
  • Sometimes I wish I could up-vote an answer more than once. This is one of them. – Hovercraft Full Of Eels Sep 25 '11 at 20:59
  • I think for the letter thing, you probably want letters or marks, and for uppercase letters you might want titlecase ones, too. So perhaps `(?U)\b[\p{Lu}\p{Lt}][\pL\pM]*\b`. With some datasets you might also want `\p{upper}` which is the Unicode binary Uppercase property, as that covers a bit more than just letters, such as Roman numerals. – tchrist Sep 26 '11 at 00:03
  • \p{L}* may not match numbers – Kenston Choi Jul 22 '13 at 02:58
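To illustrate the comment about marks and titlecase letters, here is a small sketch (the class name `MarksAndTitlecase` and the helper `find` are hypothetical). It uses the pattern suggested in that comment on text whose accents are stored as separate combining marks, a case where `\p{L}*` alone would stop at the first mark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarksAndTitlecase {
    // Uppercase or titlecase first letter, then any mix of letters and combining marks
    static final Pattern WORD =
        Pattern.compile("(?U)\\b[\\p{Lu}\\p{Lt}][\\pL\\pM]*\\b");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Decomposed form: plain A followed by U+0300 (combining grave),
        // plain o followed by U+0301 (combining acute)
        System.out.println(find("va A\u0300do\u0301nde hoy"));
        // U+01C5 is a titlecase letter (general category Lt)
        System.out.println(find("\u01C5emal"));
    }
}
```

Normalizing the input to NFC with `java.text.Normalizer` before matching would be another way to handle decomposed accents, though not every letter-plus-mark sequence has a precomposed form.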
1

Code to detect the capitalized words in a given paragraph. In this case the input is read from the console.

import java.util.Scanner;

public class Problem9 {

    public static void main(String[] args) {
        // (?U) makes \b Unicode-aware; \p{Lu} is an uppercase letter, \p{L} any letter
        String pattern = "(?U)\\b\\p{Lu}\\p{L}*\\b";

        // Read one line of text from the console
        Scanner in = new Scanner(System.in);
        String line = in.nextLine();

        // Split on whitespace and print each word that matches the pattern as a whole
        for (String word : line.split("\\s+")) {
            if (word.matches(pattern)) {
                System.out.println(word);
            }
        }
    }
}

If you give it input such as

Input: This is my First Program

Output:

This

First

Program

agiles
0

You can do it without a regular expression. Check the first letter of each word by converting it to lower case and comparing the result with the original:

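        // seq is assumed to be the array of words; skip empty strings before calling charAt(0)
        if (!seq[i].isEmpty()) {
            String firstLetter = String.valueOf(seq[i].charAt(0));
            String lowerCase = firstLetter.toLowerCase();
            // An uppercase first letter differs from its lowercase form
            if (!firstLetter.equals(lowerCase))
                System.out.println(seq[i]);
        }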
   

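It works with accented letters as well, because toLowerCase() applies Unicode case mappings. Note that it checks only the first character, not whether the rest of the word consists of letters.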
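A self-contained sketch of this regex-free approach could look as follows. The class and method names (`FirstLetterCheck`, `startsWithCapital`, `capitalized`) are invented for illustration, and the version below compares code points instead of strings so that characters outside the Basic Multilingual Plane are read whole; this is a small variation, not the answer's exact code.

```java
import java.util.ArrayList;
import java.util.List;

class FirstLetterCheck {
    // True if the first code point changes when converted to lower case
    static boolean startsWithCapital(String word) {
        if (word.isEmpty()) {
            return false;
        }
        int first = word.codePointAt(0);
        return Character.toLowerCase(first) != first;
    }

    // Splits on whitespace and keeps the words whose first letter is a capital
    static List<String> capitalized(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (startsWithCapital(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00C0 is A with grave, U+00F3 is o with acute, U+00E1 is a with acute
        System.out.println(capitalized("Where fue \u00C0d\u00F3nde tan R\u00E1pido"));
    }
}
```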

Valeriy K.