18

How would I go about detecting if all alphabetic characters in a string (of >= 2 characters) are upper case? Ultimately, I'm trying to filter out chapter title names, that are rows in my data-set. So if a chapter title is "ARYA", I want that detected, same as "THE QUEEN'S HAND".

Here's what I'm trying, but it doesn't work:

library(dplyr)
library(stringr)

str_detect("THE QUEEN’S HAND", "^[[:upper:]]{2,}+$")
#> FALSE

The requirements I need:

  • Number of characters >= 2 because I'm ultimately using this to filter out chapter names, but sometimes there's a row where the word is "I", but that's not a chapter -- it's just a word. Though this could be filtered at a different point
  • Only alphabetic characters or apostrophes detected. Sometimes the row is "...", which I don't want detected. However, if I use a toupper(x) == (x) solution, this would be detected alongside something like "THE QUEEN'S HAND". I'm also trying to get rid of anything with exclamation points or periods, like "STOP THIS!"
Evan O.
  • I am not sure, but I think you need to split the string `str_detect(str_split("THE QUEEN’S HAND"," ")[[1]], "^[[:upper:]]{2,}")` – PKumar May 23 '18 at 03:22
  • @RonakShah it's because sometimes, in my data, my string will only be the word "I", but I need that, so I assume that >= 2 characters will filter out situations like these – Evan O. May 23 '18 at 03:53
  • @EvanO., I added an explanation to my answer and expanded it to _optionally_ match non-English letters as well. – 41686d6564 stands w. Palestine May 23 '18 at 04:04
  • Why would you do this with regex? It can be done quickly, and trivially, by checking if `x == toupper(x)` in R – Jack Aidley May 23 '18 at 09:27
  • Are you basically just asking if the string has no lower-case letters? Assuming of course that you're working in a locale where "lowercase" and "uppercase" are trivial concepts, like in English. Do you need to have that length check as part of the regex, or can it be done outside of it? – ilkkachu May 23 '18 at 14:04
  • @JackAidley good point, though I could've been more clear here -- there are a few edge cases where my string is "...", which I don't want to be detected. `toupper("...")` is equal to `"..."` – Evan O. May 23 '18 at 14:20
  • You need to clarify what you want exactly. Do you want to make sure the string only contains uppercase alphabetic characters and non-alphabetic characters (i.e. just lowercase characters are forbidden)? What's the story with the ">= 2 characters"? Does it mean the string must be at least 2 characters? That if it's only a single character it doesn't matter if it's uppercase? That the check only applies to 2 alphabetic character sequences in the string? What should be the result for the empty string, `a`, `ab`, `Ab`, `9`, `99`, `9a`, `9A`? What about accented characters? – jcaron May 23 '18 at 14:31
  • In that case it would really help if you clearly stated your full requirements rather than giving us an incomplete question. – Jack Aidley May 23 '18 at 14:48
  • good points. The only reason I did it without those requirements is so the question would be more easily generalizable for others' issues, but I guess not really. Edited it. – Evan O. May 23 '18 at 16:49
  • Is it exactly and only "..." you are looking for, or can other deviations from the general pattern occur? – Jack Aidley May 23 '18 at 16:50
  • Any non-alphabetic characters. So, an example of something I don't want detected is "STOP THIS! NOW!", which isn't a chapter title, but would be detected with `toupper(x)==x` – Evan O. May 23 '18 at 17:06
  • @EvanO. But "THE QUEEN'S HAND" also contains non-alphabetic characters, specifically a space and an '. – Jack Aidley May 23 '18 at 17:57
  • Ah good find. I thought that one worked, but I guess not. I gotta think about that more I guess. Now I realize that some punctuation (like ') is fine, but end-of-sentence punctuation (! or . ) isn't – Evan O. May 23 '18 at 18:23
  • Okay, I have updated my answer to reflect your new information but unless you can precisely define the question we cannot give a precise answer. – Jack Aidley May 28 '18 at 13:42

7 Answers

17

Reverse your logic

all alphabetic characters are upper case.

is the same as

not a single alphabetic character is lower case.

Code

If you really want to use a regex for this task, all you need to write is:

! str_detect("THE QUEEN’S HAND", "[[:lower:]]")

If you want to take the string length into account, you can add a logical OR :

nchar(str) < 2 || ! str_detect(str, "[[:lower:]]")

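As a rough sketch of how the length check and the "at least one uppercase letter" concern raised in the comments below might be folded in, the helper here is vectorized and uses base R's `grepl()` in place of `str_detect()`. The name `is_all_caps` is hypothetical, and unlike the `||` expression above it rejects strings shorter than 2 characters rather than accepting them.

```r
# Hypothetical helper (not part of the original answer): TRUE when a string
# has at least 2 characters, no lowercase letter, and at least one uppercase letter.
is_all_caps <- function(x) {
  nchar(x) >= 2 &
    !grepl("[[:lower:]]", x) &   # the reversed logic: no lowercase letter anywhere
    grepl("[[:upper:]]", x)      # rules out strings such as "..."
}

is_all_caps(c("THE QUEEN'S HAND", "The Queen's Hand", "I", "..."))
# [1]  TRUE FALSE FALSE FALSE
```

Because `&` is vectorized, the same function can be applied to a whole column at once.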

Eric Duminil
  • Don't forget to test that the string length is >1. Otherwise this and @jack aidley's answers are by far the simplest here. (+1) – Hong Ooi May 23 '18 at 14:52
  • I think there's also the requirement that there is at least 1 uppercase alphabetic character. If I read the question correctly, OP doesn't want to detect `...`, but your condition will match it. – Vilx- May 23 '18 at 17:32
14

You are probably (?) doing this at the wrong stage of your analysis

It appears that you are doing a textual analysis of ASOIAF and want to exclude chapter headings from it, but I think you are doing this at the wrong point in the analysis. Chapter headings are easy to identify in the original text because they are always at the top of the page, always centered, and always followed by a gap. These features would let you identify headings easily and reliably, but that information has already been thrown away by the time you try to identify them. If you control that earlier stage of the analysis, it will likely be easier to determine which entries are headings there instead.

You don't need Regex to do this

Although you specify Regex in the question title, it is not mentioned in the question body. I therefore assume you don't actually need it and have simply ended up looking for a Regex solution to a problem that does not require one.

The easiest way to test for all capital letters is x == toupper(x). toupper() converts all alphabetic characters to their upper-case form, so a string is all upper case exactly when it equals this transformed version.

Screening out strings of length less than 2 is also easy: simply add an nchar(x) >= 2 condition.

Your final requirement is less trivial, and you will need to work out exactly what condition you need to exclude. I suspect that if you're getting full paragraphs (?) then the best thing would be to look for quotation marks. Depending on the range of options that need to be matched, you may need to employ Regex here after all, but if it's only a few specific marks you could use str_detect (from the stringr package) with the fixed() option to detect them, as this will be considerably faster.

Regardless of whether you use Regex for this final stage, I would wrap the detection into a series of conditionals in a function rather than doing a single Regex search as this will be faster and, in my opinion, conceptually easier to understand.
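A minimal sketch of such a function, assuming the marks to reject are `!`, `.` and `?` (that list is a guess and should be adjusted to the data). The name `is_heading` is hypothetical, base R's `grepl(..., fixed = TRUE)` stands in for `str_detect()` with `fixed()`, and the `x != tolower(x)` condition comes from a comment on this answer.

```r
# Hypothetical helper: a series of simple vectorized conditions instead of one regex.
is_heading <- function(x) {
  long_enough  <- nchar(x) >= 2
  no_lowercase <- x == toupper(x)
  has_letters  <- x != tolower(x)   # rules out strings such as "..."
  # Assumed list of sentence punctuation to reject; adjust to the data.
  has_sentence_punct <- grepl("!", x, fixed = TRUE) |
    grepl(".", x, fixed = TRUE) |
    grepl("?", x, fixed = TRUE)
  long_enough & no_lowercase & has_letters & !has_sentence_punct
}

is_heading(c("ARYA", "THE QUEEN'S HAND", "I", "...", "STOP THIS!", "Arya"))
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```

Keeping each condition on its own line makes it easy to add or drop a rule once the exact requirement is pinned down.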

Jack Aidley
  • I did some testing, and this comes out about 7.5x faster than a Regex solution for strings about the length of your example regardless of composition. For very long strings (1000 characters) it is 1.2-4.5x faster depending on the composition of the string, but for very, very long strings (100,000 characters) it can be considerably slower or faster depending on the composition of the string. – Jack Aidley May 23 '18 at 12:00
  • nchar, not length (+1) – Hong Ooi May 23 '18 at 14:54
  • For very very long strings, something like this will probably be fast: `lc_raw = charToRaw(paste(letters, collapse = "")); any(lc_raw %in% charToRaw(test_string))`. Not vectorized over the test string, though. – Gregor Thomas May 23 '18 at 15:12
  • Add `&& x != tolower(x)` and you're all set. – Vilx- May 23 '18 at 17:35
10

Edit:

I initially thought you wanted to ignore lowercase letters if their length is <2. If you want to make sure that all the letters are uppercase, but only if the length of the whole string is >=2, a much simpler regex would do it:

^(?:[A-Z](?:[^A-Za-z\r\n])*){2,}$


Or if you want to match a string with length >=2 even if it contains only one letter (e.g., "A@"):

^(?=.{2})(?:[A-Z](?:[^A-Za-z\r\n])*)+$

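For anyone trying the two patterns above in R, here is a small sketch using base `grepl()`. The variable names are made up for illustration, the backslashes are doubled because the patterns sit inside R string literals, and `perl = TRUE` is used because R's default regex engine does not support lookaheads.

```r
# The two patterns from the edit above, written as R string literals.
at_least_two_letters <- "^(?:[A-Z](?:[^A-Za-z\\r\\n])*){2,}$"
at_least_two_chars   <- "^(?=.{2})(?:[A-Z](?:[^A-Za-z\\r\\n])*)+$"

x <- c("THE QUEEN'S HAND", "THE QUEEN's HAND", "I", "A@")

grepl(at_least_two_letters, x, perl = TRUE)
# [1]  TRUE FALSE FALSE FALSE

grepl(at_least_two_chars, x, perl = TRUE)
# [1]  TRUE FALSE FALSE  TRUE
```

The only difference between the two results is "A@", which has two characters but a single letter.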


Original answer:

Here's a regex-only solution that only checks if the characters are uppercase if they're >=2:

^(?:[A-Z]{2,}(?:[^A-Za-z\r\n]|[A-Za-z](?![a-z]))*)+$


Or:

^(?:[[:upper:]]{2,}(?:[^[:alpha:]\r\n]|[[:alpha:]](?![[:lower:]]))*)+$


Breakdown:

  • ^: Asserts position at the start of the line/string.
  • (?:: Start of the first non-capturing group.
    • [A-Z]: Matches any uppercase English letter (see note 1).
    • {2,}: Matches the previous token two or more times.
    • (?:: Start of the second non-capturing group.
      • [^A-Za-z\r\n]: Matches any character that isn't an English letter or a line terminator (see note 2).
      • |: Or.
      • [A-Za-z]: Matches any English letter (see note 3).
      • (?!: Start of a negative lookahead.
        • [a-z]: Matches any lowercase English letter (see note 4).
      • ): End of the negative lookahead.
    • ): End of the second non-capturing group.
    • *: Matches the previous group zero or more times.
  • ): End of the first non-capturing group.
  • +: Matches the previous group one or more times.
  • $: Asserts position at the end of the line/string.

Note: To treat the whole string as one line, simply remove the \r\n part.


  1. Use [[:upper:]] instead, if you want to match non-English letters.
  2. Use [^[:alpha:]\r\n] instead, if you want to match non-English letters.
  3. Use [[:alpha:]] instead, if you want to match non-English letters.
  4. Use [[:lower:]] instead, if you want to match non-English letters.
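To see how the original pattern behaves in R, the sketch below runs the `[A-Z]` variant through base `grepl()` with `perl = TRUE` (needed for the lookahead). The test strings are made up for illustration. Note that an isolated lowercase letter, as in "AB c", is tolerated, while a lowercase letter that follows another letter, as in "THE Queen", is not.

```r
# Original pattern, with backslashes doubled for an R string literal.
original <- "^(?:[A-Z]{2,}(?:[^A-Za-z\\r\\n]|[A-Za-z](?![a-z]))*)+$"

x <- c("THE QUEEN'S HAND", "THE Queen", "I", "AB c")

grepl(original, x, perl = TRUE)
# [1]  TRUE FALSE FALSE  TRUE
```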
5

The way I understand this:

  • if the length of the whole string is < 2, anything goes.
  • otherwise, the string can have anything except lowercase characters.

With that, I think this regex should be enough:

^(.|[^[:lower:]]{2,})$

Which is the disjunction of

  • Single character, anything goes: ^.$
  • Multiple characters, only non-lowercase: ^[^[:lower:]]{2,}$

Trying it out:

> str_detect("THE QUEEN’S HAND", "^(.|[^[:lower:]]{2,})$")
[1] TRUE
> str_detect("THE QUEEN’S HaND", "^(.|[^[:lower:]]{2,})$")
[1] FALSE
> str_detect("i", "^(.|[^[:lower:]]{2,})$")
[1] TRUE
> str_detect("I", "^(.|[^[:lower:]]{2,})$")
[1] TRUE
> str_detect("ii", "^(.|[^[:lower:]]{2,})$")
[1] FALSE
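
As a further check, here is a sketch that runs the same pattern over a few more strings with base `grepl()`; `perl = TRUE` is an assumption made so the example does not depend on stringr. It shows that punctuation-only and punctuated strings also pass under this reading of the requirements.

```r
pattern <- "^(.|[^[:lower:]]{2,})$"

x <- c("ARYA", "Arya", "...", "STOP THIS!")

grepl(pattern, x, perl = TRUE)
# [1]  TRUE FALSE  TRUE  TRUE
```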
muru
4

EDIT

To account for the number of characters in the string, we can use nchar with ifelse, without changing the regex.

str <- "THE QUEEN'S HAND"
ifelse(nchar(str) >= 2 , grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", str)), FALSE)
#[1] TRUE

str <- "THE QUEEN's HAND"
ifelse(nchar(str) >= 2 , grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", str)), FALSE)
#[1] FALSE

str <- "T"
ifelse(nchar(str) >= 2 , grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", str)), FALSE)
#[1] FALSE

Or as @Konrad Rudolph commented we can avoid the ifelse check using the logical operator.

str <- c("THE QUEEN'S HAND", "THE QUEEN's HAND", "T")
nchar(str) >= 2 & grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", str))
#[1]  TRUE FALSE FALSE
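
Since the goal in the question is to filter rows, here is a sketch of applying the same vectorized test to a column. The data frame `lines_df` and its `text` column are hypothetical stand-ins for the real data set.

```r
# Hypothetical data frame standing in for the asker's data set.
lines_df <- data.frame(
  text = c("ARYA", "She rode north.", "THE QUEEN'S HAND", "I"),
  stringsAsFactors = FALSE
)

# Strip everything that is not an English letter, then require only capitals.
letters_only <- gsub("[^A-Za-z]", "", lines_df$text)
is_title <- nchar(lines_df$text) >= 2 & grepl("^[A-Z]+$", letters_only)

titles <- lines_df[is_title, , drop = FALSE]    # rows detected as chapter titles
body   <- lines_df[!is_title, , drop = FALSE]   # everything else

titles$text
# [1] "ARYA"             "THE QUEEN'S HAND"
```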

Original Answer

We first remove all non-alphabetic characters by replacing them with the empty string ("") using gsub, and then compare the result with its toupper version.

text = gsub("[^a-zA-Z]+", "", "THE QUEENS HAND")

text
#[1] "THEQUEENSHAND"

text == toupper(text)
#[1] TRUE

For a string containing a lower-case letter, it will return FALSE:

text = gsub("[^a-zA-Z]+", "", "THE QUEENs HAND")
text == toupper(text)
#[1] FALSE

And as @SymbolixAU commented, we can keep the entire thing as regex only by using grepl and gsub

grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", "THE QUEEN'S HAND"))
#[1] TRUE

grepl("^[A-Z]+$" , gsub("[^A-Za-z]","", "THE QUEEN's HAND"))
#[1] FALSE
Ronak Shah
3

If I understand your question correctly, you want to accept strings that:

  1. Are two characters or longer.
  2. Do not contain lowercase letters.

If that is correct, you're not far from the correct answer. But your pattern accepts only uppercase letters, whereas it should accept anything except lowercase letters.

The following regex should work:

^[^[:lower:]]{2,}+$
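
A quick way to try it, sketched with base `grepl()`; `perl = TRUE` is an assumption made here so that the possessive `{2,}+` quantifier is handled without stringr. Note that a punctuation-only string such as "..." also passes, since it contains no lowercase letters.

```r
pattern <- "^[^[:lower:]]{2,}+$"

x <- c("THE QUEEN'S HAND", "The Queen's Hand", "I", "...")

grepl(pattern, x, perl = TRUE)
# [1]  TRUE FALSE FALSE  TRUE
```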
Édouard
2

To stay within the stringr context, use str_replace_all to get only alphabet characters, and then str_detect to check for uppercase:

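library(dplyr)    # provides %>%
library(stringr)

string1 <- "THE QUEEN’S HAND"
string2 <- "T"

# Keep only the letters, then require the whole remainder to be
# two or more uppercase letters (anchored with ^ and $, so a string
# containing any lowercase letter is rejected).
string1 %>%
  str_replace_all(., "[^a-zA-Z]", "") %>%
  str_detect(., "^[[:upper:]]{2,}$")
# TRUE

string2 %>%
  str_replace_all(., "[^a-zA-Z]", "") %>%
  str_detect(., "^[[:upper:]]{2,}$")
# FALSE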
andrew_reece