Split vectors by uppercases and lower cases

Question

I've read some nice questions about splitting uppercase and lowercase text, like this, and this, but I cannot get them to work with my data.

    # here is my data
    data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                ,"OTHER UPPER CASES   And other words"
                                , "Some lower cases        AND UPPER CASES"
                                ,"ONLY UPPER CASES"
                                ,"Only lower cases, maybe"
                                ,"UPPER lower UPPER!"))
    data
                                         text
    1 SOME UPPERCASES     And some Lower Cases
    2      OTHER UPPER CASES   And other words
    3  Some lower cases        AND UPPER CASES
    4                         ONLY UPPER CASES
    5                  Only lower cases, maybe
    6                        UPPER lower UPPER!

The desired result should be something like this:

       V1                  V2
1      SOME UPPERCASES     And some Lower Cases
2      OTHER UPPER CASES   And other words
3      AND UPPER CASES     Some lower cases
4      ONLY UPPER CASES    NA
5      NA                  Only lower cases, maybe
6      UPPER UPPER!        lower

So I want to separate all the words written in uppercase letters only from the others.

As a test, I tried a few approaches on a single row, but none of them works well:

strsplit(x= data$text[1], split="[[:upper:]]")   # error
gsub('([[:upper:]])', ' \\1', data$text[1])      # not good results

library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))   # no good results either

It is not clear what your rules are. You want to omit the `!` in the last row, but you keep `,` in the previous row. What are your precise rules here? — Wiktor Stribiżew, Oct 17 '18 at 08:48
Thanks a lot, there was a typo: the punctuation follows the case of the preceding letter. — s__, Oct 17 '18 at 08:53

Andre Elrico · Accepted Answer · 2018-10-17T08:57:17.577

data:

data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                            ,"OTHER UPPER CASES   And other words"
                            , "Some lower cases        AND UPPER CASES"
                            ,"ONLY UPPER CASES"
                            ,"Only lower cases, maybe"
                            ,"UPPER lower UPPER!"))

code:

library(magrittr)

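# words made of capital letters only, pasted back together per row
UpperCol <- regmatches(data$text, gregexpr("\\b[A-Z]+\\b", data$text)) %>%
  lapply(paste, collapse = " ") %>% unlist
# all remaining words; the negative lookahead requires perl = TRUE
notUpperCol <- regmatches(data$text, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", data$text, perl = TRUE)) %>%
  lapply(paste, collapse = " ") %>% unlist

result <- data.frame(I(UpperCol), I(notUpperCol))
result[result == ""] <- NA  # rows without any match become NA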

result:

#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower

The trick is regex, so learn regex.
Thanks to Wiktor Stribiżew for some optimization.
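
To see what the two patterns do, here is a minimal sketch on a single string from the data. The helper name `extract_words` is made up for illustration and is not part of the code above; `perl = TRUE` is passed for both patterns because the lookahead needs it.

```r
# Hypothetical helper (not from the answer): all matches of `pattern` in one string
extract_words <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "UPPER lower UPPER!"

# \b[A-Z]+\b : a run of capital letters with a word boundary on each side
upper <- extract_words(x, "\\b[A-Z]+\\b")

# \b(?![A-Z]+\b)[a-zA-Z]+\b : the negative lookahead rejects words that
# consist of capitals only, so only the other words are kept
not_upper <- extract_words(x, "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b")

print(upper)
print(not_upper)
```

Neither pattern matches punctuation, which is why the `!` and the `,` are missing from the result above.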

Be careful with [`[A-z]+`, it does not only match letters](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926). Also, `(?![A-Z]+\\b)` should be put after the leading `\b` to be more efficient (=> `"\\b(?![A-Z]+\\b)[a-zA-Z]+\\b"`). — Wiktor Stribiżew, Oct 17 '18 at 08:49
Thanks a lot, that part is missing in my bag of tools, I need to learn (+1). — s__, Oct 17 '18 at 08:54
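
The `[A-z]` pitfall mentioned in the comment can be checked with a short sketch. It assumes `perl = TRUE`, so that the range is interpreted by code point; the variable names are illustrative only.

```r
# Two letters, three symbols that sit between "Z" and "a" in ASCII, and a digit
chars <- c("a", "Z", "^", "_", "[", "1")

# [A-z] spans ASCII 65 to 122, which includes [ \ ] ^ _ and the backtick
loose  <- grepl("^[A-z]$", chars, perl = TRUE)
# [A-Za-z] lists the two letter ranges separately and matches letters only
strict <- grepl("^[A-Za-z]$", chars, perl = TRUE)

print(data.frame(chars, loose, strict))
```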

score 1 · Answer 2 · answered Oct 17 '18 at 09:07

An approach using the stringi package:

library(stringi)
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)

res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
res[res == "NA"] <- NA
res[res == ""] <- NA

which gives:

> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower

s_baldur · Answer 3 · 2018-10-17T10:12:21.413

separate <- function(x) {
  x <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
}


do.call(rbind, lapply(data$text, separate))

     [,1]                [,2]                     
[1,] "SOME UPPERCASES"   "And some Lower Cases"   
[2,] "OTHER UPPER CASES" "And other words"        
[3,] "AND UPPER CASES"   "Some lower cases"       
[4,] "ONLY UPPER CASES"  ""                       
[5,] ""                  "Only lower cases, maybe"
[6,] "UPPER UPPER!"      "lower"
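
The matrix above holds empty strings where the question asks for NA. One possible way to get a two-column data frame with NA is sketched below. The name `separate_chr` is hypothetical, the column names `V1` and `V2` are borrowed from the desired output in the question, and the splitting logic of `separate()` is repeated so that the block runs on its own.

```r
text <- c("SOME UPPERCASES     And some Lower Cases",
          "ONLY UPPER CASES",
          "Only lower cases, maybe",
          "UPPER lower UPPER!")

# Same logic as separate(), but returning a named character vector
separate_chr <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(V1 = paste(words[!with_lower], collapse = " "),
    V2 = paste(words[with_lower], collapse = " "))
}

# Stack the per-row vectors into a matrix, then convert to a data frame
res <- as.data.frame(do.call(rbind, lapply(text, separate_chr)),
                     stringsAsFactors = FALSE)
res[res == ""] <- NA  # empty strings become NA
print(res)
```

Because the split is on whitespace, punctuation stays attached to its word, which matches the rule given in the question's comments. A token without any lowercase letter, such as a number, ends up in `V1`.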
