
I've read some nice questions about splitting uppercase and lowercase text, like this and this, but I cannot get them to work with my data.

    # here is my data
    data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                ,"OTHER UPPER CASES   And other words"
                                , "Some lower cases        AND UPPER CASES"
                                ,"ONLY UPPER CASES"
                                ,"Only lower cases, maybe"
                                ,"UPPER lower UPPER!"))
    data
                                         text
    1 SOME UPPERCASES     And some Lower Cases
    2      OTHER UPPER CASES   And other words
    3  Some lower cases        AND UPPER CASES
    4                         ONLY UPPER CASES
    5                  Only lower cases, maybe
    6                        UPPER lower UPPER!

The desired result should be something like this:

       V1                  V2
1      SOME UPPERCASES     And some Lower Cases
2      OTHER UPPER CASES   And other words
3      AND UPPER CASES     Some lower cases        
4      ONLY UPPER CASES    NA
5      NA                  Only lower cases, maybe
6      UPPER UPPER!         lower

So the goal is to separate the words made up only of uppercase letters from all the others.

As a test, I tried a few approaches on a single row, but none of them works well:

strsplit(x= data$text[1], split="[[:upper:]]")   # error
gsub('([[:upper:]])', ' \\1', data$text[1])      # not good results

library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))                                        # neither good results
Jaap
s__
  • It is not clear what your rules are. You want to omit the `!` in the last row, but you keep `,` in the previous row. What are your precise rules here? – Wiktor Stribiżew Oct 17 '18 at 08:48
  • Thanks a lot, there was a typo; the punctuation follows the case of the preceding letter. – s__ Oct 17 '18 at 08:53

3 Answers


data:

data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                            ,"OTHER UPPER CASES   And other words"
                            , "Some lower cases        AND UPPER CASES"
                            ,"ONLY UPPER CASES"
                            ,"Only lower cases, maybe"
                            ,"UPPER lower UPPER!"))

code:

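library(magrittr)

# Words made up only of uppercase letters, collapsed into one string per row
UpperCol    <- regmatches(data$text , gregexpr("\\b[A-Z]+\\b",data$text)) %>% lapply(paste, collapse = " ") %>% unlist
# All remaining words: the negative lookahead skips the all-uppercase ones (needs perl = TRUE)
notUpperCol <- regmatches(data$text , gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b",data$text, perl = TRUE)) %>% lapply(paste, collapse = " ") %>% unlist

# I() keeps the columns as character rather than factor
result <- data.frame(I(UpperCol), I(notUpperCol))
# Rows without any match come out as empty strings; turn them into NA
result[result == ""] <- NA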

result:

#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower

  • The trick is regex, so it pays to learn it.
  • Thanks to Wiktor Stribiżew for some optimization.
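
To see what the two patterns do, here is a minimal sketch on a single string from the question. The variable names `x`, `upper` and `other` are only illustrative, and `perl = TRUE` is passed to both calls for consistency (the lookahead pattern requires it).

```r
x <- "UPPER lower UPPER!"

# Whole words consisting only of A-Z
upper <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x, perl = TRUE))[[1]]

# Whole words that are NOT entirely A-Z: the negative lookahead (?!...)
# rejects a word start when an all-uppercase word follows
other <- regmatches(x, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", x, perl = TRUE))[[1]]

print(upper)  # "UPPER" "UPPER"
print(other)  # "lower"
```

Neither pattern matches punctuation, so the `!` is dropped here rather than staying with the preceding word.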
Andre Elrico
  • Be careful with [`[A-z]+`, it does not only match letters](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926). Also, `(?![A-Z]+\\b)` should be put after the leading `\b` to be more efficient (=> `"\\b(?![A-Z]+\\b)[a-zA-Z]+\\b"`). – Wiktor Stribiżew Oct 17 '18 at 08:49
  • Thanks a lot, that part is missing in my bag of tools, I need to learn (+1). – s__ Oct 17 '18 at 08:54
  • @WiktorStribiżew Thank you. Both remarks are valuable! – Andre Elrico Oct 17 '18 at 08:58
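
The `[A-z]` remark in the comments can be checked with a short sketch. It assumes `perl = TRUE`, so the range is interpreted by code point, and the names `between`, `wide` and `strict` are made up for this illustration.

```r
# The six ASCII characters that sit between "Z" and "a"
between <- c("[", "\\", "]", "^", "_", "`")

# [A-z] spans the whole range from "A" to "z", so it accepts all of them
wide <- grepl("^[A-z]$", between, perl = TRUE)

# [A-Za-z] lists the two letter ranges separately and rejects them
strict <- grepl("^[A-Za-z]$", between, perl = TRUE)

print(data.frame(char = between, wide = wide, strict = strict))
```

This is why the pattern in the answer uses `[a-zA-Z]` instead of `[A-z]`.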

An approach using the stringi package:

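library(stringi)
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# SIMPLIFY = FALSE keeps the result a list even when every row yields the same number of words
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)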

res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
res[res == "NA"] <- NA
res[res == ""] <- NA

which gives:

> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower
Jaap
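separate <- function(x) {
  # Split on whitespace; as.character() also handles factor input
  x <- unlist(strsplit(as.character(x), "\\s+"))
  # Flag every word that contains at least one lowercase letter
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  # First element: words without lowercase letters; second: the rest
  list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
}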


do.call(rbind, lapply(data$text, separate))

     [,1]                [,2]                     
[1,] "SOME UPPERCASES"   "And some Lower Cases"   
[2,] "OTHER UPPER CASES" "And other words"        
[3,] "AND UPPER CASES"   "Some lower cases"       
[4,] "ONLY UPPER CASES"  ""                       
[5,] ""                  "Only lower cases, maybe"
[6,] "UPPER UPPER!"      "lower"  
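
If a data frame with `V1`/`V2` columns and `NA` for missing parts is wanted, as in the question, the same idea could be adapted roughly as follows. This is only a sketch: `split_by_case`, `text` and `res` are hypothetical names, and only four of the question's strings are used.

```r
split_by_case <- function(x) {
  # Split on whitespace; as.character() lets the function accept a factor too
  words <- unlist(strsplit(as.character(x), "\\s+"))
  # A word counts as "not uppercase" as soon as it contains a lowercase letter
  with_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(V1 = paste(words[!with_lower], collapse = " "),
    V2 = paste(words[with_lower], collapse = " "))
}

text <- c("SOME UPPERCASES     And some Lower Cases",
          "ONLY UPPER CASES",
          "Only lower cases, maybe",
          "UPPER lower UPPER!")

# rbind the named character vectors into a matrix, then convert to a data frame
res <- as.data.frame(do.call(rbind, lapply(text, split_by_case)),
                     stringsAsFactors = FALSE)
# Empty strings mean "nothing found on this side"; replace them with NA
res[res == ""] <- NA
print(res)
```

Because the split is on whitespace only, punctuation stays attached to its word, so `UPPER!` lands in `V1` and `cases,` in `V2`. The `as.character()` call matters on R versions before 4.0.0, where `data.frame()` turns strings into factors by default.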
s_baldur