x = c(1,2,3,4,5)
y = c("AA","BB","CC", "AAAA","BBBB")
data1 = data.frame(x,y)
data1

^^ I want the output to be the number of times a 4-letter value occurs in the y column. The desired output here is 2.

I want to count the number of times a 4-letter factor value occurs in a given column of a data frame. How do I do this?

dd2019
  • Welcome to Stack Overflow. It's difficult to answer your question without a minimal reproducible example. Can you give us an example of your column, dataframe & the 4 letter factor please? – nopassport1 Feb 06 '20 at 16:55
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 06 '20 at 19:21
  • *Any* four-letter value or just those values where the exact same letters are repeated four times? – Chris Ruehlemann Feb 06 '20 at 20:32

2 Answers

If you want to extract and count only the factor values that have exactly 4 letters (any letters, not necessarily the same one repeated), you can do the following. Note that \w also matches digits and underscores, not just letters.

Step 1--Define a pattern to match:

# ^ and $ anchor the match, so longer values such as "AAAAA" are not counted
pattern <- "^\\w{4}$"

Step 2--Define a function to extract only the raw matches:

extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))

Step 3--Apply the function to the data of interest:

extract(data1$y)

This returns:

[1] "AAAA" "BBBB"

Step 4--To count the number of matches you can use length:

length(extract(data1$y))
[1] 2
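
A more compact sketch of the same count, assuming "4 letters" means the whole value is exactly four alphabetic characters. The `[[:alpha:]]` class, the `^`/`$` anchors, and the names `values`, `is_four_letters` and `n_four` are choices made here for illustration, not part of the steps above. The last two lines show why anchoring can matter: an unanchored pattern also matches inside longer values.

```r
# Sample data from the question
x <- c(1, 2, 3, 4, 5)
y <- c("AA", "BB", "CC", "AAAA", "BBBB")
data1 <- data.frame(x, y)

# as.character() guards against y being stored as a factor (the default in R < 4.0)
values <- as.character(data1$y)

# TRUE where the whole value consists of exactly four letters
is_four_letters <- grepl("^[[:alpha:]]{4}$", values)

# Summing a logical vector counts the TRUEs
n_four <- sum(is_four_letters)
n_four

# Unanchored patterns also match inside longer values
grepl("\\w{4}", "AAAAA")    # TRUE
grepl("^\\w{4}$", "AAAAA")  # FALSE
```

Using `grepl` with `sum` avoids building the intermediate vector of matches when only the count is needed.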

EDIT: Alternatively you can use str_extract from the package stringr:

Step 1--Store the result in a vector extr:

library(stringr)
# ^ and $ restrict the match to values of exactly 4 characters
extr <- str_extract(data1$y, "^\\w{4}$")

Step 2--Count the non-missing results. str_extract returns NA wherever the pattern does not match, and is.na tests each element for NA, so negating it with ! and subsetting keeps only the matches, which length then counts:

length(extr[!is.na(extr)])
[1] 2
Chris Ruehlemann

If the strings in column y always consist of letters only, you can use nchar:

sum(nchar(as.vector(data1$y))==4)

# > sum(nchar(as.vector(data1$y))==4)
#   2
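
The caveat about letters matters because `nchar` counts every character, including digits and spaces. Below is a minimal sketch of the difference, using a made-up vector (the extra values `"12 4"` and `NA` are not in the question's data) and an additional `grepl` check that is an assumption about what "4 letters" should mean.

```r
# Hypothetical data: the question's values plus a non-letter string and a missing value
y <- c("AA", "BB", "CC", "AAAA", "BBBB", "12 4", NA)
values <- as.character(y)

# nchar() returns NA for missing strings, so na.rm = TRUE keeps the sum defined
n_chars <- nchar(values)
any_four <- sum(n_chars == 4, na.rm = TRUE)

# Restrict the count to values made up of letters only
letters_only_four <- sum(n_chars == 4 & grepl("^[[:alpha:]]+$", values), na.rm = TRUE)

any_four           # 3: "12 4" is counted as well
letters_only_four  # 2
```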
ThomasIsCoding
  • This worked, thank you! Do you know of a way to count the number of unique values that have 4 characters? In this case it would still be 2, but suppose "AAAA" appeared twice: with that extra "AAAA" this code would return 3. Is there a way to count only the unique values that meet this condition, so that the output is 2? – dd2019 Feb 09 '20 at 22:18
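
One possible way to handle the follow-up in the comment above is to filter first and then deduplicate with `unique`. This is a sketch under the assumption that the column looks like the question's data with one extra "AAAA"; the names `values` and `four` are illustrative.

```r
# The question's values with a duplicated "AAAA", as described in the comment
y <- c("AA", "BB", "CC", "AAAA", "BBBB", "AAAA")
values <- as.character(y)

# Keep only the values with exactly four characters
four <- values[nchar(values) == 4]

length(four)          # 3: every occurrence is counted
length(unique(four))  # 2: each distinct value is counted once
```

If the per-value frequencies are also of interest, `table(four)` shows how often each distinct value occurs.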