11

I have a keyword (e.g. 'green') and some text ("I do not like them Sam I Am!").

I'd like to see how many of the characters in the keyword ('g', 'r', 'e', 'e', 'n') occur in the text (in any order).

In this example the answer is 3 - the text doesn't have a G or R but has two Es and an N.

The catch is that once a character in the text has been matched with a character in the keyword, it can't be used to match a different character in the keyword.

For example, if my keyword was 'greeen', the number of "matching characters" is still 3 (one N and two Es) because there are only two Es in the text, not 3 (to match the third E in the keyword).

How can I write this in R? This is just ticking something at the edge of my memory - I feel like it's a common problem but just worded differently (sort of like sampling with no replacement, but "matches with no replacement"?).

E.g.

keyword <- strsplit('greeen', '')[[1]]
text <- strsplit('idonotlikethemsamiam', '')[[1]]
# how many characters in keyword have matches in text,
# with no replacement?
# Attempt 1: sum(keyword %in% text)
# PROBLEM: returns 4 (all three Es match, but only two in text)

More examples of expected input/outputs (keyword, text, expected output):

  • 'green', 'idonotlikethemsamiam', 3 (E, E, N)
  • 'greeen', 'idonotlikethemsamiam', 3 (E, E, N)
  • 'red', 'idonotlikethemsamiam', 2 (E and D)
mathematical.coffee

2 Answers

14

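The function pmatch() is great for this. With its default duplicates.ok = FALSE, each element of the text can be matched at most once, and any keyword character left without a match comes back as NA. It would be natural to reach for length() here, but length() has no na.rm option, so sum(!is.na(...)) is used to count the matches instead.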

keyword <- unlist(strsplit('greeen', ''))
text <- unlist(strsplit('idonotlikethemsamiam', ''))

sum(!is.na(pmatch(keyword, text)))

# [1] 3

keyword2 <- unlist(strsplit("red", ''))
sum(!is.na(pmatch(keyword2, text)))

# [1] 2
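
For comparison, the same "matches with no replacement" count can be sketched as a multiset intersection: tabulate the characters on each side and sum the smaller count for every character they share. This is only a minimal sketch, not part of the pmatch() approach above; the helper name count_matches is hypothetical.

```r
# Hypothetical helper: count keyword characters that can be matched
# in the text when each text character may be used at most once.
count_matches <- function(keyword, text) {
  kw <- table(strsplit(keyword, "")[[1]])  # per-character counts in the keyword
  tx <- table(strsplit(text, "")[[1]])     # per-character counts in the text
  shared <- intersect(names(kw), names(tx))
  # For each shared character, only min(count in keyword, count in text)
  # occurrences can be paired up.
  sum(pmin(as.vector(kw[shared]), as.vector(tx[shared])))
}

count_matches("green",  "idonotlikethemsamiam")
count_matches("greeen", "idonotlikethemsamiam")
count_matches("red",    "idonotlikethemsamiam")
```

On the three examples from the question this should agree with the pmatch() result (3, 3 and 2). Working from counts rather than positions may be preferable for long texts, since it never searches the text once per keyword character.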
N8TRO
-1

Perhaps you are looking to find the UNIQUE components of your keyword? Try:

keyword <- unique(strsplit('greeen','')[[1]])
Gary Weissman
  • No, I'm not. I'm trying to find the *number* of characters in the keyword that occur within the text, where a character from the text that has matched one from the keyword cannot be used to match another character from the keyword. My desired output is **numeric**, not a vector of characters. – mathematical.coffee Feb 18 '13 at 01:54