R - count matches between characters of one string and another, no replacement

Question

I have a keyword (e.g. 'green') and some text ("I do not like them Sam I Am!").

I'd like to see how many of the characters in the keyword ('g', 'r', 'e', 'e', 'n') occur in the text (in any order).

In this example the answer is 3 - the text doesn't have a G or R but has two Es and an N.

The difficulty is that once a character in the text has been matched with a character in the keyword, it can't be used to match a different character in the keyword.

For example, if my keyword was 'greeen', the number of "matching characters" is still 3 (one N and two Es) because there are only two Es in the text, not 3 (to match the third E in the keyword).

How can I write this in R? This is just ticking something at the edge of my memory - I feel like it's a common problem but just worded differently (sort of like sampling with no replacement, but "matches with no replacement"?).

E.g.

keyword <- strsplit('greeen', '')[[1]]
text <- strsplit('idonotlikethemsamiam', '')[[1]]
# how many characters in keyword have matches in text,
# with no replacement?
# Attempt 1: sum(keyword %in% text)
# PROBLEM: returns 4 (all three Es match, but only two in text)

More examples of expected input/outputs (keyword, text, expected output):

'green', 'idonotlikethemsamiam', 3 (E, E, N)
'greeen', 'idonotlikethemsamiam', 3 (E, E, N)
'red', 'idonotlikethemsamiam', 2 (E and D)

N8TRO · Accepted Answer · 2013-02-18T04:59:45.203

14

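The function pmatch() works well for this. By default (duplicates.ok = FALSE) each element of the text can be matched at most once, and keyword characters left without a match come back as NA. It might seem natural to count the result with length(), but length() has no na.rm option, so sum(!is.na()) is used to count the matches instead.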

keyword <- unlist(strsplit('greeen', ''))
text <- unlist(strsplit('idonotlikethemsamiam', ''))

sum(!is.na(pmatch(keyword, text)))

# [1] 3

keyword2 <- unlist(strsplit("red", ''))
sum(!is.na(pmatch(keyword2, text)))

# [1] 2
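The approach above can be wrapped into a small function and cross-checked against a count-based version. This is only a sketch: the names count_matches and count_matches_table are hypothetical helpers, not part of the original answer, and the table-based variant is an alternative technique (taking the smaller per-character count on each side), not what pmatch() does internally.

```r
# Hypothetical helper: count keyword characters that can be paired with
# distinct characters of the text (each text character is used at most once).
count_matches <- function(keyword, text) {
  k <- unlist(strsplit(keyword, ""))
  txt <- unlist(strsplit(text, ""))
  # pmatch() defaults to duplicates.ok = FALSE, so an element of txt that
  # has already been matched is not reused for a later element of k
  sum(!is.na(pmatch(k, txt)))
}

# Hypothetical cross-check: for every character present in both strings,
# take the smaller of its two counts, then add those up.
count_matches_table <- function(keyword, text) {
  k <- table(unlist(strsplit(keyword, "")))
  txt <- table(unlist(strsplit(text, "")))
  common <- intersect(names(k), names(txt))
  sum(pmin(as.integer(k[common]), as.integer(txt[common])))
}

count_matches("green", "idonotlikethemsamiam")        # 3
count_matches("greeen", "idonotlikethemsamiam")       # 3
count_matches("red", "idonotlikethemsamiam")          # 2
count_matches_table("greeen", "idonotlikethemsamiam") # 3
```

Both versions agree on the three examples from the question. The count-based one avoids relying on partial matching at all, which may be easier to reason about if the inputs are ever something other than single characters.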

edited Feb 18 '13 at 04:59

answered Feb 18 '13 at 02:15

N8TRO


Aha! I spent ages trying `match` and `charmatch` and didn't notice that `pmatch` didn't allow duplicates (exactly what I wanted). Thanks a lot! – mathematical.coffee Feb 18 '13 at 02:56
@mathematical.coffee pmatch does allow duplicates, but its default is FALSE. – N8TRO Feb 18 '13 at 03:00
yep, sorry, I mean "didn't notice that `pmatch` *had the option* to not allow duplicates" – mathematical.coffee Feb 18 '13 at 03:08
`unlist` may avoid problems when there is a vector of character strings rather than subsetting `[[1]]` – Brandon Bertelsen Feb 18 '13 at 04:54
@BrandonBertelsen Thanks, I was just using the example from the question for my data. Editing now. – N8TRO Feb 18 '13 at 04:58
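To make the duplicates.ok point from the comments concrete, here is a brief sketch comparing the two settings on the question's data. The variable names no_reuse and reuse are illustrative only.

```r
keyword <- unlist(strsplit("greeen", ""))
text <- unlist(strsplit("idonotlikethemsamiam", ""))

# Default (duplicates.ok = FALSE): each element of text is matched at most
# once, so the third "e" of the keyword finds nothing left to match.
no_reuse <- pmatch(keyword, text)

# duplicates.ok = TRUE: an element of text may be matched repeatedly, so
# every "e" of the keyword is matched to the first "e" in text.
reuse <- pmatch(keyword, text, duplicates.ok = TRUE)

sum(!is.na(no_reuse))  # 3
sum(!is.na(reuse))     # 4
```

With duplicates.ok = TRUE the count is the same as sum(keyword %in% text), which is the over-count described in the question.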

score -1 · Answer 2 · answered Feb 18 '13 at 01:53


Perhaps you are looking to find the UNIQUE components of your keyword? Try:

keyword <- unique(strsplit('greeen','')[[1]])


Gary Weissman


no, I'm not. I'm trying to find the *number* of characters in the keyword that occur within the text, where if a character from the text matches one from the keyword, it cannot be used to match another character from the keyword. My desired output is **numeric**, not a vector of characters. – mathematical.coffee Feb 18 '13 at 01:54
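As a rough illustration of why neither %in% on its own nor unique() gives the desired number, the three counts can be compared side by side for 'greeen'. This is a sketch using the question's example data; the names with_reuse, distinct_only and no_replacement are made up for the comparison.

```r
keyword <- unlist(strsplit("greeen", ""))
text <- unlist(strsplit("idonotlikethemsamiam", ""))

# Every "e" in the keyword counts, even though text has only two
with_reuse <- sum(keyword %in% text)

# Each distinct keyword character counts at most once ("e" and "n")
distinct_only <- sum(unique(keyword) %in% text)

# Each text character can be used for one keyword character only
no_replacement <- sum(!is.na(pmatch(keyword, text)))

c(with_reuse, distinct_only, no_replacement)  # 4 2 3
```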
