
I have a column in a data set that looks like this:

$abc.MSFT

$MSFT

$msft

$abcMSFTxyz

I want the following output:

$MSFT  

$msft

My attempt at filtering:

dplyr::filter(Tweets, grepl("\\bMMM$\\b", ignore.case = TRUE, V2))

returns:

$abc.MSFT

$MSFT

$msft

or

dplyr::filter(Tweets,grepl("^$MMM$", ignore.case = TRUE, V2))

returns:

Wagish
  • Try `grepl("^\\$msft$", ignore.case = TRUE, x)`. The dollar sign is a special character in regex. If you want to match a literal `$`, you must escape it with backslashes. – Pierre L Oct 08 '15 at 01:48
  • @ Pierre \\bMMM$\\b works but it cannot ignore punctuation characters at the start of the string. I want it to ignore all other punctuation characters except for the $ character. – Wagish Oct 08 '15 at 01:51
  • @ Pierre, okay I'll try. Thanks – Wagish Oct 08 '15 at 01:52
  • If `"\\bMMM$\\b"` matches `"MSFT"` for you then I know nothing about regex and you should be the one answering questions. I'd be really interested in seeing the code example. (Not just you saying "it returns..") – Pierre L Oct 08 '15 at 01:54
  • @ Pierre Thanks, your solution works! – Wagish Oct 08 '15 at 01:57
  • @ Pierre http://stackoverflow.com/questions/17906003/detecting-word-boundary-with-regex-in-data-frame-in-r , I used that example to try :) – Wagish Oct 08 '15 at 01:59
  • That method could work too, but you have to remember to include the backslashes `"^\\b\\$msft\\b"` – Pierre L Oct 08 '15 at 02:04

1 Answer

One way to approach it:

x <- c("$abc.MSFT", "$MSFT", "$msft", "$abcMSFTxyz")
Tweets <- data.frame(V2 = x, stringsAsFactors = FALSE)
Tweets
#           V2
#1   $abc.MSFT
#2       $MSFT
#3       $msft
#4 $abcMSFTxyz

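# your way: no row contains "MMM", and the unescaped "$" acts as an
# end-of-string anchor rather than a literal dollar sign
dplyr::filter(Tweets, grepl("\\bMMM$\\b", ignore.case = TRUE, V2))
#[1] V2
#<0 rows> (or 0-length row.names)

# another way: escape the "$" so it matches a literal dollar sign
dplyr::filter(Tweets, grepl("^\\$msft$", ignore.case = TRUE, V2))
#     V2
#1 $MSFT
#2 $msft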
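
To see why the patterns behave so differently, here is a minimal base R sketch that reuses the sample vector above. The variable names (`ends_with_word`, `unescaped`, `escaped`, `exact`) are illustrative and not from the original post. `perl = TRUE` in the first call is an assumption, added only to pin down the word-boundary semantics. The first call uses MSFT instead of MMM to show what probably produced the three rows reported in the question.

```r
x <- c("$abc.MSFT", "$MSFT", "$msft", "$abcMSFTxyz")
Tweets <- data.frame(V2 = x, stringsAsFactors = FALSE)

# Unescaped "$" is an end-of-string anchor, so this pattern means
# "the word MSFT at the end of the string" and ignores the leading "$".
# perl = TRUE is an assumption here, used for well-defined \b behaviour.
ends_with_word <- grepl("\\bMSFT$\\b", Tweets$V2, ignore.case = TRUE, perl = TRUE)

# "^$MSFT$" asks for start of string, then end of string, then "MSFT":
# no non-empty string can satisfy that.
unescaped <- grepl("^$MSFT$", Tweets$V2, ignore.case = TRUE)

# "\\$" in R source is the regex \$, which matches a literal dollar sign.
escaped <- grepl("^\\$msft$", Tweets$V2, ignore.case = TRUE)

# Regex-free alternative for an exact, case-insensitive comparison.
exact <- tolower(Tweets$V2) == "$msft"

# Keep only the rows whose whole value is "$msft" in any letter case.
Tweets[escaped, , drop = FALSE]
```

For a whole-string comparison like this one, the `tolower()` version may be the simpler choice, since it avoids escaping altogether.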

From regex help:

...there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called "metacharacters".

And the fix:

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2, which is written as "1\\+1=2" in an R string because the backslash itself must be escaped there. Otherwise, the plus sign has a special meaning.
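
The quoted rule can be checked directly in R with a short sketch. The variable names are illustrative, and `fixed = TRUE` is shown as one alternative to escaping rather than something the quoted help text recommends.

```r
target <- "1+1=2"

# Unescaped "+" is a quantifier: the pattern means "one or more 1s, then 1=2",
# which the literal text "1+1=2" does not contain.
plain <- grepl("1+1=2", target)

# "\\+" in R source is the regex \+, a literal plus sign.
escaped_plus <- grepl("1\\+1=2", target)

# fixed = TRUE treats the whole pattern as literal text, so no escaping is needed.
literal <- grepl("1+1=2", target, fixed = TRUE)

c(plain = plain, escaped_plus = escaped_plus, literal = literal)
```

Note that `fixed = TRUE` also turns off anchors such as `^` and `$`, and cannot be combined with `ignore.case = TRUE`, so it would not replace the escaped pattern used for the `$msft` filter.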

Research regular expressions. They are worth the time to learn, whatever language you program in.

Pierre L