This is my first post on Stack Overflow, so I'll try to explain my problem as succinctly as possible.
The problem is pretty simple. I'm trying to identify tokens that mix letters and digits (and, in a second pass, tokens that mix letters, digits, and symbols such as _ or -) and remove them. I looked at previous questions on Stack Overflow and found a solution that looks good:
https://stackoverflow.com/a/21456918/7467476
I tried the provided regex (slightly modified) in Notepad++ on some sample data to check that it works, and it does. I then used the same regex in R, with gsub replacing each match with "" (code below).
# Intended to remove tokens of 8+ characters that mix lowercase, uppercase, digits, and _ or -
replace_alnumsym <- function(x) {
  return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}", "", x, perl = TRUE))
}
# Intended to remove tokens of 8+ characters that mix lowercase, uppercase, and digits
replace_alnum <- function(x) {
  return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{8,}", "", x, perl = TRUE))
}
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
output1 <- sapply(sample, replace_alnum)
output2 <- sapply(sample, replace_alnumsym)
The code runs without errors, but the output still contains the strings I expected to be removed (output below). The output format is also strange: each element is printed twice, once without quotes and once within quotes.
> output1
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
The desired result would be:
> output1
abc def ghi abcd efgh WQWEQtWe_232
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh
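One thing I noticed while writing this up: the `.*` in each lookahead can scan to the end of the whole string rather than stopping at the end of the current token, so the checks are not really limited to one token. Below is a minimal sketch of token-scoped versions. It assumes tokens consist only of the characters in [A-Za-z0-9_-] and are separated by anything else. The names `tok`, `alnum_pattern`, `alnumsym_pattern`, and `sample_data` are my own and not from the linked answer, and I have only checked this against the two sample strings.

```r
# Sketch only: pattern names are my own, not from the linked answer.
tok <- "[A-Za-z0-9_-]"  # assumed set of characters that can belong to one token

# Alphanumeric-only tokens: each lookahead scans [A-Za-z0-9]* instead of .*,
# so it cannot look past the current token. The lookbehind and the trailing
# negative lookahead require the match to cover a whole token.
alnum_pattern <- paste0(
  "(?<!", tok, ")",
  "(?=[A-Za-z0-9]*[a-z])(?=[A-Za-z0-9]*[A-Z])(?=[A-Za-z0-9]*[0-9])",
  "[A-Za-z0-9]{8,}",
  "(?!", tok, ")"
)

# Tokens that must additionally contain _ or -.
alnumsym_pattern <- paste0(
  "(?<!", tok, ")",
  "(?=", tok, "*[a-z])(?=", tok, "*[A-Z])(?=", tok, "*[0-9])(?=", tok, "*[_-])",
  tok, "{8,}",
  "(?!", tok, ")"
)

sample_data <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# gsub() is vectorised, so it can be applied to the whole vector directly;
# trimws() drops the space left behind by a removed trailing token.
out1 <- trimws(gsub(alnum_pattern, "", sample_data, perl = TRUE))
out2 <- trimws(gsub(alnumsym_pattern, "", sample_data, perl = TRUE))

print(out1)  # "abc def ghi"  "abcd efgh WQWEQtWe_232"
print(out2)  # "abc def ghi WQE34324Wweasfsdfs23234"  "abcd efgh"
```

I am not sure this covers every case in my real data (other symbols, for example, would need to be added to the character classes).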
I think I'm probably overlooking something very obvious.
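One guess about the doubled printing: it may simply be how R prints a named character vector. As far as I can tell, sapply() uses the input strings as names by default (USE.NAMES = TRUE) when it is given a character vector, so the unquoted line would be the names and the quoted line the values. A minimal sketch, where `strip_digits` and `input` are made-up stand-ins used only to illustrate this:

```r
# Hypothetical stand-in function, only to show how sapply() names its result.
strip_digits <- function(x) gsub("[0-9]+", "", x)

input <- c("abc 123", "def 456")

named <- sapply(input, strip_digits)                       # names are the original strings
unnamed <- sapply(input, strip_digits, USE.NAMES = FALSE)  # plain character vector

print(names(named))  # "abc 123" "def 456"
print(unnamed)       # "abc " "def "
```

If that is right, unname() on the result or USE.NAMES = FALSE should give a plain vector.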
I'd appreciate any assistance you can provide.
Thanks