String identification in text files using regex in R

Question

This is my first post on Stack Overflow and I'll try to explain my problem as succinctly as possible.

The problem is pretty simple. I'm trying to identify strings containing alphanumeric characters, and strings containing alphanumeric characters with symbols, and remove them. I looked at previous questions on Stack Overflow and found a solution that looks good.

https://stackoverflow.com/a/21456918/7467476

I tried the provided regex (slightly modified) in Notepad++ on some sample data just to see if it's working (and yes, it works). Then I used the same regex in R with gsub to replace the matching strings with "" (code given below).

replace_alnumsym <- function(x) {
    return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}", "", x, perl = TRUE))
}
replace_alnum <- function(x) {
    return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{8,}", "", x, perl = TRUE))
}
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

output1 <- sapply(sample, replace_alnum)
output2 <- sapply(sample, replace_alnumsym)

The code runs without errors, but the output still contains the strings; they haven't been removed (output below). The output format is also strange: each element is printed twice (once without and once within quotes).

> output1
  abc def ghi WQE34324Wweasfsdfs23234                abcd efgh WQWEQtWe_232 
"abc def ghi WQE34324Wweasfsdfs23234"              "abcd efgh WQWEQtWe_232" 

> output2
  abc def ghi WQE34324Wweasfsdfs23234                abcd efgh WQWEQtWe_232 
"abc def ghi WQE34324Wweasfsdfs23234"              "abcd efgh WQWEQtWe_232"

The desired result would be:

> output1
  abc def ghi                 abcd efgh WQWEQtWe_232 

> output2
  abc def ghi WQE34324Wweasfsdfs23234                abcd efgh

I think I'm probably overlooking something very obvious.

Appreciate any assistance that you can provide.

Thanks

Thanks.....this works....appreciate it!!! Can you provide an explanation of the regex.?...that would really help... — DS_1, Jan 25 '17 at 08:10
You already accepted another answer. Use it if it works for you. — Wiktor Stribiżew, Jan 25 '17 at 08:21
@Wiktor Stribizew your solution is much nicer than mine. I'd gladly concede the points if you want to post an answer. — rosscova, Jan 25 '17 at 08:38
@Gautam Venkatraman I suggest you remove your acceptance of my answer, and accept Wiktor's, it's a more tidy way to do what you're trying to do (not to mention the excellent explanation of the expressions being used). — rosscova, Jan 25 '17 at 09:40

rosscova · Accepted Answer · 2017-01-25T08:00:21.643

Your outputs are not printing twice; they're being output as named vectors. The unquoted line shows the element names, and the quoted line is the output itself. You can see this by checking the length of an output:

length( sapply( sample, replace_alnum ) )
# [1] 2

So you know there are only 2 elements there.

If you want them without the names, you can unname the vector on output:

unname( sapply( sample, replace_alnum ) )
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
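If the names are not needed at all, two other options avoid creating them in the first place. The following is only a minimal sketch: `nchar` is a stand-in function and `strings` is an illustrative vector name, neither taken from the question. `sapply()` accepts `USE.NAMES = FALSE`, and functions that are already vectorised (as `gsub()` is) can be called on the whole vector without `sapply()`.

```r
# Illustrative input; `strings` is a hypothetical name chosen to avoid
# masking base::sample
strings <- c("abc def", "abcd efgh ijk")

# sapply() over a character vector uses the elements themselves as names
named <- sapply(strings, nchar)
names(named)   # the two input strings

# USE.NAMES = FALSE suppresses the names
plain <- sapply(strings, nchar, USE.NAMES = FALSE)
names(plain)   # NULL

# Vectorised functions need no sapply() at all; gsub() also works element-wise
direct <- nchar(strings)
no_spaces <- gsub(" ", "", strings, fixed = TRUE)
```
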

Alternatively, you can rename them something more to your liking:

output <- sapply( sample, replace_alnum )
names( output ) <- c( "name1", "name2" )
output
#                                 name1                                 name2 
# "abc def ghi WQE34324Wweasfsdfs23234"              "abcd efgh WQWEQtWe_232"

As for the regex itself, it sounds like you want to apply it to each space-separated substring separately. If so, and if you want the strings put back together afterwards, you need to split them by space, apply the function to the pieces, then recombine them at the end.

# split by space (leaving results in separate list items for recombining later)
input <- sapply( sample, strsplit, split = " " )

# apply your function on each list item separately
output <- sapply( input, replace_alnumsym )

# recombine each list item as they looked at the start
output <- sapply( output, paste, collapse = " " )
output <- unname( output )

output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh "

And if you want to clean up the trailing white space:

output <- trimws( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
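The split, apply and recombine steps above could also be wrapped into one function. This is a sketch under a few assumptions rather than part of the original answer: `drop_matching_tokens` and `strings` are hypothetical names, the question's pattern is anchored with `^` and `$` so it has to cover a whole token, and matching tokens are dropped rather than blanked, which means no trailing space is left to trim.

```r
# Hypothetical helper: drop every space-separated token that matches
# `pattern`, then glue the remaining tokens back together.
drop_matching_tokens <- function(x, pattern) {
  tokens <- strsplit(x, " ", fixed = TRUE)
  vapply(
    tokens,
    function(tok) paste(tok[!grepl(pattern, tok, perl = TRUE)], collapse = " "),
    character(1)
  )
}

# The question's alnum-plus-symbol pattern, with ^ and $ added here so that
# it is tested against a complete token.
alnumsym <- "^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}$"

strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
cleaned <- drop_matching_tokens(strings, alnumsym)
cleaned
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
```
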

Hi...Thanks for that explanation! However, I'm still facing the problem with the regex not working in R. Any solution to that? — DS_1, Jan 25 '17 at 07:37
Are you trying to analyse each text string (separated by a space) separately? — rosscova, Jan 25 '17 at 07:44
I'm looking at a vector of strings and want to remove from each string, the substring (separated by a space) matching the regex. — DS_1, Jan 25 '17 at 07:50
Thanks you so much for the solution....really appreciate it!!! But I'm just curious...why didn't it work directly on the complete string....why to change it to a list first?? Shouldn't it automatically have identified the substring and replaced it? — DS_1, Jan 25 '17 at 08:13
No. A space is a part of the string, so `gsub` was analysing the entire string as one part. The string as a whole didn't fulfil your regex test, so it wasn't replaced. — rosscova, Jan 25 '17 at 08:14

score 1 · Answer 2 · answered Jan 25 '17 at 09:26

I'm not sure a regex-based approach is ideal here, but it is possible if we assume that:

alnumsym "words" are non-whitespace chunks delimited with whitespace and start/end of string
alnum words are chunks of letters/digits separated with non-letter/digits or start/end of string.

Then, you may use

sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
gsub("\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}", "", sample, perl=TRUE) ## replace_alnum
gsub("(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}", "", sample, perl=TRUE) ## replace_alnumsym

See the R demo online.

Pattern 1 details:

\\b - a leading word boundary (we need to match a word)
(?=\\w*[a-z]) - (a positive lookahead) after 0+ word chars (\w*) there must be a lowercase ASCII letter
(?=\\w*[A-Z]) - an uppercase ASCII letter must be inside this word
(?=\\w*\\d) - and a digit, too
\\w{8,} - if all the conditions above matched, match 8+ word chars

Note that to avoid matching _ (\w matches _) you need to replace \w with [^\W_].
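As a sketch of that substitution (the names `strings`, `with_underscore` and `alnum_only` are illustrative, and `trimws()` is added only to tidy the leftover space), the first pattern removes `WQWEQtWe_232` as well, while the `[^\W_]` variant leaves it alone:

```r
strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# Pattern 1 as given: \w also matches "_", so the second token qualifies too
with_underscore <- "\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}"

# Pattern 1 with every \w swapped for [^\W_] (word characters except "_")
alnum_only <- "\\b(?=[^\\W_]*[a-z])(?=[^\\W_]*[A-Z])(?=[^\\W_]*\\d)[^\\W_]{8,}"

loose  <- trimws(gsub(with_underscore, "", strings, perl = TRUE))
strict <- trimws(gsub(alnum_only, "", strings, perl = TRUE))

loose
# [1] "abc def ghi" "abcd efgh"
strict
# [1] "abc def ghi"            "abcd efgh WQWEQtWe_232"
```
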

Pattern 2 details:

(?<!\\S) - (a negative lookbehind) no non-whitespace can appear immediately to the left of the current location (a whitespace or start of string should be in front)
(?=\\S*[a-z]) - after 0+ non-whitespace chars, there must be a lowercase ASCII letter
(?=\\S*[A-Z]) - the non-whitespace chunk must contain an uppercase ASCII letter
(?=\\S*[0-9]) - and a digit
(?=\\S*[_-]) - and either _ or -
[A-Za-z0-9_-]{8,} - if all the conditions above matched, match 8+ ASCII letters, digits or _ or -.
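To check what a pattern would remove before calling `gsub`, the matches can be extracted instead of replaced. A minimal sketch with base R's `gregexpr()` and `regmatches()`, using pattern 2 (the names `strings`, `alnumsym` and `hits` are illustrative):

```r
strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
alnumsym <- "(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}"

# gregexpr() locates every match; regmatches() extracts the matched text,
# returning one character vector per input string
hits <- regmatches(strings, gregexpr(alnumsym, strings, perl = TRUE))
hits
# [[1]]
# character(0)
#
# [[2]]
# [1] "WQWEQtWe_232"
```

An empty `character(0)` entry means nothing in that string would be replaced.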
