I am trying to figure out how to use regex to clean up my text data in R. Where could I find an easy tutorial for it? I have looked around online, but when I try something out on regex101 I hardly ever get a match, and when I do, nothing changes once I run it in R. Consider this example:

Before <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"
After <- "ACEMOGLU, D., ROBINSON, J., 2012, WHY NATIONS FAIL, (3)"


> Aftergsub <- gsub("\\([\\d][\\d][\\d][\\d]\\)", "new", "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)")
> print(Aftergsub)
[1] "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"

Of course the "new" should be an expression that would make Before look like After. But I don't even get to change Before into anything else, based on my pattern.

In other words, how do I only change a ")" to a "," if it has been preceded by 4 digits? Thanks!

SCW

1 Answer

Your pattern does not work because the TRE regex flavor (R's default engine) does not support shorthand character classes inside bracket expressions. Use either [[:digit:]] or [0-9], but not [\\d], which actually matches a \ or the letter d.
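A quick way to see the difference, as a minimal sketch (the variable names are only for illustration): with the default engine, `[\\d]` behaves as a set holding a backslash and the letter d, while `[0-9]` or `perl = TRUE` gives the digit match the question expected.

```r
x <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"

# Default (TRE) engine: [\d] is a set holding "\" and "d", not a digit class
d_matches_letter <- grepl("[\\d]", "d")
d_matches_digit  <- grepl("[\\d]", "5")

# The pattern from the question: no digit is matched, so nothing is replaced
unchanged <- gsub("\\([\\d][\\d][\\d][\\d]\\)", "new", x)

# Same idea with a real digit range inside the brackets
with_range <- gsub("\\([0-9][0-9][0-9][0-9]\\)", "new", x)

# The PCRE engine (perl = TRUE) does accept \d inside brackets
with_perl <- gsub("\\([\\d][\\d][\\d][\\d]\\)", "new", x, perl = TRUE)

print(c(d_matches_letter, d_matches_digit))
print(unchanged)
print(with_range)
print(with_perl)
```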

You may use

Before <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"
gsub("\\((\\d{4})\\)", "\\1,", Before)
## => [1] "ACEMOGLU, D., ROBINSON, J., 2012, WHY NATIONS FAIL, (3)"

See the R online demo

NOTE that I am using \\d without square brackets (i.e. outside a bracket expression). The TRE regex engine treats "\\d{4}" as a pattern that matches exactly four digits. It is equivalent to [0-9]{4} or [[:digit:]]{4}.
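To check that equivalence yourself, here is a small sketch that runs the three spellings against the same input. The `patterns` and `results` names are only for illustration.

```r
Before <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"

# Three ways to write "four digits inside parentheses, captured as Group 1"
patterns <- c(
  "\\((\\d{4})\\)",          # shorthand class, outside any brackets
  "\\(([0-9]{4})\\)",        # explicit range
  "\\(([[:digit:]]{4})\\)"   # POSIX character class
)

# Apply each pattern with the same replacement: Group 1 followed by a comma
results <- vapply(
  patterns,
  function(p) gsub(p, "\\1,", Before),
  character(1),
  USE.NAMES = FALSE
)

# All three spellings should give the same string
print(unique(results))
```

The trailing "(3)" is left alone in every case because it holds one digit, not four.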

Details

  • \\( - a literal (
  • (\\d{4}) - Group 1: any four digits
  • \\) - a literal )
  • \\1 - the backreference to Group 1 value
Wiktor Stribiżew
  • you could also pass the variable instead of value in the `gsub`, `gsub("\\((\\d{4})\\)", "\\1,", Before)` – Sagar Aug 11 '17 at 14:32
  • Yeah, that is what OP will do in the end. I will include the code here. – Wiktor Stribiżew Aug 11 '17 at 14:33
  • I need to run such a gsub over many different entries, so if Sagar says I can pass the variable instead of the value, would that also work on other entries? If so, what would it look like, i.e. gsub("\\((\\d{4})\\)", "\\1,", "my.dataframe")? – SCW Aug 11 '17 at 14:34
  • @SteffiWinkler Usually, it looks like `df$col1 <- gsub("\\((\\d{4})\\)", "\\1,", df$col1)` but you may run it on the whole data frame, too (but with a bit different code, see [example here](https://stackoverflow.com/questions/14871249/can-i-use-gsub-on-each-element-of-a-data-frame)). – Wiktor Stribiżew Aug 11 '17 at 14:35
  • I think this would do then? `Split.CR[] <- lapply(Split.CR, gsub, pattern = '\\((\\d{4})\\)', replacement = '\\1,')` – SCW Aug 11 '17 at 14:38
  • If you need to process a column, use `gsub` directly as I have shown. Else, adjust as you see fit. – Wiktor Stribiżew Aug 11 '17 at 14:40
  • It seems to work, but I don't completely get the explanation that I should use [[:digit:]] or [0-9], whereas you still use a "d". – SCW Aug 11 '17 at 14:41
  • @SteffiWinkler: I used `"\\d"`, **not** `"[\\d]"`. That is a huge difference for a TRE regex engine. I added a bit of an explanation under the snippet. – Wiktor Stribiżew Aug 11 '17 at 14:47
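For the data frame case discussed in the comments, a minimal sketch might look like the following. The data frame `df`, its column names, and the second row are made up for illustration. Note that the replacement string is just `"\\1,"`, because the column itself reaches `gsub` as its `x` argument.

```r
# Hypothetical data frame; only the first reference comes from the question
df <- data.frame(
  ref  = c("ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)",
           "AUTHOR, A., (1999) SOME TITLE, (1)"),
  note = c("see (2012)", "none"),
  stringsAsFactors = FALSE
)

pattern <- "\\((\\d{4})\\)"

# One column: gsub is vectorised, so it handles every entry at once
one_col <- gsub(pattern, "\\1,", df$ref)

# Every column: lapply passes each column to gsub as `x`;
# assigning into df[] keeps the data frame structure
df[] <- lapply(df, gsub, pattern = pattern, replacement = "\\1,")

print(one_col)
print(df)
```

The `lapply` form converts every column to character, so it is probably best kept to data frames that hold only text.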