How to remove everything before first occurrence of comma in R

Question

I am trying to remove the text up to and including the first comma in a string that has one or more commas. For some reason, every pattern I try removes everything up to the last comma instead.

The string looks like:

OCR - (some text), Variant - (some text), Bad Subtype - (some text)

and my regex is returning:

Bad Subtype - (some text)

when the desired output is:

Variant - (some text), Bad Subtype - (some text)

Variant is not guaranteed to be in the second position.

# select all rows whose Tags column begins with OCR
clean <- subset(all, grepl("^OCR", all$Tags))
# trim the OCR text up to the first comma, and store in a new column called Tag
clean$Tag <- gsub(".*,", "", clean$Tag)

or

clean$Tag <- gsub(".*\\,", "", clean$Tag)

or

clean$Tag<- sub(".*,", "", clean$Tag)

etc.

The problem you're having is your `*` operator is greedy. See [What do lazy and greedy mean in the context of regular expressions?](https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) for more info. — Ian Campbell, Apr 01 '21 at 20:25
I am also calling str_trim(x, side="both"), once I have finished trimming — maedel, Apr 01 '21 at 22:06
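
To illustrate the greedy behaviour described in the comment above, here is a minimal sketch comparing the greedy `.*,` with its lazy counterpart `.*?,`. The variable names `s`, `greedy`, and `lazy` are only for illustration, and `perl = TRUE` is an assumption made so that the lazy quantifier follows PCRE semantics.

```r
# Illustrative string modelled on the one in the question
s <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text)"

# Greedy: .* grabs as much as possible, so the match ends at the LAST comma
greedy <- sub(".*,", "", s)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST comma
lazy <- sub("^.*?,", "", s, perl = TRUE)

print(greedy)
print(lazy)
```

Note that both results keep the space that followed the comma, which is why a trimming step such as `str_trim` is still needed afterwards.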

score 5 · Accepted Answer · answered Apr 01 '21 at 20:08

Here is one regex that does the job.

x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

sub("^[^,]*,", "", x)
#[1] " Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

Explanation

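1. ^ matches the beginning of the string;
2. ^[^,]* matches any character except "," repeated zero or more times, starting at the beginning of the string;
3. ^[^,]*, is the pattern in point 2 above followed by a comma.

This pattern is replaced by the empty string "", so only the text after the first comma remains.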
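
As a rough sketch of how the pattern above behaves on a whole vector, the example below applies it to several strings at once, including one with no comma. The vector `tags` and its contents are hypothetical, and the extra `\\s*` is an addition to the answer's pattern that also drops the whitespace after the comma.

```r
# Hypothetical input vector; the last element has no comma at all
tags <- c("OCR - (a), Variant - (b), Bad Subtype - (c)",
          "OCR - (a), Bad Subtype - (c)",
          "OCR - (a)")

# Drop everything up to and including the first comma,
# plus any whitespace that follows it
result <- sub("^[^,]*,\\s*", "", tags)

print(result)
```

Because `sub` only replaces where the pattern matches, a string without a comma is returned unchanged, so it may be worth checking for that case separately if such rows should be blanked out instead.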

score 5 · Answer 2 · answered Apr 01 '21 at 20:14

An option with trimws from base R

trimws(x, whitespace = "^[^,]+,\\s*")

-output

#[1] "Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

data

x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"
