I am trying to remove the text up to and including the first comma in a string that has one or more commas. For some reason, my attempts always remove everything up to the last comma instead, for all strings.

The string looks like:

OCR - (some text), Variant - (some text), Bad Subtype - (some text)

and my regex is returning:

Bad Subtype - (some text)

when the desired output is:

Variant - (some text), Bad Subtype - (some text)

Variant is not guaranteed to be in the second position.

# select all strings beginning with OCR in the column Tags
clean <- subset(all, grepl("^OCR", all$Tags))
# trim the OCR text up to the first comma, and store in a new column called Tag
clean$Tag <- gsub(".*,", "", clean$Tag)

or

clean$Tag <- gsub(".*\\,", "", clean$Tag)

or

clean$Tag<- sub(".*,", "", clean$Tag)

etc.

Wiktor Stribiżew
maedel
  • The problem you're having is that your `*` operator is greedy. See [What do lazy and greedy mean in the context of regular expressions?](https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) for more info. – Ian Campbell Apr 01 '21 at 20:25
  • The dupe tagged is different. It is about whitespace. – akrun Apr 01 '21 at 20:37
  • I am also calling str_trim(x, side="both") once I have finished trimming. – maedel Apr 01 '21 at 22:06
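
To make the greedy behaviour mentioned in the comments concrete, here is a minimal sketch on a shortened version of the example string. The variable names are only for illustration, and `perl = TRUE` is passed so that the lazy quantifier follows PCRE semantics.

```r
x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text)"

# Greedy: .* grabs as much as possible, so the match ends at the LAST comma
greedy <- sub(".*,", "", x)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST comma
lazy <- sub(".*?,", "", x, perl = TRUE)

print(greedy)  # " Bad Subtype - (some text)"
print(lazy)    # " Variant - (some text), Bad Subtype - (some text)"
```

Both results keep the space that followed the removed comma, which is why a trimming step afterwards is still useful.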

2 Answers

Here is one regex that does the job.

x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

sub("^[^,]*,", "", x)
#[1] " Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

Explanation

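  1. ^ matches the beginning of the string;
  2. ^[^,]* matches, from the beginning of the string, any character except "," repeated zero or more times;
  3. ^[^,]*, is the pattern in point 2 above followed by a comma.

The matched text is replaced by the empty string "". Because [^,] cannot match a comma, the match always stops at the first comma. The space after that comma is kept, as the output above shows.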
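
As a rough sketch of how this pattern might be applied to the column from the question: the data frame `all_tags` and its contents below are made up for illustration, and the extra `\\s*` (not part of the pattern above) is an assumption that the spaces after the first comma should go as well.

```r
# Hypothetical stand-in for the data frame from the question
all_tags <- data.frame(
  Tags = c(
    "OCR - (some text), Variant - (some text), Bad Subtype - (some text)",
    "OCR - (some text), Bad Subtype - (some text)",
    "Other - (some text), Variant - (some text)"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Tags value starts with "OCR"
clean <- subset(all_tags, grepl("^OCR", Tags))

# Drop everything up to and including the first comma, plus any spaces after it
clean$Tag <- sub("^[^,]*,\\s*", "", clean$Tags)

print(clean$Tag)
```

A string without any comma does not match the pattern, so `sub` returns it unchanged.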

Rui Barradas

An option with trimws from base R (the whitespace argument requires R 3.6.0 or later):

trimws(x, whitespace = "^[^,]+,\\s*")

Output:

#[1] "Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"

data

x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text) and my regex is returning: Bad Subtype - (some text) when the desired output is: Variant - (some text), Bad Subtype - (some text)"
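
Since only the start of the string needs trimming here, one possible variation is to pass `which = "left"` so the intent is explicit. This is a sketch on a shortened example string, with illustrative variable names, and is not part of the original answer.

```r
x <- "OCR - (some text), Variant - (some text), Bad Subtype - (some text)"

# which = "left" asks trimws to strip only from the start of the string
left_trimmed <- trimws(x, which = "left", whitespace = "^[^,]+,\\s*")

# The same pattern with sub(), for comparison
with_sub <- sub("^[^,]+,\\s*", "", x)

print(left_trimmed)
print(identical(left_trimmed, with_sub))
```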
akrun