4

I have a large csv with a text column that has a max width of 200. In nearly all cases the data is fine. In some cases, the data is too long or has not quite been filled in properly, i would like to use regex to find the last instance of a specific numeric/character pairing and then remove everything after it.

eg data:

df <- data.frame(ID = c("1","2","3"),
             text = c("A|explain what a is|12.2|Y|explain Y|2.36|",
                 "A|explain what a is|15.2|E|explain E|10.2|E|explain E but run out hal",
                 "D|explain what d is|0.48|Z|explain z but number 5 is present|"))

My specific character pair is any number followed by a |

This would mean Row 1 is fine, row 2 would have everything after '10.2' removed and row 3 would have everything after 0.48 removed

I tried this regex:

df[,2] <- sub("([^0-9]+[^|]*$)", "", df[,2])

It very nearly nearly worked but the very few rows in my data that have a number present in the explanation do not play along. Any clues? I'm not a great regexer yet, learning the ropes

I saw this question about grouping, but couldn't quite apply it to my problem.

Community
  • 1
  • 1
Sam
  • 1,400
  • 13
  • 29

1 Answers1

3

Using sub, we capture one or more characters (.*) followed by one of more numbers, followed by a dot if present (\\.?) followed by one or more numbers as a group followed by | and the rest of the characters until the end of the string. In the replacement, the capture group is specified (\\1).

sub('^(.*[0-9]+\\.?[0-9]+)\\|.*$', '\\1', df$text)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I'm not gonna post a separate answer with almost the same code. I was just asking because I've not found anywhere in the Q the specification for only numbers with decimals and your answer won't work with integers, that's all... – Cath Dec 09 '15 at 12:48
  • @CathG In the third row number 5 is also removed. – akrun Dec 09 '15 at 12:50
  • @CathG Today, for most of the time, I get this stackover javascript not running. So, I am just posting the answer whenever there is a tiny window where it works. Even, I couldn't write the descriptions. So, you may be right,.. – akrun Dec 09 '15 at 12:53
  • @CathG im afraid yours does not work, it removes too much text, but thanks – Sam Dec 09 '15 at 13:01
  • 1
    @akrun that seems to do the job perfectly, and is what i was scrabbling around with earlier but couldnt quite get to, thanks a lot – Sam Dec 09 '15 at 13:02
  • 2
    @Sam, actually, it was supposed to be `sub('(.*[0-9]+)\\|.*', '\\1', df$text)`, I was just commenting on the `\\.[0-9]+` that will only keep decimal numbers – Cath Dec 09 '15 at 13:03
  • 2
    @CathG oh i see, thanks, i actually like yours better! cheers both – Sam Dec 09 '15 at 13:06
  • I like it when answers show some input/ output. Makes it easier to see if they should be up voted! – rbatt Dec 09 '15 at 13:46