How to extract first occurrence of alphabets in a string in R?

Question

I have a character column with values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract the leading letters: "CHELSEAFC", "BARCAFC" and so on. Currently I am using regmatches(x$symbol,regexpr("[A-z]+",x$symbol)) but I get an error:

Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC", "CHELSEAFC", "TOTTENHAMFC", :
  replacement has 11366767 rows, data has 11366772
Calls: $<- -> $<-.data.frame
Execution halted

I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)

Instead of transforming it within the data.frame, work with the vector itself. If you run `z <- regmatches(...)`, check: (1) is `z` still a vector vice a list; (2) try `table(nchar(z))` and see if the string lengths make sense; (3) if a list, it indicates that your assumption on string composition might have a hole in it. (BTW: `sub` is about 3-4x faster than `regmatches(...)`.) — r2evans, Feb 06 '17 at 06:46
Please provide some more code to repro. I guess some of the strings do not start with letters. — Wiktor Stribiżew, Feb 06 '17 at 07:40
Yeah, the problem was in some rows which should have been deleted. Thanks a lot. — Abhimanyu Singh, Feb 06 '17 at 18:26

score 1 · Answer 1 · answered Feb 06 '17 at 06:23

Assuming that we need to extract the non-numeric part, one option is to match one or more digits ([0-9]+) followed by any remaining characters (.*) and replace the match with "":

sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"

Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group

sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"

data

str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

score 1 · Answer 2 · edited May 23 '17 at 12:26

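Instead of [A-z]+, you should use ^[A-Za-z]+. The range A-z also matches several non-letter characters, so it is not a safe way to match letters only. See this answer for more detail on why [A-z] should be avoided: https://stackoverflow.com/a/29771926/4082217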
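A minimal sketch of the difference: in ASCII order, the range A-z also covers the six characters that sit between Z and a ([, \, ], ^, _ and the backtick). The test strings below are made up for illustration, and perl = TRUE is used on the assumption that it keeps ranges in code-point order regardless of locale.

```r
# Hypothetical test strings: one made of symbols only, one starting with letters
s <- c("_^_", "CHELSEAFC17")

# [A-z] also accepts the symbols between Z and a in ASCII
grepl("^[A-z]+", s, perl = TRUE)
# [1] TRUE TRUE

# [A-Za-z] accepts letters only
grepl("^[A-Za-z]+", s, perl = TRUE)
# [1] FALSE  TRUE
```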

answered Feb 06 '17 at 06:24

Mohammad Yusuf


score 0 · Answer 3 · edited May 23 '17 at 10:29

The error occurs because some values in the input vector contain no letters (nor any of the other symbols that [A-z] matches). regmatches returns nothing for an element with no match, so the result has fewer elements than the data frame has rows, and the column assignment fails.
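The length mismatch, and one way to locate the offending rows, can be sketched as follows. The vector symbol is a small stand-in for the question's x$symbol column, and the name no_match is only illustrative.

```r
# Stand-in for x$symbol; the first element has no leading letters
symbol <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

m <- regexpr("^[A-Za-z]+", symbol)  # -1 marks elements with no match
hits <- regmatches(symbol, m)       # non-matching elements are dropped

length(symbol)  # 3
length(hits)    # 2, which is why assigning back to the column fails

# Position and value of the rows that did not match
no_match <- as.integer(m) == -1L
which(no_match)
# [1] 1
symbol[no_match]
# [1] "------"
```

Running the last two lines on the real column should list the rows that need to be fixed or removed.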

What you may do is:

1) Use sub

v <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
sub("^([a-zA-Z]+).*|.*", "\\1", v)
#[1] ""          "CHELSEAFC" "BARCAFC"

# applied to the column from the question:
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)

The ^([a-zA-Z]+).*|.* pattern has two branches. The first matches and captures one or more ASCII letters at the start of the string (^), after which .* consumes the rest of the string; replace [a-zA-Z]+ with [[:alpha:]]+ to match non-ASCII letters as well. If the string does not start with a letter, the second branch (after |) matches the whole string instead. In both cases the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
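To see what the second branch contributes, compare the pattern with and without it. This is a small sketch on made-up input; symbol is a stand-in vector, not the question's data.

```r
symbol <- c("------", "CHELSEAFC17FEB640CE")

# Without the alternative, sub() finds no match in "------" and returns it unchanged
sub("^([a-zA-Z]+).*", "\\1", symbol)
# [1] "------"    "CHELSEAFC"

# With |.* the second branch matches the whole string, and group 1 is empty there
sub("^([a-zA-Z]+).*|.*", "\\1", symbol)
# [1] ""          "CHELSEAFC"
```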

2) If you want to keep NA for the values with no match, use stringr str_extract:

library(stringr)
x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1      <NA>
##    2 CHELSEAFC
##    3   BARCAFC

Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).
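If adding a package is not an option, roughly the same NA-preserving behaviour can be sketched in base R. The helper name extract_leading_letters is hypothetical (it is not part of base R or stringr); it pre-fills the result with NA and writes the matches back only at the positions that matched, so the output always has the same length as the input.

```r
# Hypothetical helper, not part of base R or stringr
extract_leading_letters <- function(s) {
  m <- regexpr("^[A-Za-z]+", s)            # -1 where there is no match
  out <- rep(NA_character_, length(s))     # same length as the input
  found <- !is.na(m) & as.integer(m) != -1L
  out[found] <- regmatches(s, m)           # fill only the matched positions
  out
}

extract_leading_letters(c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE"))
# [1] NA          "CHELSEAFC" "BARCAFC"
```

Because the result length never changes, it can be assigned straight back with x$symbol <- extract_leading_letters(x$symbol).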
