Splitting numerals from string in data frame

Question

I have a data frame in R with a column that looks like this:

Venue
AAA 2001
BBB 2016
CCC 1996
... ....
ZZZ 2007

To make working with the data frame slightly easier, I want to split the Venue column into two columns, Location and Year, like so:

Location Year
AAA      2001
BBB      2016
CCC      1996
...      ....
ZZZ      2007

I have tried various variations of the cSplit() function to achieve this:

df = cSplit(df, "Venue", " ") #worked somewhat, however issues with places with multiple words (e.g. Los Angeles, Rio de Janeiro)
df = cSplit(df, "Venue", "[:digit:]")
df = cSplit(df, "Venue,", "[0-9]+")

None of these worked so far for me. I'd appreciate it if anyone could point me in the right direction.

Please provide a more representative example to include the things that are giving you problems. — Rich Scriven, Nov 10 '16 at 18:16
I'd suggest using `separate` from `tidyr`. `library(tidyr); separate(df, col = Venue, into = c("Location", "Year"), sep = " ")` — Molx, Nov 10 '16 at 18:20
For those leaving comments and answers, OP's comment in his first attempt: `#worked somewhat, however issues with places with multiple words (e.g. Los Angeles, Rio de Janeiro)` is why splitting on a single space will not work. — Rich Scriven, Nov 10 '16 at 18:32
@RichScriven In that case even the marked duplicate isn't the right target. Isn't it? — Ronak Shah, Nov 10 '16 at 18:41

cdeterman · Answer 1 · 2016-11-10T19:27:14.927

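The simplest way would be to use stringr, which is vectorized. The pattern below splits on whitespace only where it is followed by a digit, so multi-word locations stay intact: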

library(stringr)

df[,1:2] <- str_split(df$Venue, pattern = "\\s+(?=\\d)", simplify = TRUE)
colnames(df) <- c('Location', 'Year')

or with str_split_fixed

str_split_fixed(df$Venue, pattern = "\\s+(?=\\d)", 2)

You can also do it with base R (perl = TRUE is needed for the lookahead):

df[,1:2] <- do.call(rbind, strsplit(df$Venue, split = "\\s+(?=\\d)", perl = TRUE))
colnames(df) <- c('Location', 'Year')
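
To check how the lookahead pattern behaves with multi-word locations, here is a small base R sketch. The data frame `venues` and its rows are made up for illustration and are not the OP's data; the pattern is the same one used above.

```r
# Hypothetical sample data; multi-word locations are the problem case.
venues <- data.frame(
  Venue = c("AAA 2001", "Los Angeles 1984", "Rio de Janeiro 2016"),
  stringsAsFactors = FALSE
)

# Split only on whitespace that is immediately followed by a digit,
# so the spaces inside a location name are left alone.
parts <- do.call(rbind, strsplit(venues$Venue, split = "\\s+(?=\\d)", perl = TRUE))

# Add the pieces as named columns; Year is converted from text to integer.
venues$Location <- parts[, 1]
venues$Year <- as.integer(parts[, 2])
venues
```

Assigning by name rather than by position (`df[,1:2]`) leaves any other columns in the data frame untouched. This assumes every venue contains exactly one space-then-digit boundary; a location with a number in its name would need a stricter pattern.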

edited Nov 10 '16 at 19:27

answered Nov 10 '16 at 18:19

This would work if it weren't for some locations having multiple words. Because the pattern is " ", Los Angeles, for instance, would be split between the two columns instead of appearing in one column with its year in the other. – S.Fischer Nov 10 '16 at 18:35
@S.Fischer That wasn't clear from your example, please provide a representative example. – cdeterman Nov 10 '16 at 19:03

Daniel Anderson · Accepted Answer · 2016-11-10T18:53:41.347

How about this?

d <- data.frame(Venue = c("AAA 2001", "BBB 2016", "CCC 1996", "cc d 2001"),
         stringsAsFactors = FALSE)

d$Location <- gsub("[[:digit:]]", "", d$Venue)
d$Year <- gsub("[^[:digit:]]", "", d$Venue)
d
#       Venue Location Year
# 1  AAA 2001     AAA  2001
# 2  BBB 2016     BBB  2016
# 3  CCC 1996     CCC  1996
# 4 cc d 2001    cc d  2001
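
Note that removing the digits leaves the separating space at the end of `Location`, and `Year` stays a character column. A possible follow-up, assuming every venue ends in a year and the location itself contains no digits, is to trim the result and convert the year:

```r
d <- data.frame(Venue = c("AAA 2001", "BBB 2016", "CCC 1996", "cc d 2001"),
                stringsAsFactors = FALSE)

# Drop the digits, then strip the leftover trailing space.
d$Location <- trimws(gsub("[[:digit:]]", "", d$Venue))
# Keep only the digits and store them as integers rather than text.
d$Year <- as.integer(gsub("[^[:digit:]]", "", d$Venue))
str(d)
```

`trimws()` is available in base R from version 3.2.0 onward.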

edited Nov 10 '16 at 18:53

answered Nov 10 '16 at 18:20

This will not work because some locations contain multiple words (e.g. Los Angeles), and the words in these locations are also separated by a space. – S.Fischer Nov 10 '16 at 18:38
I edited it. Hopefully that will work for you. – Daniel Anderson Nov 10 '16 at 18:53
This worked perfectly! Thank you for your help!! – S.Fischer Nov 10 '16 at 19:06
