I found a post that is very similar to my problem. I have a data.frame with keywords separated by semicolons in one column and the year in another column. I would like to unlist the keywords without losing the information about the year.

I can separate the keywords with strsplit and unlist:

keywords <- unlist(strsplit(df$keywords,";"))
l1 <- sapply(df$keywords, length)
Year <- rep(df$Year, l1)
length(Year)
length(keywords)
dfkeywords <- data.frame(Year = Year, Keywords = keywords, stringsAsFactors = FALSE)

but I fail to generate a vector of years that has the same length as the keywords vector.

How do I do that in a smart way?

Best, Pete

PeterGerft

2 Answers

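Calculate the lengths before unlisting the split keywords. In your code, sapply(df$keywords, length) is applied to the unsplit strings, so it returns 1 for every row. Instead, first split the keywords and keep the result as a list: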

keywords = strsplit(df$keywords,";")

then find the lengths (the number of keywords) in each record:

lens = lengths(keywords)

and create the data.frame:

data.frame(Year=rep(df$Year, lens), Keywords=unlist(keywords),
           stringsAsFactors=FALSE)
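
The steps above can be tried end to end on a small example. This is only a sketch: the sample data is made up for illustration, and the column names keywords and Year are taken from the question. It also shows why the attempt in the question produced a Year vector of the wrong length.

```r
# Hypothetical sample data; the column names follow the question.
df <- data.frame(
  keywords = c("some;text", "some;other;text", "even;more;text;here"),
  Year = c(2025, 2026, 2099),
  stringsAsFactors = FALSE
)

# Each element of df$keywords is one unsplit string, so its length is 1.
# unname() drops the names that sapply() adds for character input.
unname(sapply(df$keywords, length))

# Split first and keep the list: one character vector per row.
# fixed = TRUE treats ";" as a literal string rather than a regex.
parts <- strsplit(df$keywords, ";", fixed = TRUE)

# lengths() counts the keywords in each row (here 2, 3 and 4).
lens <- lengths(parts)

# Repeat each year once per keyword, then flatten the list.
dfkeywords <- data.frame(
  Year = rep(df$Year, lens),
  Keywords = unlist(parts),
  stringsAsFactors = FALSE
)
dfkeywords
```

If the keywords column is a factor (the default for data.frame() before R 4.0.0), strsplit needs as.character(df$keywords) instead of df$keywords.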
Martin Morgan

Assuming you have something that looks like this:

df <- data.frame(keywords = c("some;text", "some;other;text", "even;more;text;here"),
                 Year = c(2025, 2026, 2099))
df
#              keywords Year
# 1           some;text 2025
# 2     some;other;text 2026
# 3 even;more;text;here 2099

Then I would suggest using cSplit from my "splitstackshape" package:

library(splitstackshape)
cSplit(df, "keywords", ";", "long")
#    keywords Year
# 1:     some 2025
# 2:     text 2025
# 3:     some 2026
# 4:    other 2026
# 5:     text 2026
# 6:     even 2099
# 7:     more 2099
# 8:     text 2099
# 9:     here 2099

Other approaches to consider would be:

"dplyr" + "tidyr"

library(dplyr)
library(tidyr)
df %>%
  mutate(keywords = strsplit(as.character(keywords), ";")) %>%
  unnest(keywords)

"data.table"

library(data.table)
as.data.table(df)[, list(keywords = unlist(strsplit(as.character(keywords), ";"))), 
                  by = Year]
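
base R

If adding a package is not an option, the same long format can probably be reached with stack() from base R. This is a sketch under the assumption that the data looks like the sample df above; the object names parts and long are made up for illustration. The year is carried along as the list names, so it comes back as a factor and has to be converted.

```r
# Same hypothetical sample data as above.
df <- data.frame(
  keywords = c("some;text", "some;other;text", "even;more;text;here"),
  Year = c(2025, 2026, 2099)
)

# One character vector of keywords per row.
parts <- strsplit(as.character(df$keywords), ";", fixed = TRUE)

# Use the years as list names; they are coerced to character here.
names(parts) <- df$Year

# stack() returns a data.frame with columns "values" and "ind",
# where "ind" is a factor built from the list names.
long <- stack(parts)

# Rebuild the columns with the original names and a numeric Year.
long <- data.frame(
  keywords = long$values,
  Year = as.numeric(as.character(long$ind)),
  stringsAsFactors = FALSE
)
long
```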
A5C1D2H2I1M1N2O1R2T1