I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?