I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.

For example:

a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)

head(df)
  file                                                      text
1    1     Words longer than five characters should be truncated
2    2 Words shorter than five characters should not be modified

And this is what I'm trying to get:

  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

I've tried using strsplit() and strtrim() to modify each word (based in part on the question "Split vectors of words by every n words (vectors are in a list)"):

x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "Words" "short" "than" 
[12] "five"  "chara" "shoul" "not"   "be"    "modif"

But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.

Is there a way to do this using gsub and regex?

Jim

2 Answers

If you're looking to utilize gsub to perform this task:

> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=TRUE)
> df
#   file                                           text
# 1    1     Words longe than five chara shoul be trunc
# 2    2 Words short than five chara shoul not be modif
hwnd
  • This works for the example text, but my real data includes characters such as "ü" (U+00FC) and "ı" (U+0131) and this code isn't working on words containing these characters. If it could be edited to do so, this is a nice one-line solution. – Jim Jun 03 '15 at 22:34
  • that performs as expected except for words beginning with characters such as ş, ı, and İ - they are being cut to 6 characters instead of 5. For example, "ştestword testşword ğtestword testğword İtestword testİword ütestword testüword" becomes "ştestw testş ğtestw testğ İtestw testİ ütest testü". – Jim Jun 04 '15 at 02:23
  • It's because of the word boundary `\b`, try removing it – hwnd Jun 04 '15 at 02:26
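
One possible way to handle the non-ASCII cases raised in these comments, offered as a sketch rather than as the answer author's own fix: capture five letters with `\p{L}` and drop any letters that follow, without using `\b` at all. This assumes the strings are UTF-8 encoded and that R's PCRE build supports Unicode properties; `truncate_letters` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: keep at most the first n letters of every run of letters.
# Assumes UTF-8 input and a PCRE build with Unicode property support.
truncate_letters <- function(x, n = 5) {
  # (\p{L}{n}) captures n letters; \p{L}+ matches the rest of the same run
  pattern <- sprintf("(\\p{L}{%d})\\p{L}+", n)
  # Replace each long run with just its captured first n letters
  gsub(pattern, "\\1", x, perl = TRUE)
}

truncate_letters("Words longer than five characters should be truncated")
truncate_letters("\u015ftestword test\u015fword")
```

Because the pattern only looks at runs of letters, a word containing an apostrophe or digit is treated as separate runs, which may or may not be what you want.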

You were on the right track. For your idea to work, however, you have to do the split/trim/combine for each row separately. Here's one way to do it. I was deliberately verbose to make each step clear, but you can obviously use fewer lines.

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})

And the output:

> df
  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

The short version is

df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")  
})
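
A variant of the same split/trim/combine idea, as a sketch: since strsplit() is itself vectorized and returns one vector of words per element, vapply() can loop over that list and guarantee a plain character result. The function name `truncate_words` and its parameter `n` are hypothetical. Note that strtrim() trims to a display width, so for text with wide characters substr() may be closer to a per-character limit.

```r
# Hypothetical helper: truncate every space-separated word to n characters.
truncate_words <- function(text, n = 5) {
  # strsplit() returns a list with one character vector of words per element
  word_list <- strsplit(text, " ", fixed = TRUE)
  # Trim each word, then glue the words of each element back together
  vapply(word_list,
         function(words) paste(strtrim(words, n), collapse = " "),
         character(1))
}

docs <- data.frame(file = c(1, 2),
                   text = c("Words longer than five characters should be truncated",
                            "Words shorter than five characters should not be modified"),
                   stringsAsFactors = FALSE)
docs$text <- truncate_words(docs$text)
docs
```

Unlike sapply() on a character vector, this returns an unnamed vector, so the column does not carry the original text around as names.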

Edit:

I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})

The regex matches everything after the first 5 characters of each word and replaces it with an empty string. perl = TRUE is necessary to enable the regex lookbehind, (?<=.{5}).
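
The same lookbehind idea can probably be applied to the whole column in one call, without splitting into words first. This is a sketch under the assumption that words are separated by whitespace: `\S` (non-whitespace) replaces `.` so the lookbehind cannot reach across a space into the previous word. The data frame `docs` is just the question's example rebuilt under a different name.

```r
# Rebuild the example data from the question (named docs here for illustration)
docs <- data.frame(file = c(1, 2),
                   text = c("Words longer than five characters should be truncated",
                            "Words shorter than five characters should not be modified"),
                   stringsAsFactors = FALSE)

# (?<=\S{5}) requires 5 non-whitespace characters immediately before the match;
# \S+ then matches the remainder of that word, which is replaced with "".
docs$text <- gsub("(?<=\\S{5})\\S+", "", docs$text, perl = TRUE)
docs
```

gsub() is vectorized over its input, so each row is processed independently and stays attached to its `file` value.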

Molx
  • This does exactly what I need, and I appreciate the two different approaches - thanks! Will upvote when I have enough reputation (and then delete this comment to remove clutter). – Jim Jun 03 '15 at 22:36