
I'm trying to turn a character vector novel.lower.mid into a list of single words. So far, this is the code I've used:

midnight.words.l <- strsplit(novel.lower.mid, "\\W")

This produces a list of all the words. However, it splits everything, including contractions. The word "can't" becomes "can" and "t". How do I make sure those words aren't separated, or that the function just ignores the apostrophe?

Stefano
  • What are your words delimited by? Do you have sample data? – steveb Jan 12 '16 at 02:50
  • @steveb I don't really know what my data is delimited by. Here's a sample part of data: class(novel.lower.mid) [1] "character" novel.lower.mid [1] " book one the perforated sheet i was born in the city of bombay...once upon a time. no, that won't do, there's no getting away from the date: i was born in doctor narlikar's nursing home on august 15th, 1947. and the time? the time matters, too. well then: at night. no, it's important to be more... – Stefano Jan 13 '16 at 01:20

2 Answers


We can use str_extract_all from stringr to extract runs of alphanumeric characters and apostrophes:

library(stringr)
str_extract_all(novel.lower.mid,  "\\b[[:alnum:]']+\\b")

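Or, in base R, split on any non-word character that is not an apostrophe (the negative lookahead (?!') skips apostrophes):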

strsplit(novel.lower.mid, "(?!')\\W", perl=TRUE)
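
As a rough sketch of what the strsplit version returns: adjacent delimiters such as ", " leave empty strings in the result, so you will probably want to drop them. The sample.text string below is a made-up stand-in for the real novel.lower.mid, and the variable names are only for illustration.

```r
# Hypothetical sample; the real novel.lower.mid holds the full novel text.
sample.text <- "no, that won't do, there's no getting away from the date"

# Split on any single non-word character that is not an apostrophe.
words.l <- strsplit(sample.text, "(?!')\\W", perl = TRUE)

# Flatten the list, then drop the empty strings left between
# adjacent delimiters (for example the comma followed by a space).
words <- unlist(words.l)
words <- words[words != ""]
words
```

The contractions "won't" and "there's" stay intact, and the result is a plain character vector rather than a list.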
akrun

If you just want your current "\W" split to leave apostrophes alone, split on a negated character class that excludes both \w and ':

novel.lower.mid <- c("I won't eat", "green eggs and", "ham")
strsplit(novel.lower.mid, "[^\\w']", perl=TRUE)
# [[1]]
# [1] "I"     "won't" "eat"  
# 
# [[2]]
# [1] "green" "eggs"  "and"  
# 
# [[3]]
# [1] "ham"
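
On text with punctuation or leading spaces (like the sample posted in the question's comments), the single-character class yields empty strings between adjacent delimiters. A possible variation, assuming you want those dropped, is to add + so a run of delimiters counts as one split point, and then filter out any leftover empty string at the start. The vector x below is invented for illustration.

```r
# Hypothetical input with a leading space, a comma and trailing dots.
x <- c(" i won't eat", "green eggs, and", "ham...")

# "+" treats a run of delimiters (", " or "...") as one split point.
tokens <- strsplit(x, "[^\\w']+", perl = TRUE)

# A delimiter at the very start still yields a leading "", so remove
# zero-length strings from each element.
tokens <- lapply(tokens, function(w) w[nzchar(w)])
tokens
```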
mathematical.coffee
  • I tried your suggestion, but it never finishes running and I have to force quit RStudio – Stefano Jan 13 '16 at 01:08
  • @Stefano - that must be specific to your particular data then (perhaps there is a lot of it, or there is a specific encoding it is in). You'd have to provide more information! – mathematical.coffee Jan 13 '16 at 01:12
  • Sorry about that! I'm working with a character vector that basically encompasses an entire novel, so about 300 pages' worth of text. Here's how I got to novel.lower.mid: http://i.imgur.com/JF6B8Fh.png – Stefano Jan 13 '16 at 01:25