Splitting text using R

Question

I have a data frame of character variables containing long paragraphs, which I need to split at positions determined by certain phrases. The problem is that in many cases these phrases are merged with the preceding words.

Here is what I am doing:

data  <- readLines(n=2)
= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name 
= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.

df  <- as.data.frame(data)

delim  <- c("= DAY 1 CHALLENGES = ",
            "= DAY 2 CHALLENGES = ",
            "= DAY 3 CHALLENGES = ",
            "= DAY 4 CHALLENGES = ",
            "= DAY 5 CHALLENGES = ",
            "= DAY 6 CHALLENGES = ")

y  <- data.frame(do.call('rbind',
                         strsplit(as.character(df$data), delim, fixed = FALSE)))
y
                               X1
1                                
2 = DAY 1 CHALLENGES = very high.
                                                                                    X2
1 syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name 
2                               Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.

I would like to get each = DAY x CHALLENGES = segment with the text until the next such segment as a separate variable.

Thanks!

Update with proposed methods:

> a  <- scan(file ="~/Desktop/alm/a.txt", what="")
Read 1 item
> a
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. = DAY 5 CHALLENGES = Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 = DAY 6 CHALLENGES = Bla."
> b  <- scan(file ="~/Desktop/alm/b.txt", what="")
Read 1 item
> b
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ?= DAY 6 CHALLENGES = Bla."
> c <- c(a,b)
> df  <- as.data.frame(c)
> lst <- strsplit(gsub(" (?=\\= DAY)", ".", c, perl=TRUE), 
+                 '(?<=[.)])(?=\\=)', perl=TRUE)
> out <-  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
Warning message:
In (function (..., deparse.level = 1)  :
  number of rows of result is not a multiple of vector length (arg 2)
> out
     [,1]                                                                                                                                                                                                                                                                                
[1,] "very high."                                                                                                                                                                                                                                                                        
[2,] "Rank understand."                                                                                                                                                                                                                                                                  
[3,] "buy...."                                                                                                                                                                                                                                                                           
[4,] "result.."                                                                                                                                                                                                                                                                          
[5,] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
[6,] "Bla."                                                                                                                                                                                                                                                                              
     [,2]              
[1,] "very high."      
[2,] "Rank understand."
[3,] "buy...."         
[4,] "Bla." #this is not the value from the input file           
[5,] "very high." #this is missing in the input file, yet a value is getting output      
[6,] "Rank understand." #incorrect recognition of ?= DAY 6 CHALLENGES =; the same happens with := and != or similar

Problems are indicated in the comments. An indication of a missing value would be more useful than having an unrelated value inserted in its place.

You will have to do better than that, please give us some example input, and some example output. — Mike Wise, Mar 14 '15 at 13:16
You could refer this link http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — akrun, Mar 14 '15 at 15:00
@StefanPetkov Updated again the post. As I mentioned earlier, I already spent some time with the post. If this doesn't work, you may not have provided all the patterns as I suggested.... — akrun, Mar 15 '15 at 14:36
I think it works, with the exception of the cases when there is ":= DAY x =". Then it fails. However "?= DAY x =" is picked up fine now. — Stefan Petkov, Mar 15 '15 at 21:09

akrun · Answer 1 · 2015-03-15T14:34:17.413

Maybe this helps:

library(stringr)
str_extract_all(df$data, '= [A-Za-z]+ \\d+ [A-Za-z]+ = [A-Za-z ]+(\\.+| ;-\\)| \\.+| +)')
#[[1]]
#[1] "= DAY 1 CHALLENGES = syndicated." "= DAY 2 CHALLENGES = Red Sea."   
#[3] "= DAY 3 CHALLENGES = framework."  "= DAY 4 CHALLENGES = Did ;-)"    
#[5] "= DAY 5 CHALLENGES = Paste ..."   "= DAY 6 CHALLENGES = Name "      

#[[2]]
#[1] "= DAY 1 CHALLENGES = very high."      
#[2] "= DAY 2 CHALLENGES = Rank understand."
#[3] "= DAY 3 CHALLENGES = buy...."         
#[4] "= DAY 4 CHALLENGES = result."         
#[5] "= DAY 5 CHALLENGES = coffee."         
#[6] "= DAY 6 CHALLENGES = Bla."

Or using strsplit

 lst <- strsplit(as.character(df$data), '(?<=[.)])(?=\\=)', perl=TRUE)
 lst
 #[[1]]
 #[1] "= DAY 1 CHALLENGES = syndicated." "= DAY 2 CHALLENGES = Red Sea."   
 #[3] "= DAY 3 CHALLENGES = framework."  "= DAY 4 CHALLENGES = Did ;-)"    
 #[5] "= DAY 5 CHALLENGES = Paste ..."   "= DAY 6 CHALLENGES = Name "      

 #[[2]]
 #[1] "= DAY 1 CHALLENGES = very high."      
 #[2] "= DAY 2 CHALLENGES = Rank understand."
 #[3] "= DAY 3 CHALLENGES = buy...."         
 #[4] "= DAY 4 CHALLENGES = result."         
 #[5] "= DAY 5 CHALLENGES = coffee."         
 #[6] "= DAY 6 CHALLENGES = Bla."

If you want to extract just the text (syndicated., very high., etc.):

  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
  #       [,1]          [,2]              
  #[1,] "syndicated." "very high."      
  #[2,] "Red Sea."    "Rank understand."
  #[3,] "framework."  "buy...."         
  #[4,] "Did ;-)"     "result."         
  #[5,] "Paste ..."   "coffee."         
  #[6,] "Name "       "Bla."
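As a side note on why the attempt in the question behaves the way it does: when `split` has length greater than 1, `strsplit()` recycles it along `x`, so each string is split with only one of the delimiters. A minimal sketch of splitting on the marker itself with a single regular expression follows. The names `x`, `marker`, `parts`, and `texts` are illustrative and not from the original post, and the input is a shortened copy of the question's data.

```r
# Toy input modelled on the question's data (shortened to three days).
x <- c(
  "= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.",
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy...."
)

# One pattern covers every day number, so no vector of delimiters is needed.
marker <- "= DAY [0-9]+ CHALLENGES = "

# Splitting on the marker consumes it; the first piece is whatever precedes
# the first marker (an empty string here), so it is dropped.
parts <- strsplit(x, marker)
texts <- lapply(parts, function(p) p[-1])

# One column per input string, one row per segment.
res <- do.call(cbind, texts)
res
```

Because the split is on the marker rather than on the punctuation before it, this does not depend on which character happens to precede `= DAY`. It does assume every string contains the same number of markers; otherwise `cbind()` recycles values.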

Update

Based on the updated string "a"

  lst <- strsplit(gsub(" (?=\\= DAY)", ".", a, perl=TRUE), 
                         '(?<=[.)])(?=\\=)', perl=TRUE)
  out <-  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
  out[,1]
  #[1] "very high."                                                                                                                                                                                                                                                                        
  #[2] "Rank understand."                                                                                                                                                                                                                                                                  
  #[3] "buy...."                                                                                                                                                                                                                                                                           
  #[4] "result.."                                                                                                                                                                                                                                                                          
  #[5] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
  #[6] "Bla."

Update2

I tried again on c (changed the object name to c1, as c is a function in R):

  c1 <- c(a,b)
  c2 <- gsub("( |\\?)(?=\\= DAY)|\\.com. (?=DAY)", " .", c1, perl=TRUE)
  lst <- strsplit(c2, '(?<=[.)])(?=(\\=|DAY))', perl=TRUE)
  lst2 <- lapply(lst, function(x) unname(unlist(tapply(x,
      gsub('.*?DAY (\\d+).*', '\\1', x), FUN=paste, collapse= ' '))))
  out <- do.call(cbind,lapply(lst2, function(x)
       sub('^=[^=:]+(\\=|:) ', '', sub('^(?=DAY)', '= ', x, perl=TRUE))))

  out[,1]
  #[1] "very high."                                                                                                                                                                                                                                                                      
  #[2] "Rank understand."                                                                                                                                                                                                                                                                
  #[3] "buy...."                                                                                                                                                                                                                                                                         
  #[4] "result. ."                                                                                                                                                                                                                                                                       
  #[5] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc . DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ."
  #[6] "Bla."                                                                                               

 out[,2]
 #[1] "very high."                                                                                                                                                                             
 #[2] "Rank understand."                                                                                                                                                                       
 #[3] "buy...."                                                                                                                                                                                
 #[4] "result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc ."                                                                                                       
 #[5] "Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5  ."
 #[6] "Bla."
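If some entries lack a marker altogether, one possible way to keep the columns aligned is to read the day number out of each marker and place the text in the matching column, leaving `NA` where a day is absent. The following is only a sketch under assumptions: the helper name `align_days`, its `days` argument, and the toy input are hypothetical, and it assumes the markers always have the exact form `= DAY <n> CHALLENGES = `.

```r
# Hypothetical helper: one row per input string, one column per day,
# NA where a day's marker is missing from that string.
align_days <- function(x, days = 1:6) {
  x <- as.character(x)                      # in case a factor column is passed
  pat <- "= DAY [0-9]+ CHALLENGES = "
  out <- matrix(NA_character_, nrow = length(x), ncol = length(days),
                dimnames = list(NULL, paste0("day", days)))
  for (i in seq_along(x)) {
    # Markers actually present in this string, and their day numbers.
    markers <- regmatches(x[i], gregexpr(pat, x[i]))[[1]]
    day <- as.integer(gsub("[^0-9]", "", markers))
    # Text following each marker; the piece before the first marker is dropped.
    txt <- strsplit(x[i], pat)[[1]][-1]
    length(txt) <- length(markers)          # pad with NA if the last marker has no text
    idx <- match(day, days)
    ok <- !is.na(idx)                       # ignore day numbers outside `days`
    out[i, idx[ok]] <- txt[ok]
  }
  out
}

# Toy input: the first string lacks days 3 to 5, the second lacks day 2.
toy <- c(
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.?= DAY 6 CHALLENGES = Bla.",
  "= DAY 1 CHALLENGES = a.:= DAY 3 CHALLENGES = b."
)
res <- align_days(toy)
res
```

Since the split happens on the marker itself, a preceding `?` or `:` does not prevent the split; that character simply stays at the end of the previous segment. Text such as `DAY 5 CHALLENGE:` that does not match the full marker is left inside whichever segment contains it.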

I guess I am already getting annoying, but I will continue asking. The last update works, but it only showed me that I have missing bits in my data. For example, if = DAY 6 CHALLENGES = is missing in one entry then the extraction order gets messed up for the whole data set...Can that be remedied somehow? Also, the `strsplit()` method seems to almost work, but it gets broken when there is := DAY... or ?= DAY... — Stefan Petkov, Mar 15 '15 at 09:40
@StefanPetkov Please do update your post with all the possible scenarios in your original post. In that way, it is easier to fix at one time rather than getting surprises every time. — akrun, Mar 15 '15 at 11:00
