I have a data frame of character variables containing long paragraphs, which I need to split at positions determined by certain phrases. The problem is that in many cases these phrases are merged with the preceding words.
Here is what I am doing:
data <- readLines(n=2)
= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name
= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.
df <- as.data.frame(data)
delim <- c("= DAY 1 CHALLENGES = ",
           "= DAY 2 CHALLENGES = ",
           "= DAY 3 CHALLENGES = ",
           "= DAY 4 CHALLENGES = ",
           "= DAY 5 CHALLENGES = ",
           "= DAY 6 CHALLENGES = ")
y <- data.frame(do.call('rbind',
                        strsplit(as.character(df$data), delim, fixed = FALSE)))
y
X1
1
2 = DAY 1 CHALLENGES = very high.
X2
1 syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name
2 Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.
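A likely reason for this output: when `split` has more than one element, strsplit() recycles it along the input, so the first delimiter is applied only to the first string and the second delimiter only to the second string. The small sketch below shows that behaviour; the values of `x` and `delims` are made up for illustration.

```r
# Two identical inputs and two different delimiters (illustrative values).
x <- c("a-b_c", "a-b_c")
delims <- c("-", "_")

# `split` is recycled along `x`: "-" is used for x[1], "_" for x[2].
res <- strsplit(x, delims, fixed = TRUE)
print(res)
```

Each string is split by only one of the delimiters, which matches what happens to rows 1 and 2 above.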
I would like to get each = DAY x CHALLENGES = segment, together with the text up to the next such segment, as a separate variable.
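One possible way to get that layout, sketched under the assumption that every row starts with a header and contains the same number of headers: split on a single regular expression that matches any day number, instead of a vector of fixed delimiters. The names `x`, `header` and `parts` are only illustrative.

```r
# The two sample lines from above.
x <- c(
  "= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name",
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla."
)

# One pattern for all headers; the day number is matched by [0-9]+.
header <- "= DAY [0-9]+ CHALLENGES = "
parts <- strsplit(x, header)

# Each string starts with a header, so the first piece is "" and is dropped.
# (This would discard real text if a row did not start with a header.)
parts <- lapply(parts, function(p) p[-1])

# Bind the rows; this assumes every row has the same number of segments.
y <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(y) <- paste0("day", seq_len(ncol(y)))
y
```

Because the pattern matches the header itself, it does not matter whether the header is glued to the preceding word or punctuation.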
Thanks!
Update with proposed methods:
> a <- scan(file ="~/Desktop/alm/a.txt", what="")
Read 1 item
> a
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. = DAY 5 CHALLENGES = Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 = DAY 6 CHALLENGES = Bla."
> b <- scan(file ="~/Desktop/alm/b.txt", what="")
Read 1 item
> b
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ?= DAY 6 CHALLENGES = Bla."
> c <- c(a,b)
> df <- as.data.frame(c)
> lst <- strsplit(gsub(" (?=\\= DAY)", ".", c, perl=TRUE),
+ '(?<=[.)])(?=\\=)', perl=TRUE)
> out <- do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
Warning message:
In (function (..., deparse.level = 1) :
number of rows of result is not a multiple of vector length (arg 2)
> out
[,1]
[1,] "very high."
[2,] "Rank understand."
[3,] "buy...."
[4,] "result.."
[5,] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
[6,] "Bla."
[,2]
[1,] "very high."
[2,] "Rank understand."
[3,] "buy...."
[4,] "Bla." # this is not the value from the input file
[5,] "very high." # this segment is missing in the input file, yet a value is output
[6,] "Rank understand." # ?= DAY 6 CHALLENGES = is not recognized correctly; the same happens with := and != or similar
The problems are marked in the comments above. An indicator of a missing value (for example NA) would be more useful than a seemingly random value being inserted.
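The extra values in column 2 are not random: cbind() recycles the shorter vector (4 pieces instead of 6), which is also what the warning reports. A possible way to get NA instead is to read the day number out of each header and place the text in the column for that day. The sketch below assumes the headers always have the form = DAY n CHALLENGES =; the function name `split_days`, the argument `n_days` and the shortened input strings are made up for illustration.

```r
# Place the text following each "= DAY n CHALLENGES =" header in column n;
# days that do not occur in a string stay NA.
split_days <- function(x, n_days = 6) {
  pat <- "= DAY [0-9]+ CHALLENGES = ?"
  m <- gregexpr(pat, x)
  heads  <- regmatches(x, m)                 # the headers that were found
  bodies <- regmatches(x, m, invert = TRUE)  # the text between the headers
  out <- matrix(NA_character_, nrow = length(x), ncol = n_days,
                dimnames = list(NULL, paste0("day", seq_len(n_days))))
  for (i in seq_along(x)) {
    if (length(heads[[i]]) == 0) next        # no header at all: row stays NA
    days <- as.integer(gsub("[^0-9]", "", heads[[i]]))
    txt  <- trimws(bodies[[i]][-1])          # drop the text before the first header
    keep <- days >= 1 & days <= n_days
    out[i, days[keep]] <- txt[keep]
  }
  out
}

# Shortened inputs modelled on a and b: the second has no DAY 2 header and
# its last header is preceded by "?" rather than "." or ")".
x <- c(
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand. = DAY 3 CHALLENGES = buy....",
  "= DAY 1 CHALLENGES = very high. DAY 3 CHALLENGE: text ?= DAY 3 CHALLENGES = Bla."
)
res <- split_days(x, n_days = 3)
res
```

Since only the full header is matched, inline phrases such as DAY 5 CHALLENGE: are left inside the text, and the character in front of the header (?, :, ! and so on) no longer matters.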