I am struggling to remove unwanted escape sequences, split text into paragraphs, and then apply an ifelse-style flag to a dataframe. I look forward to your help. Thank you.
For each Text in the dataframe, I want to search only the first paragraph for a set of search words. If a word is present, enter 1, else 0.
Below is the table.
data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"),
Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017 he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.",
"\\n\\t\\t\\t\\t\\t \\n \\n The soccer world cup is entralling. \\nEveryone acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA,
-15L), class = "data.frame")
For each entry in the Text column, these are the words I am searching for:
library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "ocean", "glamor showcases")
I have tried the following:
Removing the unwanted escape sequences.
When I try to remove "\t" and "\n", I get the following warning:
data1<-data %>% mutate(Text=gsub("\\t",Text,""))
Warning message: In gsub("\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used
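The warning appears to come from argument order: `gsub()` expects `gsub(pattern, replacement, x)`, so in the call above `Text` is taken as the replacement and `""` as the string to search. Below is a minimal sketch of a corrected call. It assumes the Text values contain literal backslash-t and backslash-n character pairs (as the `dput` output suggests), and it uses a made-up `txt` vector in place of the real column.

```r
# Toy input standing in for the Text column (an assumption, not the real data):
# literal backslash-n / backslash-t pairs plus one real blank line.
txt <- "\\n\\t\\t Publication Date\\n\\t first line.\n\nSecond paragraph."

# gsub() takes (pattern, replacement, x). fixed = TRUE matches the
# two-character sequences literally instead of as regex escapes.
clean <- gsub("\\t", "", txt, fixed = TRUE)
clean <- gsub("\\n", " ", clean, fixed = TRUE)

# Collapse runs of spaces/tabs but keep real newlines, then trim the ends.
clean <- trimws(gsub("[ \t]+", " ", clean))
```

Inside `mutate()` the same calls would take the column as the third argument, e.g. `mutate(Text = gsub("\\t", "", Text, fixed = TRUE))`. Whether a literal backslash-n should become a space or a paragraph break depends on the data; a space is assumed here.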
Split by paragraph
data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")
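As an alternative to tokenizing every paragraph, the first paragraph could be pulled out directly in base R. This is a sketch under assumptions: paragraphs are taken to be separated by a blank line (two or more real newlines), `first_paragraph()` is a hypothetical helper name, and the sample vector is made up.

```r
# Hypothetical helper: return the text before the first blank line.
first_paragraph <- function(x) {
  # as.character() also handles factor columns such as Text.
  parts <- strsplit(as.character(x), "\n{2,}")
  # Empty strings yield no pieces, so fall back to "" for them.
  vapply(parts, function(p) if (length(p) > 0) p[1] else "", character(1))
}

sample_text <- c("A b.\n\nC d.", "E\nf.", "")
first_par <- first_paragraph(sample_text)
```

A single newline is not treated as a paragraph break here, and text that starts with a blank line would give an empty first paragraph, so trimming leading whitespace first is probably needed.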
If a word is present, then 1, else 0. The desired final table:
finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"),
Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017 he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.",
"\\n\\t\\t\\t\\t\\t \\n \\n The soccer world cup is entralling. \\nEveryone acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
), class = "factor"), field = structure(c(2L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor"), country = structure(c(3L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor"), glamor.showcases = structure(c(2L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor")), .Names = c("ID", "Text", "field",
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")
Any help would be appreciated. Thank you.
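The flagging step described above could be sketched in base R as follows. This is only an illustration under assumptions: `first_par` is a made-up vector standing in for the already-extracted first paragraphs, `flag_words()` is a hypothetical helper name, and matching is plain case-sensitive substring matching.

```r
# Hypothetical helper: one 0/1 integer column per search word.
flag_words <- function(text, words) {
  flags <- lapply(words, function(w) {
    as.integer(grepl(w, text, fixed = TRUE))
  })
  # make.names() turns "glamor showcases" into "glamor.showcases".
  names(flags) <- make.names(words)
  as.data.frame(flags)
}

words <- c("field", "ocean", "glamor showcases")

# Made-up first paragraphs (assumption, not the real data).
first_par <- c(
  "The world is an amazing place. Not just in one country.",
  "Everyone is on the field. The glitz and glamor showcases reflects the game."
)

flags_df <- flag_words(first_par, words)
result <- cbind(data.frame(Text = first_par, stringsAsFactors = FALSE), flags_df)
```

Because the match is on substrings, "field" would also match "fields"; a regex with word boundaries, or lower-casing both sides with `tolower()`, may be preferable depending on the data.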