
I am struggling to remove unwanted escape sequences with a regex, split text into paragraphs, and then apply ifelse to a data frame. I look forward to your help. Thank you.

I want to search only the first paragraph of each Text entry in the data frame for a set of search words. If a word is present, enter 1, otherwise 0.

Below is the data frame.

data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA, 
-15L), class = "data.frame")

For each entry in the Text column, these are the words I am searching for:

library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "country", "glamor showcases")

I have tried the following:

Removing unwanted escape sequences.

When I try to remove "\t" and "\n", I get the following warning:

data1<-data %>% mutate(Text=gsub("\\t",Text,""))

Warning message: In gsub("\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used

Split by paragraph

data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")

If a word is present, then 1, else 0. This is the final table I expect:

finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor"), field = structure(c(2L, 3L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), country = structure(c(3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), glamor.showcases = structure(c(2L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor")), .Names = c("ID", "Text", "field", 
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")

Any help would be appreciated. Thank you.

I have looked at the following resources:

  1. Count word occurrences in R

  2. How to find that a word/words in a column is present in another column consisting a sentence [duplicate]

  3. Split by paragraph in R

  4. Split text file into paragraph files in R

Beginner

1 Answer


You can try this, assuming that a new paragraph in df$Text starts with \n\n (a blank line):

# Check whether the first paragraph of df$Text contains each string in `words`
words_df <- do.call(cbind, lapply(words, function(x) 
  as.numeric(grepl(x, gsub("\n\n.*$", "", df$Text), ignore.case = TRUE))))
colnames(words_df) <- words

# Combine the 0/1 columns with the original data frame
final_df <- cbind(df, words_df)

which gives

> final_df[, -(1:2)]
  field country glamor showcases
1     0       1                0
2     1       0                1
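
On the warning in the question: gsub() takes its arguments in the order pattern, replacement, x, so the text to clean has to come third. Below is a minimal sketch of the cleanup step, assuming the unwanted sequences are a literal backslash followed by "n" or "t" (as in the sample data) and that real newlines should be kept for paragraph splitting. The vector txt is a made-up stand-in for the Text column.

```r
# Hypothetical example: the strings mimic the literal "\n" and "\t"
# sequences (backslash + letter) found in the question's Text column.
txt <- c("\\n\\t\\t Publication Date\\n\\t first paragraph.\n\nsecond paragraph.",
         "\\n \\n   The soccer world cup. \\nEveryone on the field.")

# gsub(pattern, replacement, x): the text to clean is the THIRD argument.
# fixed = TRUE matches backslash + "t" literally instead of as a regex.
clean <- gsub("\\t", "", txt, fixed = TRUE)
clean <- gsub("\\n", " ", clean, fixed = TRUE)

# Collapse runs of spaces only; real newlines are left alone so that
# "\n\n" can still mark a paragraph break.
clean <- trimws(gsub(" +", " ", clean))
```

Inside a pipeline the same call would look like mutate(Text = gsub("\\t", "", Text, fixed = TRUE)); gsub() coerces a factor column to character.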


Sample data:

df <- structure(list(ID = structure(2:3, .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(2:3, .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = 1:2, class = "data.frame")

words<-c("field", "country", "glamor showcases")
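
The one-liner above does two things at once, so here is a step-by-step sketch of the same idea on a small made-up input (txt and flags are illustrative names, not part of the original code). It assumes at least two texts, since vapply() returns a plain vector rather than a matrix for a single text.

```r
# Hypothetical input: only the first paragraph of each text is searched.
txt   <- c("Any country you live in.\n\nIt is in the ocean.",
           "Everyone on the field. The glitz and glamor showcases the game.")
words <- c("field", "country", "glamor showcases")

# Step 1: keep everything before the first blank line ("\n\n").
first_par <- sub("\n\n.*$", "", txt)

# Step 2: build one 0/1 column per word; vapply() names the columns
# after the elements of `words`.
flags <- vapply(words,
                function(w) as.integer(grepl(w, first_par, ignore.case = TRUE)),
                integer(length(txt)))
```

The resulting matrix can be attached to the original data frame with cbind(), as in the answer's code.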
Prem