Text processing and analysis in R

Question

I am starting to analyze, in RStudio, an interview I conducted. As usual, the interview consists of the interviewer's questions and the subject's answers.

text<- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

I would like to remove all the interviewer's questions so that I can analyze the subject's answers. I do not know how to proceed in R; in fact, I do not even know what exactly to Google.

I would appreciate all the help I can get. Thank you in advance.

Is your data in a data frame or in a vector? – TarJae, Jan 04 '23 at 17:02

score 1 · Answer 1 · answered Jan 04 '23 at 17:15

base R:

text<- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

this gives you

text
[1] "Interviewer: Hello, how are you?\nSubject: I am fine, thanks.\n\nInterviewer: What is your name?\nSubject: My name is Gerard."

where each \n is a newline character that you can split on with strsplit():

strsplit(text, '\n')[[1]] # strsplit returns a list
[1] "Interviewer: Hello, how are you?" "Subject: I am fine, thanks."     
[3] ""                                 "Interviewer: What is your name?" 
[5] "Subject: My name is Gerard."
text2 <- strsplit(text, '\n')[[1]] # keep the character vector, not the list

text2[c(2,5)]
[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."
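
The positions 2 and 5 above are specific to this short example. For a longer interview, a minimal sketch that selects the subject's lines by their prefix instead (assuming every answer sits on a single line that starts with "Subject:"; the names lines, subject_lines and answers are only illustrative) could look like this:

```r
text <- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

# Split the single string into one element per line
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

# Keep only the lines that begin with the subject's label
# (assumption: each answer fits on one line)
subject_lines <- lines[startsWith(lines, "Subject:")]

# Optionally drop the label and surrounding whitespace
answers <- trimws(sub("^Subject:", "", subject_lines))
```

Here subject_lines keeps the "Subject:" prefix, while answers holds only the spoken text. Answers that span several lines would need a different rule, since only the first line carries the label.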

score 0 · Answer 2 · TarJae · 2023-01-04T17:20:04.790

If your data is stored in a character vector text, as indicated in the question, try the following. With as_tibble we transform the vector into a tibble (roughly equivalent to a data frame), then we separate the rows at \n, and finally we filter out the interviewer's lines and the empty lines:

library(dplyr)
library(tidyr)

text <- as_tibble(text) %>% 
  separate_rows(value, sep="\n") %>% 
  filter(!grepl("Interviewer", value) & value!="") %>% 
  pull(value)
text

[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."
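
If the interview is read from a plain-text file rather than typed in as a string, readLines() already returns a character vector with one element per line, so no splitting is needed. The following is only a sketch: it writes the sample interview to a temporary file (the path is a stand-in for a real file) and applies the same filter condition in base R.

```r
# Stand-in for a real interview file; tempfile() is used only so the
# example runs anywhere
path <- tempfile(fileext = ".txt")
writeLines(c("Interviewer: Hello, how are you?",
             "Subject: I am fine, thanks.",
             "",
             "Interviewer: What is your name?",
             "Subject: My name is Gerard."),
           path)

# One element per line of the file
interview <- readLines(path)

# Same condition as the filter() call: drop interviewer lines and empty lines
answers <- interview[!grepl("^Interviewer:", interview) & interview != ""]
```

This assumes a plain .txt file. A Word .docx file is not plain text, so readLines() would not return the interview lines from it directly.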

edited Jan 04 '23 at 17:20

answered Jan 04 '23 at 17:06

TarJae


Thank you for your quick response. I have been a bit imprecise, though. I will be importing interviews on txt Word files. Does the text still count as a vector? Pardon my coding expression illiteracy. – Janez Gorenc Jan 04 '23 at 17:15

`dput(my_word_text_example)`, and copy `structure(...)` above into your question as data. – Chris Jan 04 '23 at 17:18

For future questions please have a look here: – TarJae Jan 04 '23 at 17:21

score 0 · Answer 3 · answered Jan 04 '23 at 18:21

An approach using strsplit and sub/gsub.

text_new <- gsub("\n", "", sub(".*(Subject: )", "\\1", 
              unlist(strsplit(text, "Interviewer: "))))
text_new[nchar(text_new) > 0]
[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."

First, split the string on "Interviewer: ".
Each resulting piece still begins with the interviewer's question, so remove everything up to "Subject: " with sub.
Remove the remaining newlines with gsub.
Finally, select the non-empty strings.
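
The four steps can also be written one at a time, which makes the intermediate results easier to inspect. This is a sketch of the same approach with illustrative variable names (pieces, answers), not a different method:

```r
text <- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

# 1. Split on the interviewer label. The first piece is an empty string
#    because the text starts with the label.
pieces <- unlist(strsplit(text, "Interviewer: "))

# 2. In each piece, drop everything before "Subject: " (the question text).
#    With R's default regex engine, "." also matches the newline.
answers <- sub(".*(Subject: )", "\\1", pieces)

# 3. Remove the newlines left at the end of each piece
answers <- gsub("\n", "", answers)

# 4. Keep only the non-empty strings
answers <- answers[nchar(answers) > 0]
```

Printing pieces after step 1 shows why step 2 is needed: each non-empty piece holds a question followed by its answer.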
