I have documents such as:

President Dr. Norbert Lammert: I declare the session open.

I will now give the floor to Bundesminister Alexander Dobrindt.

(Applause of CDU/CSU and delegates of the SPD)

Alexander Dobrindt, Minister for Transport and Digital Infrastructure:

Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.

(Volker Kauder [CDU/CSU]: Genau!)

(Applause of the CDU/CSU and the SPD)

When I read those .txt documents, I would like to create a second column indicating the speaker's name.

What I tried was to first create a list of all possible names and mark each one with an @ prefix:

library(qdap)

members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")

prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)

prok <- as.data.frame(prok)
prok$speaker <- grepl("@[^\\@:]*:", prok$prok, ignore.case = TRUE)

My plan was then to extract the name between @ and : via regex wherever speaker is TRUE and fill it downwards until a different name appears (and to remove all applause/interjection brackets), but I am not sure how to do that.

David Arenburg
erocoar
  • The reason this question got no attention is mistagging. Why tag it with RStudio (a completely irrelevant tag with almost no followers) rather than the R tag (the relevant one, with almost 50K followers)? – David Arenburg Dec 16 '16 at 07:06
  • Thank you for pointing that out! I did not know, will definitely pay attention to that in the future :) – erocoar Dec 16 '16 at 11:09

3 Answers

Here is the approach:

      library(qdap)
      # `text` holds the document text

      # remove round brackets and the text between them
      a <- bracketX(text, "round")

      names <- c("President Dr. Norbert Lammert", "Alexander Dobrindt")
      searchString <- paste(names[1], names[2], sep = ".+")

      # get the string from names[1] up to names[2] with the help of searchString
      string <- regmatches(a, regexpr(searchString, a))

      # remove names[2] from the string
      string <- gsub(names[2], "", string)

This code can be looped when there are more than two names.
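For more than two names, one way to avoid enumerating every possible speaker order is to match all names with a single alternation pattern and cut the text at each match. The following is only a sketch under assumptions: the sample string `a` and the `result` data frame are illustrative, and every speaker name is assumed to be followed by a colon.

```r
# Hypothetical input: the bracket-stripped document as one string
a <- paste(
  "President Dr. Norbert Lammert: I declare the session open.",
  "Alexander Dobrindt: Ladies and Gentlemen.",
  "President Dr. Norbert Lammert: Thank you."
)
names <- c("President Dr. Norbert Lammert", "Alexander Dobrindt")

# One pattern that matches any known name followed by a colon.
# Assumption: the names contain no regex metacharacters other than ".",
# which still matches a literal dot here.
pattern <- paste0("(", paste(names, collapse = "|"), "):")

# Locate every speaker header in the text
m <- gregexpr(pattern, a)[[1]]
starts <- as.integer(m)
ends <- starts + attr(m, "match.length") - 1

# The speaker is the matched header without its trailing colon
speaker <- sub(":$", "", substring(a, starts, ends))

# Each speech runs from just after one header to just before the next
speech <- trimws(substring(a, ends + 1, c(starts[-1] - 1, nchar(a))))

result <- data.frame(speaker = speaker, text = speech,
                     stringsAsFactors = FALSE)
```

Because the pattern does not depend on the order of speakers, the same code should work across documents as long as the list of names is complete.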

Sourabh
  • Thank you for your help! Unfortunately, if I have multiple documents and more names I can't really apply every possible order to them (?), do you know how I could go about that? – erocoar Dec 16 '16 at 13:08
  • In my opinion, that is the only way unless there is a constant separator/symbol after each speaker's text. If '(' is used after each speaker's text, you can take all the text before that symbol and check it against each speaker name to see whose name is mentioned. – Sourabh Dec 19 '16 at 04:43

This seems to work:

library(qdap)

members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")

testprok <- read.table("txt",header=FALSE,quote = "\"",comment.char="",sep="\t")

testprok$V1 <- mgsub(members,members_r,testprok$V1)

testprok$V2 <- ifelse(grepl("@[^\\@:]*:",testprok$V1),testprok$V1,NA)       

####function from http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value         
repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
  ind = which(!is.na(x))      # get positions of nonmissing values
  if(is.na(x[1]))             # if it begins with a missing, add the 
    ind = c(1,ind)        # first position to the indices
  rep(x[ind], times = diff(   # repeat the values at these indices
    c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

testprok$V2 = repeat.before(testprok$V2)
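As a variation, the fill-down step can also be written with `cumsum` indexing, and the bare name can be extracted at the same time so that the new column does not contain the whole header line. This is a minimal base-R sketch, not the code I actually ran; the `lines` vector and the variable names are made up for illustration.

```r
# Hypothetical lines, already marked with "@" as in the code above
lines <- c(
  "@President Dr. Norbert Lammert: I declare the session open.",
  "I will now give the floor to Bundesminister Alexander Dobrindt.",
  "@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
  "Ladies and Gentleman."
)

# TRUE for lines that start with a marked speaker header
is_header <- grepl("^@[^@:]*:", lines)

# Keep only the text between "@" and the first ":" on header lines
name <- ifelse(is_header, sub("^@([^@:]*):.*$", "\\1", lines), NA)

# cumsum() counts headers seen so far; index 0 maps to a leading NA
speaker <- c(NA, name[is_header])[cumsum(is_header) + 1]

out <- data.frame(speaker = speaker, text = lines, stringsAsFactors = FALSE)
```

Lines that appear before the first header get `NA` as speaker, which matches the behaviour of `repeat.before`.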
erocoar

Here is an approach leaning heavily on dplyr.

First, I added a sentence to your sample text to illustrate why we can't just use a colon to identify speaker names.

sampleText <-
"President Dr. Norbert Lammert: I declare the session open.

I will now give the floor to Bundesminister Alexander Dobrindt.

(Applause of CDU/CSU and delegates of the SPD)

Alexander Dobrindt, Minister for Transport and Digital Infrastructure:

Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.

(Volker Kauder [CDU/CSU]: Genau!)

(Applause of the CDU/CSU and the SPD)

This sentence right here: it is an example of a problem"

I then split the text to simulate the format that it appears you are reading it in (which also puts each speech in a part of a list).

splitText <- strsplit(sampleText, "\n")

Then I pull out all of the potential speakers (anything that precedes a colon):

library(dplyr)  # provides %>% and the verbs used below

allSpeakers <- lapply(splitText, function(thisText){
  grep(":", thisText, value = TRUE) %>%
    gsub(":.*", "", .) %>%
    gsub("\\(", "", .)
}) %>%
  unlist() %>%
  unique()

Which gives us:

[1] "President Dr. Norbert Lammert"                                        
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"                                              
[4] "This sentence right here" 

Obviously, the last one is not a legitimate name, so should be excluded from our list of speakers:

legitSpeakers <-
  allSpeakers[-4]
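Dropping by position only works for this one sample. With many documents, it may be easier to keep the candidates that match a list of known members. This is a sketch under the assumption that such a list is available; `knownSurnames` is a made-up vector, not something from your data.

```r
# Candidates as extracted above
allSpeakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure",
  "Volker Kauder [CDU/CSU]",
  "This sentence right here"
)

# Hypothetical list of known member surnames
knownSurnames <- c("Lammert", "Dobrindt", "Kauder")

# Keep every candidate that mentions at least one known surname
surnamePattern <- paste(knownSurnames, collapse = "|")
legitSpeakers <- allSpeakers[grepl(surnamePattern, allSpeakers)]
```

This gives the same three names as `allSpeakers[-4]`, but it does not depend on where the false positive happens to fall.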

Now we are ready to work through the speech. I have included stepwise comments in the code instead of describing the steps here:

speechText <- lapply(splitText, function(thisText){

  # Remove applause and interjections (things in parentheses)
  # along with any blank lines; though you could leave blanks if you want
  cleanText <-
    grep("(^\\(.*\\)$)|(^$)", thisText
         , value = TRUE, invert = TRUE)

  # Split each line by a colon
  strsplit(cleanText, ":") %>%
    lapply(function(x){
      # Check if the first element is a legit speaker
      if(x[1] %in% legitSpeakers){
        # If so, set the speaker, and put the statement in a separate portion
        # taking care to re-collapse any breaks caused by additional colons
        out <- data.frame(speaker = x[1]
                          , text = paste(x[-1], collapse = ":"))
      } else{
        # If not a legit speaker, set speaker to NA and reset text as above
        out <- data.frame(speaker = NA
                          , text = paste(x, collapse = ":"))
      }
      # Return whichever version we made above
      return(out)
    }) %>%
    # Bind all of the rows together
    bind_rows %>%
    # Identify clusters of speech that go with a single speaker
    mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
    # Group by those clusters
    group_by(speakingGroup) %>%
    # Collapse that speaking down into a single row
    summarise(speaker = speaker[1]
              , fullText = paste(text, collapse = "\n"))
})

This yields

[[1]]

speakingGroup  speaker                                                                fullText                                                                                                                                                                                                                                        

            1  President Dr. Norbert Lammert                                          I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.                                                                                                                                                     
            2  Alexander Dobrindt, Minister for Transport and Digital Infrastructure  Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem

If you prefer to have each line of text separately, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one line for each line of the speech, like this:

speaker                                                                text                                                                                                                                                                                      speakingGroup
President Dr. Norbert Lammert                                          I declare the session open.                                                                                                                                                                           1
President Dr. Norbert Lammert                                          I will now give the floor to Bundesminister Alexander Dobrindt.                                                                                                                                       1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure                                                                                                                                                                                                        2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure  Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.              2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure  This sentence right here: it is an example of a problem                                                                                                                                               2
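The per-line variant can be sketched on its own as follows. The small `perLine` data frame is a made-up stand-in for the result of `bind_rows` in the pipeline above, so treat this as an illustration rather than a drop-in replacement.

```r
library(dplyr)

# Hypothetical stand-in for the bound rows: speaker is NA except on header lines
perLine <- data.frame(
  speaker = c("President Dr. Norbert Lammert", NA, "Alexander Dobrindt", NA),
  text = c("I declare the session open.",
           "I will now give the floor to Bundesminister Alexander Dobrindt.",
           "",
           "Ladies and Gentleman."),
  stringsAsFactors = FALSE
)

lineLevel <- perLine %>%
  # Number the clusters of lines that belong to one speaker
  mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
  group_by(speakingGroup) %>%
  # Copy the cluster's first (non-NA) speaker onto every line in it
  mutate(speaker = speaker[1]) %>%
  ungroup()
```

The rows are not collapsed here, so `lineLevel` keeps one row per line of the speech.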
Mark Peterson