I'm trying to create a data frame from a TEI-XML version of Moby Dick using Hadley Wickham's xml2
package. I want the data frame to ultimately look like this (for all the words in the novel):
df <- data.frame(
  chapter = c("1", "1", "1"),
  words = c("call", "me", "ishmael"))
I'm able to get pieces, but not the whole thing. Here's what I've got so far:
library("xml2")
library("magrittr") # provides the %>% pipe used below
# Read file
melville <- read_xml("data/melville.xml")
# Get chapter divs (remember, doesn't include epilogue)
chap_frames <- xml_find_all(melville, "//d1:div1[@type='chapter']", xml_ns(melville))
This gives us a list with a length of 134 (that is, one element per chapter). We can get the chapter number for a specific element as follows:
xml_attr(chap_frames[[1]], "n")
We can get the paragraphs of a specific chapter (that is, minus the chapter heading) as follows:
words <- xml_find_all(chap_frames[[1]], ".//d1:p", xml_ns(melville)) %>% # remember doesn't include epilogue
  xml_text()
And we can get the words of the chapters as follows:
# Split words function
split_words <- function (ll) {
  result <- unlist(strsplit(ll, "\\W+"))
  result <- result[result != ""]
  tolower(result)
}
# Apply function
words <- split_words(words)
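To make the behaviour of the splitting step concrete, here is a small standalone check. The two input strings are made up for illustration; the function body is the same as above.

```r
# Same helper as above, repeated so this snippet runs on its own
split_words <- function(ll) {
  result <- unlist(strsplit(ll, "\\W+"))  # split on runs of non-word characters
  result <- result[result != ""]          # drop empty strings left by leading punctuation
  tolower(result)
}

# Hypothetical input: a character vector with one element per paragraph
out <- split_words(c("Call me Ishmael.", "Some years ago--never mind how long"))
print(out)
```

The result is a single flat character vector, so the paragraph boundaries (and anything else about where a word came from) are gone after this step.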
What I can't figure out is how to get the chapter number for each of the words. I had a toy example that worked:
mini <- read_xml(
'
<div1 type="chapter" n="1" id="_75784">
<head>Loomings</head>
<p rend="fiction">Call me Ishmael.</p>
<p rend="fiction">There now is your insular city of the Manhattoes, belted round by wharves as Indian isles by coral reefs- commerce surrounds it with her surf.</p>
</div1>
')
# Function
process_chap <- function(div){
  chapter <- xml_attr(div, "n")
  words <- xml_find_all(div, "//p") %>%
    xml_text()
  data.frame(chapter = chapter,
             word = split_words(words))
}
process_chap(mini)
But it doesn't work for the longer example:
process_chap2 <- function(div){
  chapter <- xml_attr(div, "n")
  words <- xml_find_all(div, ".//d1:p", xml_ns(melville)) %>% # remember doesn't include epilogue
    xml_text()
  data.frame(chapter = chapter,
             word = split_words(words))
}
# Fails because there are more words than chapter names
df <- process_chap2(chap_frames)
# Gives the words from every p in the document (not grouped by chapter); chapter numbers are `NULL`.
df2 <- process_chap2(melville)
(I know why the toy example works but the Melville ones don't; I included it to show what I'm trying to do.) I'm guessing I might need a loop of some sort, but I'm not sure where to begin. Any suggestions?
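For reference, the loop idea could be sketched roughly as follows: call the per-chapter function once for each chapter node and stack the results. This is only a sketch on a made-up two-chapter document without a namespace (the `doc` object and the second chapter's text are invented for illustration); for the real file the XPath expressions would presumably need the `d1:` prefix and `xml_ns(melville)` as in the code above.

```r
library(xml2)

# Hypothetical toy input with two chapters and no default namespace
doc <- read_xml('<body>
  <div1 type="chapter" n="1"><head>Loomings</head><p>Call me Ishmael.</p></div1>
  <div1 type="chapter" n="2"><head>The Carpet-Bag</head><p>I stuffed a shirt or two.</p></div1>
</body>')

split_words <- function(ll) {
  result <- unlist(strsplit(ll, "\\W+"))
  result <- result[result != ""]
  tolower(result)
}

# Process ONE chapter node: its number plus the words of its paragraphs.
# ".//p" keeps the search inside the current chapter node.
process_one <- function(div) {
  words <- split_words(xml_text(xml_find_all(div, ".//p")))
  data.frame(chapter = rep(xml_attr(div, "n"), length(words)),
             word = words,
             stringsAsFactors = FALSE)
}

chapters <- xml_find_all(doc, "//div1[@type='chapter']")

# One data frame per chapter, then bind them row-wise into a single frame
frames <- lapply(seq_along(chapters), function(i) process_one(chapters[[i]]))
df <- do.call(rbind, frames)
print(df)
```

Using `rep()` for the chapter column means a chapter with no paragraphs yields an empty data frame instead of an error from mismatched column lengths.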
PS: I'm not entirely sure if I should link to an xml version of Moby Dick I found on Github, but you can find it easily enough searching for melville1.xml.