I have an HTML representation of a novel that looks like this:
<p class="intro">Peter said "Hello, son." The boy looked around in shock.</p>
<p class="secondary">"Who's there?!", he yelled, <span class="emphasis">terrified</span>.</p>
I want to (A) count the number of words that appear in the novel, and (B) count the number of words that appear in dialogue (that is, between quotation marks). Obviously I want to exclude the HTML tags from both counts. I also want to account for the fact that a book might use " " for its quotation marks, or “ ”, or ‘ ’, or even ' ' (which might be impossible, given apostrophes).
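To make the two counts concrete, here is a rough sketch of what I mean, operating on text that has already had its tags removed. The function names (`countWords`, `countDialogueWords`) are just placeholders I made up, and the sketch assumes that a "word" is a run of letters or digits with optional internal apostrophes, and that dialogue never nests or spans an unclosed quote. It only handles " " and “ ”; single quotes are deliberately left out because of the apostrophe problem.

```javascript
// Assumption: a word is a run of Unicode letters/digits, optionally joined
// by internal apostrophes (straight or curly), e.g. "Who's" counts as one word.
const WORD = /[\p{L}\p{N}]+(?:['\u2019][\p{L}\p{N}]+)*/gu;

// Assumption: dialogue is text between a matching pair of straight double
// quotes or curly double quotes. Single-quote styles are not handled here.
const DIALOGUE = /"([^"]*)"|\u201C([^\u201D]*)\u201D/gu;

// (A) Count every word in a plain-text string.
function countWords(text) {
  const matches = text.match(WORD);
  return matches ? matches.length : 0;
}

// (B) Count only the words that fall inside quotation marks.
function countDialogueWords(text) {
  let total = 0;
  for (const m of text.matchAll(DIALOGUE)) {
    // Exactly one of the two capture groups is defined per match.
    total += countWords(m[1] ?? m[2] ?? '');
  }
  return total;
}

// The sample paragraphs above, with the tags already stripped.
const sample =
  'Peter said "Hello, son." The boy looked around in shock. ' +
  '"Who\'s there?!", he yelled, terrified.';

console.log(countWords(sample), countDialogueWords(sample));
```

A quote that opens in one paragraph and is never closed (as some books do for multi-paragraph speech) would throw this off, so I suspect the real thing needs to be smarter than a single regex.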
This is in a web-browser environment, if it matters, and the novel is stored as an array of strings. (The user supplies an EPUB file, which is a zipped set of HTML files, each of which gets read into a UTF-8 string.)
I could use regular expressions to do this, but is that the sensible solution? Given that I'm doing it in a web browser, would it be smarter to use the DOM Parser API in some way? Are there existing tools for this sort of parsing that I might not be aware of?
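For reference, the DOM-based route I have in mind would look roughly like the sketch below. `extractText` and `extractAll` are hypothetical names of my own, and the parser is passed in as a parameter only so the functions can be exercised outside a browser; in the browser I would pass `new DOMParser()`. I am assuming that parsing as `text/html` is acceptable even though EPUB content is often XHTML, and that `textContent` on the body is close enough to "the text without tags" (it would also include the contents of any `<script>` or `<style>` elements inside the body).

```javascript
// Sketch: strip the tags from one HTML string by parsing it into a document
// and reading the body's text. `parser` is expected to behave like a
// DOMParser, i.e. expose parseFromString(html, mimeType).
function extractText(html, parser) {
  const doc = parser.parseFromString(html, 'text/html');
  return doc.body ? doc.body.textContent : '';
}

// Sketch: the novel is an array of HTML strings (one per file in the EPUB),
// so extract each one and join them with newlines so that words at file
// boundaries do not run together.
function extractAll(htmlStrings, parser) {
  return htmlStrings.map((html) => extractText(html, parser)).join('\n');
}

// Browser usage (assumes `chapters` is the array of UTF-8 strings):
//   const text = extractAll(chapters, new DOMParser());
```

The resulting plain text could then be fed to whatever word-counting logic turns out to be sensible, so the regex would only have to deal with quotation marks rather than markup.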