I have an HTML representation of a novel that looks like this:

<p class="intro">Peter said "Hello, son." The boy looked around in shock.</p>

<p class="secondary">"Who's there?!", he yelled, <span class="emphasis">terrified</span>.</p>

I want to (A) count the number of words that appear in the novel, and (B) count the number of words that appear in dialogue (that is, between quotation marks). Obviously I want to exclude the HTML tags from both counts. I also want to account for the fact that a book might use " " for its quotation marks, or “ ”, or ‘ ’, or even ' ' (which might be impossible, given apostrophes).

This is in a web-browser environment, if it matters, and the novel is stored as an array of strings. (The user supplies an EPUB file, which is a zipped set of HTML files, each of which gets read into a UTF-8 string.)

I could use regular expressions to do this, but is that the sensible solution? Given that I'm doing it in a web browser, would it be smarter to use the DOM Parser API in some way? Are there existing tools for this sort of parsing that I might not be aware of?

GreenTriangle
  • You remove all the markup by asking an XHTML parser to parse the document (because EPUB does not contain HTML, it contains XHTML) and then give you the collapsed text content, so that you can count words, and then you write (or find and use) a tokeniser that knows how to deal with your particular novel's quotation rules. What you _can't_ do is ["use regular expressions"](https://stackoverflow.com/a/133684/740553). This is exactly the kind of thing they cannot be used for. – Mike 'Pomax' Kamermans May 21 '20 at 23:44
  • (As for "are there tools?", that's mostly a matter of applying some elbow grease to searching the web. Finding tools to count words is trivial, finding tools that let you count words "inside quotes" is harder, but rephrase the problem: if you first isolate all quotes, then counting the words in what's left does the same, but now you have an alternative thing to search for) – Mike 'Pomax' Kamermans May 21 '20 at 23:50

1 Answer

I think the DOM parser is the way to go. Here's a tidbit to get you started...

<html><head>

</head><body>

<div id='novel'>
  <p class="intro">Peter said "Hello, son." The boy looked around in shock.</p>
  <p class="secondary">"Who's there?!", he yelled, <span class="emphasis">terrified</span>.</p>
</div>

<script>

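// Log the text content of every <p> inside the element with the given id.
function extractText( domId ) {
  const parentDom = document.getElementById( domId );
  // Live HTMLCollection of all descendant <p> elements.
  const paragraphs = parentDom.getElementsByTagName( 'p' );

  for ( const p of paragraphs ) {
    // textContent drops the markup (e.g. the <span>) and keeps only the text.
    const sentence = p.textContent;
    console.log( `Extracted String: ${sentence}` );
  }
}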

extractText( 'novel' );

</script>

</body></html>

In short, it gathers all the <p> elements under the container into a collection, and then iterates over them, pulling out the textContent of each. From there, I would use replace to standardize the quotation marks (i.e., change occurrences of " ", “ ”, ‘ ’, and ' ' to, say, ~), then use split( '~' ) to separate the quoted text from the unquoted text, and finally use split( ' ' ) to break each piece into words, allowing you to count the words inside and outside of quote marks. Be careful with ' and ’, since they also serve as apostrophes (as in "Who's").
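The replace-and-split idea could be sketched roughly as follows. This is only an illustration under some assumptions: the names countWords, countParagraph, and MARK are made up for this example, the ~ placeholder is assumed never to occur in the novel itself, and only double quotation marks (" and “ ”) are treated as dialogue delimiters, so that apostrophes are left alone. It also splits on any whitespace rather than a single space, and skips tokens that are pure punctuation.

```javascript
// Placeholder character; assumes '~' never appears in the novel's own text.
const MARK = '~';

// Count whitespace-separated tokens that contain at least one letter or digit,
// so a stray token such as "," is not counted as a word.
function countWords(text) {
  return text.split(/\s+/).filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
}

// Returns { total, dialogue } word counts for one paragraph's text content.
function countParagraph(text) {
  // Normalise straight and curly double quotes to the placeholder, then split.
  const segments = text.replace(/["“”]/g, MARK).split(MARK);
  let total = 0;
  let dialogue = 0;
  segments.forEach((segment, i) => {
    const n = countWords(segment);
    total += n;
    // Segments at odd indexes sit between an opening and a closing quote.
    if (i % 2 === 1) dialogue += n;
  });
  return { total, dialogue };
}

// The two sample paragraphs, as textContent would return them.
const sample = [
  'Peter said "Hello, son." The boy looked around in shock.',
  '"Who\'s there?!", he yelled, terrified.',
];
const counts = sample.map(countParagraph);
console.log(counts);
```

In the browser, the strings passed to countParagraph would be the textContent values from the loop above. Note that this sketch assumes quotes are balanced within each paragraph; dialogue that runs across several paragraphs without a closing mark would need extra handling.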

Hope this helps...

Trentium