The overall question has been asked and answered in a few places: http://www.resolvinghere.com/sof/18408799.shtml
How to get all text between just two specified tags using BeautifulSoup?
But when I try to implement it, I end up with really cumbersome strings.
My setup: I'm trying to pull transcript text from the Presidential debates, and I thought I'd start here: http://www.presidency.ucsb.edu/ws/index.php?pid=111500
I can isolate just the transcript with
transcript = soup.find_all("span", class_="displaytext")[0]
The formatting of the transcript isn't ideal. Every few lines of text are wrapped in a <p>, and a change in speakers is denoted with a nested <b>, e.g.:
<p><b>TRUMP:</b> First of all, I have to say, as a businessman, I get along with everybody. I have business all over the world. [<i>booing</i>]</p>,
<p>I know so many of the people in the audience. And by the way, I'm a self-funder. I don't have — I have my wife and I have my son. That's all I have. I don't have this. [<i>applause</i>]</p>,
<p>So let me just tell you, I get along with everybody, which is my obligation to my company, to myself, et cetera.</p>,
<p>Obviously, the war in Iraq was a big, fat mistake. All right? Now, you can take it any way you want, and it took — it took Jeb Bush, if you remember at the beginning of his announcement, when he announced for president, it took him five days.</p>,
<p>He went back, it was a mistake, it wasn't a mistake. It took him five days before his people told him what to say, and he ultimately said, "It was a mistake." The war in Iraq, we spent $2 trillion, thousands of lives, we don't even have it. Iran has taken over Iraq, with the second-largest oil reserves in the world.</p>,
<p>Obviously, it was a mistake.</p>,
<p><b>DICKERSON:</b> So...</p>
But like I said, this is not a new problem: define a start and end tag, iterate through the elements, and as long as the current element is not the end tag, add its text.
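The approach described above can be sketched roughly as follows. This is only a minimal illustration, not the linked answers' exact code: the HTML is a shortened, made-up sample in the same shape as the transcript, and `text_between` is a hypothetical helper name. It assumes the `bs4` package is installed and walks `next_elements` from the start tag until it reaches the end tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened sample in the same shape as the transcript markup.
html = """
<span class="displaytext">
<p><b>TRUMP:</b> First of all, I get along with everybody.</p>
<p>Obviously, it was a mistake.</p>
<p><b>DICKERSON:</b> So...</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
transcript = soup.find_all("span", class_="displaytext")[0]


def text_between(start_tag, end_tag):
    """Join the text that appears after start_tag and before end_tag."""
    pieces = []
    for element in start_tag.next_elements:
        if element is end_tag:
            break  # reached the next speaker marker
        # next_elements yields tags and strings; keep only the strings
        if not isinstance(element, NavigableString):
            continue
        # skip the speaker label itself (the string inside start_tag)
        if element.parent is start_tag:
            continue
        text = element.strip()
        if text:  # drop whitespace-only strings between paragraphs
            pieces.append(text)
    return " ".join(pieces)


bolds = transcript.find_all("b")
speaker = bolds[0].get_text()
speech = text_between(bolds[0], bolds[1])
print(speaker, speech)
```

Note that each string is appended as a whole item, so the list holds one entry per text node rather than one per character.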
So I'm testing on a single element to get the details right.
startTag = transcript.find_all('b')[165]
endTag = transcript.find_all('b')[166]
content = []
content += startTag.string
content
And the result I get is [u'R', u'U', u'B', u'I', u'O', u':'] instead of [u'RUBIO:'].
What am I missing?
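One likely explanation, sketched here without BeautifulSoup: `+=` on a list behaves like `list.extend`, which iterates over its right-hand side, and iterating over a string yields single characters. Since `startTag.string` is a string-like object, the same thing would happen there. The variable names below are illustrative only.

```python
# += on a list extends it with each item of the iterable on the right;
# a string iterates character by character.
extended = []
extended += "RUBIO:"

# append adds the whole string as a single list item.
appended = []
appended.append("RUBIO:")

# Wrapping the string in a list before += has the same effect as append.
wrapped = []
wrapped += ["RUBIO:"]

print(extended)
print(appended)
print(wrapped)
```

Under that assumption, replacing `content += startTag.string` with `content.append(startTag.string)` should give the single-item list.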