I am writing code to take scraped HTML and turn it into word counts and, eventually, a word cloud. I am using Beautiful Soup to read the HTML file.
I would like to pull the strings from all the h2 tags.
To test getting a string I used:
title = wheatleySoup.h2.text
print(type(title))
print(title)
which returned:
<class 'str'>
TO M AE C E N A S.
However, when I add that title to a list:
wheatleyPoems = []
wheatleyPoems.append(title)
print(type(wheatleyPoems[0]))
print(wheatleyPoems)
I get the following:
<class 'str'>
['\n TO\xa0\xa0M AE C E N A S.\n ']
Why are these different? Is there a way to get the nice raw str (as shown in the first return) in a list? I know I can get the bs4.element.NavigableStrings into a list, but I'd like to just have str types if possible.
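One possible explanation, sketched here with a hard-coded string rather than the real soup: print() on a str writes the characters themselves, while print() on a list shows the repr() of each element, which spells out newlines and non-breaking spaces as the escape sequences \n and \xa0. The variable names below are illustrative, and the string literal is an assumption copied from the list output above.

```python
# Assumption: this literal stands in for wheatleySoup.h2.text.
title = '\n TO\xa0\xa0M AE C E N A S.\n '

poems = [title]

# print(str) writes the characters themselves, so the newlines and
# non-breaking spaces show up as ordinary-looking whitespace.
print(title)

# print(list) shows repr() of each element, so the very same characters
# appear as the escape sequences \n and \xa0.
print(poems)

# The element stored in the list is the same str object, unchanged.
same_string = poems[0] is title
list_display = str(poems)
print(same_string)
```

If this is what is happening, the two values are identical and only the display differs; the surrounding whitespace is present in both cases.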
The above was just testing the code and figuring out the data types. I will eventually need all the titles, so I will be using .find_all().
I have already built this code:
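wheatleyTitles = []
titles = wheatleySoup.find_all('h2')
for item in titles:
    # .string is the tag's NavigableString; str() converts it to a plain str
    item = str(item.string)
    print(item)
    wheatleyTitles.append(item)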
This takes each bs4.element.NavigableString and turns it into a str. But then I have a list that looks like this:
['\n TO\xa0\xa0M AE C E N A S.\n ', '\n O N\xa0\xa0V I R T U E.\n ', '\n TO THE UNIVERSITY OF CAMBRIDGE, IN NEW-ENGLAND.\n ', '\n TO THE KING’S MOST EXCELLENT MAJESTY. 1768.\n ', '\n ON BEING BROUGHT FROM AFRICA TO AMERICA.\n ', '\n ON THE DEATH OF THE REV. DR. SEWELL, 1769.\n
Instead of the nice clean strings I get when I print wheatleySoup.h2.text directly (TO M AE C E N A S., as shown above).
I have also used Beautiful Soup's .string to store the navigable strings in a list, but I would prefer to have str instead.
I have also created a list as shown above and written a regex to clean it. I'm just wondering whether I can cut out that step.
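A minimal sketch of one way to skip the regex step, assuming the goal is to trim the ends and collapse runs of whitespace (including the non-breaking spaces) into single spaces. str.split() with no arguments splits on any Unicode whitespace, so joining the pieces back with one space normalizes each title. The function name clean_title is hypothetical, and the sample list is copied from the output above.

```python
def clean_title(raw):
    # split() with no arguments splits on any run of Unicode whitespace,
    # which includes newlines and non-breaking spaces, and drops the
    # leading and trailing whitespace along the way.
    return " ".join(raw.split())


# Assumption: these literals stand in for the strings pulled from the soup.
raw_titles = [
    '\n TO\xa0\xa0M AE C E N A S.\n ',
    '\n O N\xa0\xa0V I R T U E.\n ',
    '\n ON BEING BROUGHT FROM AFRICA TO AMERICA.\n ',
]

clean_titles = [clean_title(t) for t in raw_titles]
print(clean_titles)
# ['TO M AE C E N A S.', 'O N V I R T U E.', 'ON BEING BROUGHT FROM AFRICA TO AMERICA.']
```

With the real soup, the same function could be applied to item.get_text() for each tag returned by find_all('h2'), since get_text() returns a plain str. Note that this turns the double non-breaking space into a single ordinary space, which may or may not be wanted for the word counts.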