I've been having trouble with what appear to be hidden newline characters in strings extracted with the BeautifulSoup .find function. My code scans an HTML document and pulls out name, title, company, and country as strings. I checked their types and they are all strings, and when I print them individually and check their lengths, everything looks normal. But when I use them either in
print("%s is a %s at %s in %s" % (name,title,company,country))
or in
outputWriter.writerow([name,title,company,country])
(to write to a CSV file), I get extra line breaks that did not appear to be in the strings.
What's going on? Or can anyone point me in the right direction?
I'm new to Python and not sure where to look up everything I don't know, so I'm asking here after spending all day trying to fix the problem. I've searched Google and several other Stack Overflow questions on stripping hidden characters, but nothing seems to work.
import csv
from bs4 import BeautifulSoup
# Open/create csvfile and prep for writing
csvFile = open("attendees.csv", 'w+', encoding='utf-8')
outputWriter = csv.writer(csvFile)
# Open HTML and Prep BeautifulSoup
html = open('WEB SUMMIT _ LISBON 2016 _ Web Summit Featured Attendees.html', 'r', encoding='utf-8')
bsObj = BeautifulSoup(html.read(), 'html.parser')
itemList = bsObj.find_all("li", {"class":"item"})
outputWriter.writerow(['Name','Title','Company','Country'])
for item in itemList:
    name = item.find("h4").get_text()
    print(type(name))
    title = item.find("strong").get_text()
    print(type(title))
    company = item.find_all("span")[1].get_text()
    print(type(company))
    country = item.find_all("span")[2].get_text()
    print(type(country))
    print("%s is a %s at %s in %s" % (name,title,company,country))
    outputWriter.writerow([name,title,company,country])
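
One way to check whether the extracted strings really contain hidden whitespace is to print them with repr() instead of printing them directly, since repr() shows newlines and tabs as escape sequences. The sketch below is only an illustration under assumptions: the sample value and the clean() helper are hypothetical and not taken from the actual HTML file, and it uses only the standard library so it can be run on its own.

```python
import csv
import io


def clean(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


# Hypothetical value, shaped like what get_text() can return when the
# HTML source has line breaks and indentation inside the tag.
name = "\n        Jane Doe\n    "

print(name)               # the surrounding whitespace is easy to miss here
print(repr(name))         # repr() makes the \n escapes and spaces visible
print(repr(clean(name)))  # 'Jane Doe'

# Write the raw and the cleaned value to an in-memory CSV to compare them.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([name, "raw"])
writer.writerow([clean(name), "cleaned"])
csv_text = buffer.getvalue()
print(repr(csv_text))     # the raw field is quoted and spans several lines
```

If repr() does reveal stray whitespace, the same cleanup could be applied to each field in the loop before formatting or writing it. BeautifulSoup's get_text() also accepts strip=True, which strips leading and trailing whitespace from each piece of text it collects.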