
I've been having trouble with what appear to be hidden newline characters in strings obtained via the BeautifulSoup .find function. My code scans an HTML document and pulls out name, title, company, and country as strings. I checked their types and they are all str, and when I print them individually and check their lengths they look like normal strings. But when I use them in either print("%s is a %s at %s in %s" % (name,title,company,country)) or outputWriter.writerow([name,title,company,country]) to write to a CSV file, I get extra line breaks that did not appear to be in the strings.

What's going on? Or can anyone point me in the right direction?

I'm new to Python and not sure where to look up everything I don't know, so I'm asking here after spending all day trying to fix the problem. I've searched Google and several other Stack Overflow posts on stripping hidden characters, but nothing seems to work.

import csv
from bs4 import BeautifulSoup

# Open/create csvfile and prep for writing
csvFile = open("attendees.csv", 'w+', encoding='utf-8')
outputWriter = csv.writer(csvFile)

# Open HTML and Prep BeautifulSoup
html = open('WEB SUMMIT _ LISBON 2016 _ Web Summit Featured Attendees.html', 'r', encoding='utf-8')
bsObj = BeautifulSoup(html.read(), 'html.parser')
itemList = bsObj.find_all("li", {"class":"item"})

outputWriter.writerow(['Name','Title','Company','Country'])

for item in itemList:
    name = item.find("h4").get_text()
    print(type(name))
    title = item.find("strong").get_text()
    print(type(title))
    company = item.find_all("span")[1].get_text()
    print(type(company))
    country = item.find_all("span")[2].get_text()
    print(type(country))
    print("%s is a %s at %s in %s" % (name,title,company,country))
    outputWriter.writerow([name,title,company,country])
Thom Wiggers
gsears
  • I solved my problem by trying one more filter: def filter_non_printable(str): return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9]) – gsears Aug 30 '16 at 21:26
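
For readers who want to try the filter from the comment above, here is a runnable sketch of it. The parameter is renamed to text so it does not shadow the built-in str, and the sample string is made up for illustration. The filter keeps tabs and every character with a code point above 31, so newlines and carriage returns are dropped.

```python
def filter_non_printable(text):
    """Drop ASCII control characters (code points 0-31) except tab."""
    return "".join(c for c in text if ord(c) > 31 or ord(c) == 9)


# Hypothetical input containing a newline, a carriage return and a tab
sample = "Jane\nDoe\r\tCEO"
print(repr(filter_non_printable(sample)))
```

Note that this joins the text on either side of a removed newline without a space, which may or may not be what you want.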

1 Answer

Most likely you need to strip the whitespace. Nothing in your code adds it, so it must already be in the strings:

outputWriter.writerow([name.strip(),title.strip(),company.strip(),country.strip()])
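
To see why stripping matters for the CSV output, here is a small self-contained sketch using only the standard library. The field values are made up, padded the way get_text() might plausibly return them, and rows_as_csv is a hypothetical helper that writes to an in-memory buffer instead of a file.

```python
import csv
import io


def rows_as_csv(rows):
    """Write rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical values with stray whitespace around the visible text
raw = ["\nJane Doe\n", "  CEO ", "\nExample Corp\n", "Portugal\n"]
clean = [field.strip() for field in raw]

# repr() makes the embedded newlines and the added quoting visible
print(repr(rows_as_csv([raw])))
print(repr(rows_as_csv([clean])))
```

With the unstripped values the csv module quotes every field that contains a newline and writes the newline inside the quotes, which is what shows up as extra line breaks when the file is opened. The stripped row comes out as a single line.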

You can verify what is there by looking at the repr output:

print("%r is a %r at %r in %r" % (name,title,company,country))

When you print, you see the str output, so if there is a newline you may not realise it is there:

In [8]: s = "string with newline\n"

In [9]: print(s)
string with newline


In [10]: print("%r" % s)
'string with newline\n'

See also: difference between str and repr in Python

If the newlines are actually embedded in the body of the strings, you will need to replace them, e.g. name.replace("\n", " ")
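
If the strings may contain both surrounding and embedded whitespace, one option is to split on whitespace and rejoin with single spaces. This is a sketch of an alternative to calling strip and replace separately, not something from the original code; collapse_whitespace is a hypothetical helper name and the sample value is made up.

```python
def collapse_whitespace(text):
    """Trim the ends and turn every run of whitespace into one space."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, and discards leading and trailing whitespace.
    return " ".join(text.split())


# Hypothetical value with newlines around and inside the visible text
name = "\n  Jane\n   Doe \r\n"
print(repr(collapse_whitespace(name)))  # 'Jane Doe'
```

It could be applied to each field before the row is written, for example outputWriter.writerow([collapse_whitespace(f) for f in (name, title, company, country)]).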

Padraic Cunningham
  • Thanks! As I said in my last comment, I tried one more solution and found that it worked. I'm still not sure of the hows or whys of everything yet, but I'm slowly learning. Thanks again! – gsears Aug 31 '16 at 15:37