Unknown errors in webcrawler-dictionary (Python, modules: beautifulsoup4, operator, requests)

Question

I am a beginner at Python and have written a program that is meant to crawl a website (one that sells things) and print the frequency of the different words in the titles of the items on sale.

There are three functions in my program:
1) A function that takes the text of the website and refines it into a string.
2) A function that takes that string and cleans it up, getting rid of things like brackets, commas, asterisks etc.
3) A function that takes the cleaned string and sorts the words by how many times they appear on the website.
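
For illustration, steps 2 and 3 could look roughly like the sketch below. It is not the code from the question: the names `clean_words` and `count_words` and the sample titles are hypothetical, and it assumes that "cleaning" means keeping only letters, digits and apostrophes.

```python
import operator
import re


def clean_words(words):
    """Lowercase each word and strip punctuation/symbols, dropping empties."""
    cleaned = []
    for word in words:
        # Assumption: anything other than a-z, 0-9 or an apostrophe is noise.
        word = re.sub(r"[^a-z0-9']", "", word.lower())
        if word:
            cleaned.append(word)
    return cleaned


def count_words(words):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # itemgetter(1) sorts on the count; reverse=True puts the highest first.
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)


# Hypothetical item titles standing in for the crawled text.
titles = ["Wine glasses (set of 6)", "Crystal wine decanter*", "Glass vase, blue"]
words = clean_words(" ".join(titles).split())
for word, count in count_words(words):
    print(word, count)
```

Sorting the dictionary items with `operator.itemgetter(1)` matches the `operator` module mentioned in the title; `collections.Counter` would be a shorter alternative.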

I had an error in this program with my BeautifulSoup4 module, and this other post helped me get rid of it: How to get rid of BeautifulSoup user warning? However, fixing it produced two more errors in my program. 1) An error with the link I pass to the first function:

File "/Users/lowryj1/PycharmProjects/untitled2/Jaer.py", line 39, in <module>
start('https://hongkong.asiaxpat.com/classifieds/glassware/')

And this is the code that is wrong (The link is the website I'm crawling):

start('https://hongkong.asiaxpat.com/classifieds/glassware/')

2) This is an error on the line in the first function where I try to make all of the characters lowercase and split the string. It produces this error:

File "/Users/lowryj1/PycharmProjects/untitled2/Jaer.py", line 11, in start
words = content.lower().split()
AttributeError: 'NoneType' object has no attribute 'lower'

And this is the code that is wrong:

words = content.lower().split()

This is the area I have the error (url is where my website url comes in):

def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, "html5lib")
    for post_text in soup.findAll('a', {'target': '_blank'}):
        content = post_text.string
        words = content.lower().split()  # <-- the AttributeError is raised here

I have tried my best to solve these problems, but most solutions I've tried only make the issues worse. Please help me solve these errors, as I was unable to find adequate solutions through research.

I need to see your code to help. Please add it: not too much code, but everything related to your problem, with unrelated details shortened. — Nikolay Prokopyev, Feb 03 '17 at 09:16
It looks like `content = post_text.string` is blank, so when you call `.lower()` on it in the next line, `words = content.lower().split()`, it throws the error. Can you check if there is content in `post_text.string`? — Craicerjack, Feb 03 '17 at 09:29
@Craicerjack When you say check, does printing out post_text.string work with checking it? If so, there is content in post_text.string. — J. Lowry, Feb 03 '17 at 09:35

score 0 · Answer 1 · answered Feb 03 '17 at 09:32

First, I see slightly different syntax for find_all in the bs4 docs.

But assuming your syntax is also correct, it fails because some of the found post_texts have no textual content (i.e. .string), so .string returns None. You need to check your anchors for this; it is probably a problem in the page source.

But if you just want to avoid the issue, use:

if post_text.string is not None:
    content = post_text.string
    words = content.lower().split()
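
To show why `.string` can be None, here is a minimal sketch on made-up markup (the `html` sample and the name `extract_words` are hypothetical, and it uses the built-in "html.parser" instead of html5lib so that nothing extra needs installing). `.string` is None whenever a tag does not contain exactly one child, for example a link that wraps an image plus text, or an empty link. As an alternative to the guard above, `get_text()` returns an empty string rather than None and also collects nested text:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real listing page.
html = """
<a target="_blank" href="/1">Wine glasses</a>
<a target="_blank" href="/2"><img src="x.jpg"> Crystal <b>vase</b></a>
<a target="_blank" href="/3"></a>
"""


def extract_words(html):
    """Collect lowercase words from every <a target="_blank"> link."""
    word_list = []
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", {"target": "_blank"}):
        # get_text() joins all nested strings and never returns None.
        content = link.get_text(" ", strip=True)
        word_list.extend(content.lower().split())
    return word_list


soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", {"target": "_blank"})

# Only the first link has exactly one text child, so only it has a .string.
for link in links:
    print(repr(link.string))

print(extract_words(html))
```

Note the difference in behaviour: the `is not None` check skips links like the second one entirely, while `get_text()` keeps their words. Which one is right depends on whether those links are real item titles.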

Nikolay Prokopyev

That does seem to solve the issue, but three other problems popped up when I did that, and I do not know how to solve them. – J. Lowry Feb 03 '17 at 09:44
You've still given too little information about the other problems. Also, unfortunately, I think you're actually asking us to write the code for your three tasks for you. That isn't how SO works. So, try to ask about exact problems. For example: "I have this code, but it gives me an error", or "I want to achieve some goal and tried my code, but it doesn't work", or "is there a method in some library to achieve my goal?" Try it! :-) – Nikolay Prokopyev Feb 03 '17 at 10:01
