I'm trying to build a web crawler that generates a text file for each of several different websites. After it crawls a website, it is supposed to collect all of the links on that page. However, I have run into a problem while crawling Wikipedia. The Python script gives me this error:
Traceback (most recent call last):
  File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
    urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None
I looked deeper into it by printing the discovered links, and the output has None at the top. I'm wondering if there is a way to check whether a variable actually has a value before writing it.
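From what I've read, the usual idiom is to test for None explicitly, so I was considering a guard along these lines before each write (not sure if this is the right approach):

href = link.get('href')
if href is not None:
    urlwaitinglist.write(href)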
Here is the code I have written so far:
from bs4 import BeautifulSoup
import os
import requests
import random
import re
toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text
removal_list = ["http://", "https://", "/"]
for word in removal_list:
    toscan = toscan.replace(word, "")
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()
print(soup.get_text())
directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"
results = soup.get_text()
results = results.strip()
f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()
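In case it clarifies what I'm after, this is the direction I was thinking of taking the link-collecting loop: skip anchors whose href is None, and open the waiting-list file once instead of once per link. This is just a sketch against the same paths as above, and I haven't verified it fixes everything:

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
plain_text = requests.get(url).text
soup = BeautifulSoup(plain_text, 'html.parser')

# Open the waiting list once rather than reopening it for every link
with open("/home/banana/Desktop/Search engine/data/toscan", "a") as urlwaitinglist:
    for link in soup.find_all('a'):
        href = link.get('href')
        # Anchors without an href attribute come back as None, so skip them
        if href is not None:
            urlwaitinglist.write('\n')
            urlwaitinglist.write(href)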