
I'm trying to build a web crawler that generates a text file for each of several different websites. After it crawls a page, it is supposed to collect all the links on that page. However, I have run into a problem while crawling Wikipedia. The Python script gives me this error:

Traceback (most recent call last):
  File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
    urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None

I looked into it further by printing the discovered links, and the first one printed is None. I'm wondering if there is a way to check whether the variable actually has a value before writing it.
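
Something like the snippet below is what I have in mind, but I don't know if checking against None like this is the right approach:

href = link.get('href')
if href is not None:  # skip <a> tags that have no href attribute
    urlwaitinglist.write('\n')
    urlwaitinglist.write(href)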

Here is the code I have written so far:

from bs4 import BeautifulSoup
import os
import requests
import random
import re

toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text

removal_list = ["http://", "https://", "/"]

for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()
    
print(soup.get_text())

directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"

results = soup.get_text()

results = results.strip()

f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()

1 Answer


It looks like not every <a> tag you are grabbing has an href value, so link.get('href') can return None. I would suggest checking that each link is not None before writing it. It is also bad practice to open a file without using the with statement. Below is an example, built on some of your code, that grabs every http/https link and writes it to a file:

from bs4 import BeautifulSoup
import requests
import re

file_directory = './'  # your specified directory location
filename = 'urls.txt'  # your specified filename

url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
res = requests.get(url)
html = res.text

soup = BeautifulSoup(html, 'html.parser')
links = []

for link in soup.find_all('a'):
    link = link.get('href')  # may be None if the <a> tag has no href attribute
    print(link)
    # keep only absolute http/https links; str() guards against None
    match = re.search('^(http|https)://', str(link))
    if match:
        links.append(str(link))

with open(file_directory + filename, 'w') as file:
    for link in links:
        file.write(link + '\n')
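
If you would rather keep your original approach of appending each link to the toscan file as it is found, a minimal sketch of the None check might look like this (the file path is just the one from your question):

with open("/home/banana/Desktop/Search engine/data/toscan", "a") as urlwaitinglist:
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None:  # skip <a> tags without an href attribute
            urlwaitinglist.write('\n')
            urlwaitinglist.write(href)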