
I am new to Python and I am trying to loop through a list of URLs in a CSV file and grab each website's title using BeautifulSoup, which I would then like to save to a file, Headlines.csv. But I am unable to grab the webpage title. If I use a variable with a single URL as follows:

import requests as req  # assumed alias; the snippet calls req.get
from bs4 import BeautifulSoup

url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'

resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title.text)

It works just fine and I get the title "Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space".
But when I use the loop:

import csv
with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')

        print(soup.title.text)

I get the following ['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']

and an error message

InvalidSchema: No connection adapters were found for "['\\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"

I am not sure what I am doing wrong.

Gargamel

2 Answers


You have a byte order mark (\ufeff) at the start of the URL you read from your file. It looks like the file was saved as UTF-8 with a signature (BOM), which corresponds to the utf-8-sig encoding.

You need to read the file with encoding='utf-8-sig'.

Read more here.
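
As a minimal sketch of the difference between the two codecs, the snippet below decodes the same bytes both ways. The byte string is built by hand to stand in for the first line of a BOM-prefixed file; it is an illustration, not the contents of your actual file.

```python
import codecs

# Stand-in for the first line of a UTF-8 file saved with a BOM
# (URL taken from the question; the bytes are constructed here for illustration).
raw = codecs.BOM_UTF8 + b"https://www.foxnews.com/us/this-day-in-history-july-16\n"

as_utf8 = raw.decode("utf-8")          # the BOM is kept as the character '\ufeff'
as_utf8_sig = raw.decode("utf-8-sig")  # the BOM is recognised and dropped

print(repr(as_utf8[:6]))      # '\ufeffhttps'
print(repr(as_utf8_sig[:6]))  # 'https:'
```

The same thing happens when the encoding is passed to open(): with 'utf-8' the BOM stays attached to the first value, with 'utf-8-sig' it is removed.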

Timmy Chan
  • Thank you. I came across the issue you have linked yesterday, but I was not sure which codec to use. I used your encoding together with the solution that Raymond C. suggested, and it works! – Gargamel Jul 16 '20 at 11:51

As the previous answer already mentioned, you need to change the encoding to get rid of the "\ufeff".

The second issue is that when you read a CSV file with csv.reader, each row comes back as a list containing all of that row's columns. The keyword here is list. You are passing requests a list instead of a string.
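
A small sketch of what csv.reader hands back may make this clearer. It uses an in-memory io.StringIO with placeholder example.com URLs instead of your file, so both the data and the variable names are assumptions for illustration.

```python
import csv
import io

# In-memory stand-in for a one-column CSV file; the URLs are placeholders.
sample = io.StringIO("https://example.com/a\nhttps://example.com/b\n")

rows = list(csv.reader(sample))
print(rows)           # [['https://example.com/a'], ['https://example.com/b']]
print(type(rows[0]))  # <class 'list'>

# The string form of a row is the bracketed text seen in the error message.
print(str(rows[0]))   # ['https://example.com/a']
```

Even with a single column, every row is a one-element list, which is why the bracketed text shows up inside the InvalidSchema message.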

Based on the example you have given, I assume that your URLs are in the first column of the CSV. Python list indices start at 0, not 1, so to get the URL you need to take index 0 of each row, which refers to the first column.

import csv

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])

To read up more on lists, you can refer here. You can add more columns to the CSV file and experiment to see how the results would appear. If you would like to refer to the column name while reading each row, you can refer here.
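
Putting both fixes together, one possible sketch of the whole task from the question (read the URLs, fetch each title, write Headlines.csv) could look like this. The function names read_urls, fetch_title and save_headlines, the "url"/"title" header row, the 10-second timeout and the use of the built-in 'html.parser' instead of 'lxml' are all assumptions made for this example, not something taken from the question.

```python
import csv


def read_urls(path):
    """Return the first column of every non-empty row, with the BOM removed."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def fetch_title(url):
    """Download a page and return its <title> text, or '' if it has none."""
    # Imported here so the CSV helpers can be tried without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx responses
    soup = BeautifulSoup(resp.text, "html.parser")  # 'lxml' works too if installed
    return soup.title.get_text(strip=True) if soup.title else ""


def save_headlines(in_path, out_path, fetch=fetch_title):
    """Write one (url, title) row per URL found in in_path."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "title"])  # hypothetical header row
        for url in read_urls(in_path):
            writer.writerow([url, fetch(url)])
```

Calling save_headlines('urls_file2.csv', 'Headlines.csv') would then produce the output file. The fetch parameter is only there so the CSV part can be tried with a stand-in function and no network access.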

Raymond C.