I have a problem in my scraping function.

In this project I have a sqlite3 database which contains links to musical albums' reviews. I created a scraper.py file which contains these two functions:

from bs4 import BeautifulSoup
import requests

def take_source(url):
    if 'http://' or 'https://' in url:
        source = requests.get(url).text
        return source
    else:
        print("Invalid URL")


def extract_corpus(source):
    soup = BeautifulSoup(source, "html.parser")
    soup.prettify().encode('cp1252', errors='ignore')
    corpus = []
    for e in soup.select("p"):
        corpus.append(e.text)

    return corpus

I call the extract_corpus function in a file called embedding.py. In this file I create a connection with the sqlite3 database and put the data in a pandas DataFrame. I want to store the content of all the links in a CSV file. My embedding.py file contains:

import sqlite3
import pandas as pd
import scraper
import csv

#create connection with sqlite db
con = sqlite3.connect("database.sqlite")

#creating a pandas data frame
query = pd.read_sql_query("SELECT url, artist, title FROM reviews;", con)


#populating data frame with urls
df = pd.DataFrame(query, columns=['url', 'artist', 'title'])

#preparing the .csv file for storing the reviews
with open('reviews.csv', 'w') as csvfile:
        fieldnames = ['title', 'artist', 'review']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

def append_csv(tit,art,rev):
    with open('reviews.csv','a') as csv_f:
        writer = csv.DictWriter(csv_f, fieldnames=fieldnames)
        writer.writerow({'title': tit, 'artist':art,'review':rev})

for i, row in df.iterrows():
    
    album = (str(row.__getitem__('title')))
    artist = (str(row.__getitem__('artist')))
    review = str(scraper.extract_corpus(scraper.take_source(str(row.__getitem__('url')))))
    append_csv(album,artist,review)
    

When I run this file, it works for an initial group of links, then it breaks, returning the error in the title. This is the full traceback:

Traceback (most recent call last):
  File "C:/Users/kikko/PycharmProjects/SongsBot/embedding.py", line 59, in <module>
    append_csv(album,artist,review)
  File "C:/Users/kikko/PycharmProjects/SongsBot/embedding.py", line 52, in append_csv
    writer.writerow({'title': tit, 'artist':art,'review':rev})
  File "C:\Users\kikko\AppData\Local\Programs\Python\Python37-32\lib\csv.py", line 155, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "C:\Users\kikko\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 1087: character maps to <undefined>

Unfortunately, I can't find the error.

  • In your own words, what do you expect `soup.prettify().encode('cp1252', errors='ignore')` to do? In particular, are you expecting the original `soup` to be modified? It does not: it creates instead a byte-encoding of the string, and then throws that away, unused. – Karl Knechtel Oct 14 '20 at 00:45

1 Answer


It seems like you have multiple misunderstandings here.

soup.prettify().encode('cp1252', errors='ignore')

This does nothing useful: you create a string representing the HTML source (with .prettify), encode it as bytes (.encode), and then do nothing with the resulting object. The soup is unmodified.

Fortunately, you don't need or want to do anything about the encoding at this point in the process anyway. But it would be better to remove this line entirely, to avoid misleading yourself.
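
To make that concrete, here is a small hypothetical snippet (not from your code) showing that the encoded bytes are simply thrown away and the soup is untouched:

from bs4 import BeautifulSoup

# .prettify() returns a new string; .encode() turns it into a bytes object.
# Neither call modifies `soup`, and the result is never used afterwards.
soup = BeautifulSoup("<p>café</p>", "html.parser")
unused_bytes = soup.prettify().encode('cp1252', errors='ignore')
print(type(unused_bytes))   # <class 'bytes'> -- discarded in your code
print(soup.p.text)          # 'café' -- the soup is unchanged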

for e in soup.select("p"):
    corpus.append(e.text)

return corpus

You will produce and return a list of strings, which later you are trying to convert to string forcibly using str. The result will show the representation of the list: i.e., it will be enclosed in [] and have commas separating the items and quotes and escape sequences for each string. This is probably not what you wanted.

I assume you wanted to join the strings together, for example like '\n'.join(corpus). However, multiple-line data like this is not appropriate to store in a CSV. (An escaped list representation is also rather awkward to store in a CSV. You should probably think more about how you want to format the data.)
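
One way to restructure extract_corpus along those lines (a sketch, not a drop-in fix; you still need to decide exactly how you want the review formatted):

from bs4 import BeautifulSoup

def extract_corpus(source):
    soup = BeautifulSoup(source, "html.parser")
    paragraphs = [p.text for p in soup.select("p")]
    # Join with spaces so the review stays on a single line, which is
    # friendlier for a CSV field than embedded newlines.
    return ' '.join(paragraphs)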

review = str(scraper.extract_corpus(scraper.take_source(str(row.__getitem__('url')))))

First off, you should not call double-underscore methods like __getitem__ directly. I know they are written that way in the documentation; that is just an artifact of how Python works in general. You are meant to use __getitem__ thus: row['url'].

You should expect the result to be a string already, so the inner str call is useless. Then you use take_source, which has this error:

if 'http://' or 'https://' in url:

This does not do what you want: the non-empty string 'http://' is always truthy, so the condition is effectively ('http://') or ('https://' in url), which always evaluates to true. The function will therefore always think the URL is "valid".
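
One possible fix is to use str.startswith with a tuple of prefixes, for example:

import requests

def take_source(url):
    # startswith accepts a tuple, so this checks for either scheme.
    if url.startswith(('http://', 'https://')):
        return requests.get(url).text
    else:
        print("Invalid URL")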

Anyway, once you manage to extract_corpus and forcibly produce a string from it, the actual problem you are asking about occurs:

with open('reviews.csv','a') as csv_f:

You cannot simply write any arbitrary string to a file in the cp1252 encoding (you know this is the one being used, because of the mention of cp1252.py in your stack trace; it is the default for your platform). This is the place where you are supposed to specify a file encoding. For example, you could specify that the file should be written using encoding='utf-8', which can handle any string. (You will also need to specify this explicitly when you open the file again for any other purpose.)
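
Concretely, that means passing the encoding to every open call on this file. A sketch, mirroring your existing code (newline='' is also recommended by the csv module documentation when writing CSV files):

import csv

fieldnames = ['title', 'artist', 'review']

# Write the header once, with an explicit encoding.
with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

def append_csv(tit, art, rev):
    with open('reviews.csv', 'a', encoding='utf-8', newline='') as csv_f:
        writer = csv.DictWriter(csv_f, fieldnames=fieldnames)
        writer.writerow({'title': tit, 'artist': art, 'review': rev})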

If you insist on doing the encoding manually, then you would need to .encode the thing you are .writeing to the file. However, because .encode produces the raw encoded bytes, you would then need to open the file in a binary mode (like 'ab'), and that would also mean you have to handle universal newline encoding yourself. It is not a pleasant task. Please just use the library according to how it was designed to be used.


When it comes to handling text encodings etc. properly, you cannot write correct code of decent quality simply by trying to fix each error as it comes up, doing a web search for each error or silencing a type error with a forced conversion. You must actually understand what is going on. I cannot stress this enough. Please start with a general introduction to Unicode and character encodings, and then read up on how Python 3 distinguishes text from bytes. Read both top to bottom, aiming to understand what is being said rather than trying to solve any specific problem.

Karl Knechtel
  • Thank you so much for your detailed and consistent response. I have successfully solved the problem I had thanks to your clarifications. Unfortunately, I wasn't too familiar with these concepts and I wrote some code without paying attention to these issues. After carefully reading the articles you passed to me, I can say that my ideas are much clearer. Thanks again for the time you gave me. – Enrico Collu Oct 14 '20 at 11:21