
I am having a problem reading my csv file into a MySQL database. I have tried a number of solutions, but the errors just keep changing and the code isn't working. The same code worked with another csv file, so I'm thinking I might be doing something wrong with this one?

Here is my code:

from database_access import *
from builtins import bytes, int, str
import codecs
import csv
import requests

from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import re
import cgi
import MySQLdb
import chardet

import simplejson



if __name__ == '__main__':

    with open("SIMRA.csv",'r') as file:
        reader = csv.reader(file)
        #reader = csv.reader(text)
        next(reader, None)
        print ("project running")
        #print (row[7])
        #rowlist = []
        all_links = []
        all_project_ids = []


        for row in reader:
            if row[7] != "" and row[16] != "":
                country = row[2]
                city = row[8]
                description = row[11] + '' + row[12]
                title = row[7].replace("'", "''")
                link = row[16]
                #date_start = row[9]

                #print a check here
                print(title, description, country, city, link)

                db = MySQLdb.connect(host, username, password, database, charset='utf8')
                cursor = db.cursor()
                new_project = True

                proj_check = "SELECT * from Projects where ProjectName like '%" + title + "%'"
                #proj_check = "SELECT * from Projects where ProjectName like %s",(title,)
                #cur.execute("SELECT * FROM records WHERE email LIKE %s", (search,))
                cursor.execute(proj_check)
                num_rows = cursor.rowcount
                if num_rows != 0:
                    new_project = False

                url_compare = "SELECT * from Projects where ProjectWebpage like '" + link + "'"
                #url_compare = "SELECT * from Projects where ProjectWebpage like %s",(link,)
                cursor.execute(url_compare)
                num_rows = cursor.rowcount
                if num_rows != 0:
                    new_project = False

                if new_project:
                    project_insert = "Insert into Projects (ProjectName,ProjectWebpage,FirstDataSource,DataSources_idDataSources) VALUES (%s,%s,%s,%s)"
                    cursor.execute(project_insert, (title, link, 'SIMRA', 5))
                    projectid = cursor.lastrowid
                    print(projectid)

                    #ashoka_projectids.append(projectid)
                    db.commit()

                    ins_desc = "Insert into AdditionalProjectData (FieldName,Value,Projects_idProjects,DateObtained) VALUES (%s,%s,%s,NOW())"
                    cursor.executemany(ins_desc, ("Description", description, str(projectid)))
                    db.commit()

                    ins_location = "Insert into ProjectLocation (Type,Country,City,Projects_idProjects) VALUES (%s,%s,%s,%s)"
                    cursor.execute(ins_location, ("Main", country, city, str(projectid)))
                    db.commit()

                else:
                    print('Project already exists!')
                    print(title)

                all_links.append(link)

    #print out SIMRA's links to a file for crawling later
    with open('simra_links', 'w', newline='') as f:
        write = csv.writer(f)
        for row in all_links:
            columns = [c.strip() for c in row.strip(', ').split(',')]
            write.writerow(columns)

When I ran this, I got the following error:

File "/usr/lib/python3.8/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 898: invalid start byte

I did some research and tried handling the encoding error by specifying different encodings, as suggested here - UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte, and Python MySQLdb TypeError: not all arguments converted during string formatting. I added this to the csv open parameters -

with open("SIMRA.csv", 'r', encoding="cp437", errors='ignore') as file:

Running the code with these different encoding options produced a different error:

MySQLdb._exceptions.ProgrammingError: not all arguments converted during bytes formatting

Further research suggested using tuples or lists to address this problem, so I added them to the select queries in the code, as suggested here - Python MySQLdb TypeError: not all arguments converted during string formatting - and in the Python MySQL documentation here - PythonMySqldb.

So the select queries became:

                   
                    proj_check = "SELECT * from Projects where ProjectName like %s",(title,)
                    cursor.execute(proj_check)
                    num_rows = cursor.rowcount
                    if num_rows != 0:
                        new_project = False

                    
                    url_compare = "SELECT * from Projects where ProjectWebpage like %s",(link,)
                    cursor.execute(url_compare)
                    num_rows = cursor.rowcount
                    if num_rows != 0:
                        new_project = False

When I ran the code, I got this AssertionError, and I have no idea what to do anymore:

File "/home/ros/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 205, in execute assert isinstance(query, (bytes, bytearray)) AssertionError

I have run out of ideas. It might be that I'm missing something small, but I can't figure it out, as I've been battling with this for two days now.

Can anyone help point out what I'm missing? It would be greatly appreciated. This code ran perfectly with another csv file. I am running this with Python 3.8, by the way.

Rose_Trojan
  • What’s the encoding of the file? What OS are you on? – Bohemian Jan 18 '22 at 19:18
  • Upon which line is this failing (in the original code)? – JNevill Jan 18 '22 at 19:20
  • Try loading 1 line at a time by programmatically putting each line in its own file to identify which lines(s) are causing the problem. Then closely examine each byte of the problem line(s) to discover if there is anything unusual. – Bohemian Jan 18 '22 at 19:28
  • @Bohemian I got two different encodings for the file. When I ran `with open("SIMRA.csv",'r') as file: print (file)` I obtained the encoding as `utf-8`, but using chardet `with open("SIMRA.csv", 'rb') as file: print(chardet.detect(file.read()))`, I obtained `Windows-1254`. I have tried both encodings at different times, but it still does not work. My OS is Windows. – Rose_Trojan Jan 18 '22 at 21:09
  • @JNevill it occurs on the proj_check line – Rose_Trojan Jan 18 '22 at 21:13
  • 1
    @Bohemian so, ran this section by section and found that none of the encodings worked. I tried a different one based on this - [link](https://exerror.com/unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-invalid-start-byte/) and used the ISO-8859-1 encoding and it worked – Rose_Trojan Jan 19 '22 at 00:00

2 Answers


I have solved this now. I had to use a different encoding with the original code, and this solved the problem. So I changed the csv open parameters to:

with open("SIMRA.csv",'r', encoding="ISO-8859-1") as file:
    reader = csv.reader(file)
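
As an aside, the AssertionError from my second attempt happened because I bundled the query string and its parameters into one tuple, so cursor.execute() received a tuple instead of a string. MySQLdb expects the SQL and the parameters as two separate arguments. A minimal sketch of the parameterized form, assuming the same Projects table as above:

proj_check = "SELECT * from Projects where ProjectName like %s"
# the wildcards go into the parameter value, not into the SQL string
cursor.execute(proj_check, ("%" + title + "%",))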
Rose_Trojan

Were you expecting £? (Byte 0xA3 is £ in latin1.) You need to specify what the encoding of the file is. It may be "latin1". See the syntax of LOAD DATA for how to specify CHARACTER SET latin1.
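
If you load the file with LOAD DATA instead of parsing it in Python, a minimal sketch might look like this (assuming the server permits LOCAL INFILE and that the file's columns line up with the target table's columns):

db = MySQLdb.connect(host, username, password, database,
                     charset='latin1', local_infile=True)
cursor = db.cursor()
# CHARACTER SET latin1 tells MySQL how the file's bytes are encoded,
# so 0xA3 is read as £ instead of failing as invalid utf-8
cursor.execute(r"""
    LOAD DATA LOCAL INFILE 'SIMRA.csv'
    INTO TABLE Projects
    CHARACTER SET latin1
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
""")
db.commit()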

Rick James
  • No, I wasn't expecting `£`. I've put a comment above on the encoding of the document. I got two different encodings, but Notepad++ showed its encoding to be `utf-8`. I've solved it now by specifying the encoding as `ISO-8859-1`. – Rose_Trojan Jan 19 '22 at 11:28