
I have 2 .txt files, which I converted into .csv files using https://convertio.co/csv-xlsx/. Now I would like to import these two .csv files into two tables of an SQLite database in Python (working in a Jupyter Notebook). The two .csv files are named person.csv and person_votes.csv. I followed the code given here (Importing a CSV file into a sqlite3 database table using Python):

import sqlite3, csv

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person (personid STR,age STR,sex STR,primary_voting_address_id STR,state_code STR,state_fips STR,county_name STR,county_fips STR,city STR,zipcode STR, zip4 STR,  PRIMARY KEY(personid))") 

with open('person.csv','r') as person_table: # `with` statement available in 2.5+
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(person_table) # comma is default delimiter
#personid   age sex primary_voting_address_id   state_code  state_fips  county_name county_fips city    zipcode zip4
    to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_flips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]

cur.executemany("INSERT INTO t (age, sex) VALUES (?, ?);", to_db)
con.commit()

I don't understand why, when I execute the code above, I keep getting the error message "KeyError: 'personid'". Could someone please help?

Also, if I create another table named to_db2 for the file person_votes.csv in the same Python file, would the following query give me all the elements common to the two tables:

select ID from to_db, to_db2 WHERE to_db.ID ==  to_db2

The link to the two .csv files above is here: https://drive.google.com/open?id=0B-cyvC6eCsyCQThUeEtGcWdBbXc.

user177196
  • https://csvkit.readthedocs.io/en/1.0.2/tutorial/3_power_tools.html#csvsql-and-sql2csv-ultimate-power – OneCricketeer Sep 03 '17 at 23:12
  • I tried their instructions, but failed right on the first line of code (`csvsql -i sqlite3 person.csv`): SyntaxError: invalid syntax at sqlite3 (what??) – user177196 Sep 03 '17 at 23:28
  • @cricket_007: could you please let me know why I got that error? I already imported csvkit and installed it into my Anaconda environment ;p – user177196 Sep 03 '17 at 23:45
  • Check that page again. Says `sqlite` – OneCricketeer Sep 04 '17 at 02:05
  • Thanks for your response. My Python version is 3.0+, so it does not have sqlite, only sqlite3, but I believe they are not much different. – user177196 Sep 04 '17 at 02:10
  • The `-i` flag is the type of database, not the python module, afaik – OneCricketeer Sep 04 '17 at 02:42
  • I tried omitting the flag completely, and I still failed with `csvsql person.csv` (i.e., still getting the `invalid syntax` error). Could you try building one yourself if you have time and let me know if your code works with these two .csv files? – user177196 Sep 04 '17 at 03:48

2 Answers


This works for me on Windows 10, but should work under Linux/Unix too. There are several problems:

  1. The last two rows of person.csv are not in the correct format, but this does not prevent the program from working. You can fix this with a text editor.
  2. person.csv uses tabs as the delimiter, not commas.
  3. There is a typo in the line that starts with "to_db =": it reads `state_flips` where the column header is `state_fips`.
  4. There is a mismatch in the number of columns to insert (2 instead of 11).
  5. The executemany call uses the wrong table name (`t` instead of `person`).

In addition, I create the database in a file rather than in memory. It is small enough that performance should not be a problem and also any changes you make will be saved.

Here is my corrected file (you can do the other table yourself):

import sqlite3, csv

# con = sqlite3.connect(":memory:")
con = sqlite3.connect("person.db")
cur = con.cursor()
cur.execute("CREATE TABLE person (personid STR,age STR,sex STR,primary_voting_address_id STR,state_code STR,state_fips STR,county_name STR,county_fips STR,city STR,zipcode STR, zip4 STR,  PRIMARY KEY(personid))") 

with open('person.csv','r') as person_table:
    dr = csv.DictReader(person_table, delimiter='\t') # person.csv is tab-delimited, so override the default comma delimiter
    to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_fips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]

cur.executemany("INSERT INTO person VALUES (?,?,?,?,?,?,?,?,?,?,?);", to_db)
con.commit()
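
For the second half of the question: once person_votes.csv has been loaded into a second table in the same way, the rows common to both tables come from a join on the shared key rather than from the `to_db.ID == to_db2` query in the question. A sketch, assuming the second table is named person_votes and also has a personid column:

cur.execute("""SELECT person.personid
               FROM person
               JOIN person_votes ON person.personid = person_votes.personid""")
common_ids = cur.fetchall()  # one tuple per personid present in both tables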
Marichyasana
  • thank you so much for your help, Sir! I will try it and let you know right away if I run into any problems executing your code. So far, the only difference I can see between yours and mine is the line `dr = csv.DictReader(person_table, delimiter='\t')` – user177196 Sep 09 '17 at 17:02
  • Btw, do you know if the `DictReader()` function works with a txt file? Apparently, when I tried using the website convertio.co to convert my `person.txt` file into a `.csv` file, it included only a portion of the data. – user177196 Sep 09 '17 at 17:44
  • @ghjk it depends on how the txt file is organized, do you have a link to it? – Marichyasana Sep 10 '17 at 05:27
  • I do!! Here is the link you need: https://drive.google.com/open?id=0B-cyvC6eCsyCTXBUa1lOSHRVOFU. Please let me know in which situations the above code does not work for a txt file, and how to fix it if that's the case. – user177196 Sep 11 '17 at 05:55
  • Basically the txt file has to be organized in columns with a single tab between columns. If a value is missing you will then have two tabs. Both of your txt files work fine; a short sketch of reading the txt file directly follows these comments. If, for example, a value has an embedded tab then you have to edit it to use quotes. A very simple test is to open the txt file with Excel. – Marichyasana Sep 11 '17 at 11:02
  • Could you clarify *edit it to use quotes*? I usually open the txt file in Sublime Text (I use a Mac) and go directly into the data file to fix the places I need. But it's *very* hard to figure out whether a *large* data file (say 100k+ rows) has missing data, and in which row/column the missing value is, so I don't really like this mechanism. Do you have any suggestion? Opening a txt file with Excel and then saving it back as txt seems to change the UTF-8 encoding or something, no?? – user177196 Sep 11 '17 at 18:16
  • If the delimiter character is part of one of your values you need to put quotes around the value so sqlite can parse it correctly. I use csvlint to validate the csv file; it will tell you line numbers. It is available at `https://github.com/Clever/csvlint` and you can download it from `https://github.com/Clever/csvlint/releases` – Marichyasana Sep 11 '17 at 21:59
  • Sorry, I was finally able to download csvlint. Now, could I use the `import()` function in Python to call this package when writing code in Jupyter Notebook? Otherwise, based on my understanding from the csvlint command-line window, I would need to write `csvlint([file directory])` to execute this package? – user177196 Sep 13 '17 at 17:40
  • csvlint is a stand alone program used to validate a csv data file. I run it from the command line. – Marichyasana Sep 14 '17 at 02:14
  • what is the syntax you needed in order to validate a .csv file (e.g. `person.csv`)? – user177196 Sep 15 '17 at 03:27
  • csvlint -delimiter=\t person.csv – Marichyasana Sep 16 '17 at 00:44
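
Regarding the .txt question in the comments above: `csv.DictReader` reads the original tab-delimited .txt file directly, so the convertio.co conversion step can be skipped. A minimal sketch, assuming person.txt has a header row and single tabs between columns:

import csv

with open('person.txt', 'r') as txt_file:
    dr = csv.DictReader(txt_file, delimiter='\t')  # DictReader works on any text file, not just .csv
    rows = list(dr)

print(len(rows))       # number of data rows read
print(rows[0].keys())  # column names taken from the header line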

Looks like you might be missing some column names in your INSERT INTO ... statement.

Probably not great practice leaving the Primary Key as NULL too.
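
For example, naming all eleven columns explicitly (including personid, so the primary key is not left NULL) and matching the placeholder count. A sketch against the person table from the question:

cur.executemany("""INSERT INTO person
    (personid, age, sex, primary_voting_address_id, state_code, state_fips,
     county_name, county_fips, city, zipcode, zip4)
    VALUES (?,?,?,?,?,?,?,?,?,?,?);""", to_db)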

Ash B
  • Thank you for your comment. I did specify Primary Key = personID. Did you miss it? Other than that, the error I got came from this line: `to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_flips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]`, so I am not sure if the line `INSERT INTO....` is the reason. I still get the same error message when specifying ALL the column names, to be honest. – user177196 Sep 03 '17 at 23:15
  • After I tried adding all the column names into `INSERT INTO...`, I got a new error message from the `cursor.execute...` statement: `OperationalError: table person already exists.` If I change the name of the table from `person` to `person1`, I end up with the original error message at line 11: KeyError: 'personid'. Not sure why:( – user177196 Sep 03 '17 at 23:20
  • @Marichyasana: personID is unique, isn't it? Did you take a look at the .csv files?? – user177196 Sep 03 '17 at 23:27
  • @Marichyasana: could you try importing the two csv files by fixing my code above, and let me know what went wrong? I still could not see why I got that error:( – user177196 Sep 04 '17 at 00:50
  • @ghjk I downloaded the two files using wget and checked the first one with csvlint and a little bit using R. Give me a little time and I'll see. – Marichyasana Sep 05 '17 at 05:43
  • @Marichyasana: thank you so much for getting your hands dirty!! I am surprised you used R (not an expert on it, but I know a little). Could you please show your code in either R or Python once you are able to import the two datasets into one database and populate them in two separate tables? – user177196 Sep 05 '17 at 06:11