
I have 2 .txt files, which I converted into .csv files using https://convertio.co/csv-xlsx/. Now I would like to import these two .csv files into two tables of an SQLite database in Python (working in a Jupyter Notebook). The two .csv files are named person.csv and person_votes.csv. I followed the code given here (Importing a CSV file into a sqlite3 database table using Python):

import sqlite3, csv

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person (personid STR,age STR,sex STR,primary_voting_address_id STR,state_code STR,state_fips STR,county_name STR,county_fips STR,city STR,zipcode STR, zip4 STR,  PRIMARY KEY(personid))") 

with open('person.csv','r') as person_table: # `with` statement available in 2.5+
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(person_table) # comma is default delimiter
#personid   age sex primary_voting_address_id   state_code  state_fips  county_name county_fips city    zipcode zip4
    to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_flips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]

cur.executemany("INSERT INTO t (age, sex) VALUES (?, ?);", to_db)
con.commit()

I don't understand why, when I execute the code above, I keep getting the error message "KeyError: 'personid'". Could someone please help?

Also, if I create another table named to_db2 for the file person_votes.csv in the same Python file, would the following query give me all the elements common to the two tables:

select ID from to_db, to_db2 WHERE to_db.ID ==  to_db2

The link to the two .csv files above is here: https://drive.google.com/open?id=0B-cyvC6eCsyCQThUeEtGcWdBbXc.

user177196
  • https://csvkit.readthedocs.io/en/1.0.2/tutorial/3_power_tools.html#csvsql-and-sql2csv-ultimate-power – OneCricketeer Sep 03 '17 at 23:12
  • I tried their instructions, but failed right on the first line of code (`csvsql -i sqlite3 person.csv`): SyntaxError: invalid syntax at sqlite3 (what??) – user177196 Sep 03 '17 at 23:28
  • @cricket_007: could you please let me know why I got that error? I already imported csvkit and installed it into my Anaconda environment ;p – user177196 Sep 03 '17 at 23:45
  • Check that page again. Says `sqlite` – OneCricketeer Sep 04 '17 at 02:05
  • Thanks for your response. My Python version is 3.0+, so it does not have sqlite, only sqlite3, but I believe they are not much different. – user177196 Sep 04 '17 at 02:10
  • The `-i` flag is the type of database, not the python module, afaik – OneCricketeer Sep 04 '17 at 02:42
  • I tried omitting the flag completely, and I still failed with `csvsql person.csv` (i.e., still getting the `invalid syntax` error). Could you try building one yourself if you have time and let me know if your code works with these two .csv files? – user177196 Sep 04 '17 at 03:48

2 Answers


This works for me on Windows 10, but should work under Linux/Unix too. There are several problems:

  1. The last two rows of person.csv are not in the correct format, but this does not prevent the program from working. You can fix this with a text editor.
  2. person.csv uses tabs as the delimiter, not commas.
  3. There is a typo in the line that starts with "to_db =": it reads `state_flips` where the column header is `state_fips`.
  4. There is a mismatch in the number of columns to insert (2 instead of 11).
  5. The executemany call uses the wrong table name (`t` instead of `person`).

In addition, I create the database in a file rather than in memory. It is small enough that performance should not be a problem and also any changes you make will be saved.

Here is my corrected file (you can do the other table yourself):

import sqlite3, csv

# con = sqlite3.connect(":memory:")
con = sqlite3.connect("person.db")
cur = con.cursor()
cur.execute("CREATE TABLE person (personid STR,age STR,sex STR,primary_voting_address_id STR,state_code STR,state_fips STR,county_name STR,county_fips STR,city STR,zipcode STR, zip4 STR,  PRIMARY KEY(personid))") 

with open('person.csv','r') as person_table:
    dr = csv.DictReader(person_table, delimiter='\t') # person.csv is tab-delimited, so override the default comma delimiter
    to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_fips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]

cur.executemany("INSERT INTO person VALUES (?,?,?,?,?,?,?,?,?,?,?);", to_db)
con.commit()
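
For the second half of the question: once person_votes.csv has been loaded into a second table in the same way, the rows common to both tables come from a join on the shared key rather than from the `to_db.ID == to_db2` query in the question. A sketch, assuming the second table is named person_votes and also has a personid column:

cur.execute("""SELECT person.personid
               FROM person
               JOIN person_votes ON person.personid = person_votes.personid""")
common_ids = cur.fetchall()  # one tuple per personid present in both tables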
Marichyasana
  • thank you so much for your help, Sir! I will try it and let you know right away if I run into any problems executing your code. So far, the only difference I can see between yours and mine is the line `dr = csv.DictReader(person_table, delimiter='\t')` – user177196 Sep 09 '17 at 17:02
  • Btw, do you know if the `DictReader()` function works with a txt file? Apparently, when I tried using the website convertio.co to convert my `person.txt` file into a `.csv` file, it included only a portion of the data. – user177196 Sep 09 '17 at 17:44
  • @ghjk it depends on how the txt file is organized, do you have a link to it? – Marichyasana Sep 10 '17 at 05:27
  • I do!! Here is the link you need: https://drive.google.com/open?id=0B-cyvC6eCsyCTXBUa1lOSHRVOFU. Please let me know in which situations the above code does not work for a txt file, and how to fix it if that's the case. – user177196 Sep 11 '17 at 05:55
  • Basically the txt file has to be organized in columns with a single tab between columns. If a value is missing you will then have two tabs. Both of your txt files work fine; a short sketch of reading the txt file directly follows these comments. If, for example, a value has an embedded tab then you have to edit it to use quotes. A very simple test is to open the txt file with Excel. – Marichyasana Sep 11 '17 at 11:02
  • Could you clarify *edit it to use quotes*? I usually open the txt file in Sublime Text (I use a Mac) and go directly into the data file to fix the places I need. But it's *very* hard to figure out whether a *large* data file (say 100k+ rows) has missing data, and in which row/column the missing value is, so I don't really like this mechanism. Do you have any suggestion? Opening a txt file with Excel and then saving it back as txt seems to change the UTF-8 encoding or something, no?? – user177196 Sep 11 '17 at 18:16
  • If the delimiter character is part of one of your values you need to put quotes around the value so sqlite can parse it correctly. I use csvlint to validate the csv file; it will tell you line numbers. It is available at `https://github.com/Clever/csvlint` and you can download it from `https://github.com/Clever/csvlint/releases` – Marichyasana Sep 11 '17 at 21:59
  • Sorry, I was finally able to download csvlint. Now, could I use the `import()` function in Python to call this package when writing code in Jupyter Notebook? Otherwise, based on my understanding from the csvlint command-line window, I would need to write `csvlint([file directory])` to execute this package? – user177196 Sep 13 '17 at 17:40
  • csvlint is a stand alone program used to validate a csv data file. I run it from the command line. – Marichyasana Sep 14 '17 at 02:14
  • what is the syntax you needed in order to validate a .csv file (e.g. `person.csv`)? – user177196 Sep 15 '17 at 03:27
  • csvlint -delimiter=\t person.csv – Marichyasana Sep 16 '17 at 00:44
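
Regarding the .txt question in the comments above: `csv.DictReader` reads the original tab-delimited .txt file directly, so the convertio.co conversion step can be skipped. A minimal sketch, assuming person.txt has a header row and single tabs between columns:

import csv

with open('person.txt', 'r') as txt_file:
    dr = csv.DictReader(txt_file, delimiter='\t')  # DictReader works on any text file, not just .csv
    rows = list(dr)

print(len(rows))       # number of data rows read
print(rows[0].keys())  # column names taken from the header line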

Looks like you might be missing some column names in your INSERT INTO ... statement.

Probably not great practice leaving the Primary Key as NULL too.
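
For example, naming all eleven columns explicitly (including personid, so the primary key is not left NULL) and matching the placeholder count. A sketch against the person table from the question:

cur.executemany("""INSERT INTO person
    (personid, age, sex, primary_voting_address_id, state_code, state_fips,
     county_name, county_fips, city, zipcode, zip4)
    VALUES (?,?,?,?,?,?,?,?,?,?,?);""", to_db)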

Ash B
  • Thank you for your comment. I did specify Primary Key = personID. Did you miss it? Other than that, the error I got came from this line: `to_db = [(i['personid'], i['age'], i['sex'], i['primary_voting_address_id'], i['state_code'], i['state_flips'], i['county_name'], i['county_fips'], i['city'], i['zipcode'], i['zip4']) for i in dr]`, so I am not sure if the line `INSERT INTO....` is the reason. I still get the same error message when specifying ALL the column names, to be honest. – user177196 Sep 03 '17 at 23:15
  • After I tried adding all the column names into `INSERT INTO...`, I got a new error message from the `cursor.execute...` statement: `OperationalError: table person already exists.` If I change the name of the table from `person` to `person1`, I end up with the original error message at line 11: KeyError: 'personid'. Not sure why:( – user177196 Sep 03 '17 at 23:20
  • @Marichyasana: personID is unique, isn't it? Did you take a look at the .csv files?? – user177196 Sep 03 '17 at 23:27
  • @Marichyasana: could you try importing the two csv files by fixing my code above, and let me know what went wrong? I still could not see why I got that error:( – user177196 Sep 04 '17 at 00:50
  • @ghjk I downloaded the two files using wget and checked the first one with csvlint and a little bit using R. Give me a little time and I'll see. – Marichyasana Sep 05 '17 at 05:43
  • @Marichyasana: thank you so much for getting your hands dirty!! I am surprised you used R (not an expert on it, but I know a little). Could you please show your code in either R or Python once you are able to import the two datasets into one database and populate them in two separate tables? – user177196 Sep 05 '17 at 06:11