150

I have a CSV file and I want to bulk-import this file into my sqlite3 database using Python. The command is ".import .....", but it seems that it cannot work like this. Can anyone give me an example of how to do it in sqlite3? I am using Windows, just in case. Thanks.

igauravsehrawat
Hossein
  • Please provide the **actual** command that didn't work and the **actual** error message. "import...." could be anything. "cannot work" is too vague for us to guess at. Without details, we can't help. – S.Lott May 22 '10 at 11:30
  • the actual command as I said is ".import" and it says syntax error new ".import" – Hossein May 22 '10 at 11:36
  • Please actually post the actual command in the question. Please actually post the actual error message in the question. Please do not add comments that simply repeat things. Please update the question with actual copy and paste of what you're actually doing. – S.Lott May 22 '10 at 11:46
  • FWIW `.import` is the command for importing files within the SQLite interactive shell. – snakecharmerb Jul 19 '22 at 06:41

19 Answers

178
import csv, sqlite3

con = sqlite3.connect(":memory:") # use a file path such as 'your_filename.db' to persist to disk
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2);") # use your column names here

with open('data.csv','r') as fin: # `with` statement available in 2.5+
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(fin) # comma is default delimiter
    to_db = [(i['col1'], i['col2']) for i in dr]

cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
con.close()
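
For variations that come up repeatedly in the comments below, here is a hedged sketch of the same approach that writes to a database file on disk and reads a tab-separated file; the file, table and column names are only illustrative.

import csv, sqlite3

con = sqlite3.connect("example.db")  # an on-disk database file instead of :memory:
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (col1, col2);")

with open("data.tsv", "r", newline="") as fin:
    dr = csv.DictReader(fin, delimiter="\t")  # pass a delimiter for non-comma files
    to_db = [(row["col1"], row["col2"]) for row in dr]

cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
con.close()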
mechanical_meat
  • I keep getting `not all arguments converted during string formatting` when I attempt this method. – Whitecat Sep 01 '16 at 23:27
  • I tried this method, but it doesn't work for me. Could you check out my datasets here (they are very normal, except some columns have empty values) and try importing them with your code? https://stackoverflow.com/questions/46042623/two-questions-on-sqlite3-and-python-with-input-as-csv-files?noredirect=1#comment79045812_46042623 – user177196 Sep 04 '17 at 20:41
  • having 20MB csv file, script breaks in halfway – holms Apr 30 '18 at 14:51
  • I'm using your method to read the csv file (in python 3.6) but I keep getting a `KeyError` for the first column. Any ideas? – Isaac Jan 06 '19 at 22:07
  • If you want to save the created db on disk, first create an empty file, then give its path in place of `:memory:`. – LoMaPh Feb 07 '19 at 22:59
  • This code is not optimized for very large csv files (order of GBs) – Nisba Jul 26 '19 at 11:00
  • `_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)`. csv on python 3 no support binary mode. – e-info128 Aug 29 '19 at 22:59
  • In python 3.6 and above, use `'r'` instead of `'rb'` to read the csv. – eddyizm Dec 19 '19 at 03:56
  • Isaac, if getting a `KeyError` then the column name (or delimiter) does not match. In a tab-separated file, may have to change this line: `dr = csv.DictReader(fin, delimiter='\t')` – rdtsc Jul 16 '22 at 16:04
139

Creating an sqlite connection to a file on disk is left as an exercise for the reader ... but there is now a two-liner made possible by the pandas library

df = pandas.read_csv(csvfile)
df.to_sql(table_name, conn, if_exists='append', index=False)
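
For completeness, a minimal self-contained sketch with the disk connection filled in; the file and table names are assumptions for illustration only.

import sqlite3
import pandas

conn = sqlite3.connect("my_data.db")   # the on-disk database file
df = pandas.read_csv("data.csv")       # load the CSV into a DataFrame
df.to_sql("my_table", conn, if_exists="append", index=False)
conn.close()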
Tennessee Leeuwenburg
  • thank you. I got an issue with panda. my csv is delimited by ';' and have ',' in entries. panda gives error on read_csv. any setting to read entries with commas w/out temporarily replace? – Alex Martian Jun 19 '16 at 06:42
  • use sep=';'. The pandas documentation clearly outlines how to deal with this. – Tennessee Leeuwenburg Jun 20 '16 at 08:01
  • is there a way to use pandas but without using the RAM?, i have a huge .csv (7gb) i can't import as a dataframe and then appended to the DB. – Pablo Nov 18 '16 at 12:57
  • Yes, there's a method in pandas which will read in chunks rather than all at once. I'm afraid I can't recall exactly off the top of my head. I think you add chunksize=, and then you get back an iterator which you can then use to append to a database piecewise. Let me know if you have trouble finding it and I can dig out a recipe. – Tennessee Leeuwenburg Dec 04 '16 at 07:00
  • Very nice, @TennesseeLeeuwenburg. I didn't have a need for `df` so I shortened your example down to: `pandas.read_csv(csvfile).to_sql(table_name, conn, if_exists='append', index=False)` – keithpjolley May 23 '19 at 19:56
  • Just a quick comment, the table name should be in parentheses. Example: pandas.read_csv(csvfile.to_sql('table_1', conn, if_exists='append', index=False). Here is a reference to pandas documentation demonstrating this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html – Erich Purpur Jul 01 '20 at 13:32
  • the problem with pandas is the data frames they store data in are very memory consuming especially for a bulk import – Eric Hedengren Nov 10 '20 at 00:15
  • @Pablo the chunksize argument may be what you're looking for (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html ) – KBurchfiel Sep 18 '21 at 19:50
37

You're right that .import is the way to go, but that's a command from the SQLite3 command line program. A lot of the top answers to this question involve native python loops, but if your files are large (mine are 10^6 to 10^7 records), you want to avoid reading everything into pandas or using a native python list comprehension/loop (though I did not time them for comparison).

For large files, I believe the best option is to use subprocess.run() to execute sqlite's import command. In the example below, I assume the table already exists and that the csv file has headers in the first row. See the .import docs for more info.

subprocess.run()

import subprocess
from pathlib import Path
db_name = Path('my.db').resolve()
csv_file = Path('file.csv').resolve()
result = subprocess.run(['sqlite3',
                         str(db_name),
                         '-cmd',
                         '.mode csv',
                         '.import --skip 1 ' + str(csv_file).replace('\\','\\\\')
                                 +' <table_name>'],
                        capture_output=True)

edit note: sqlite3's .import command has improved so that it can treat the first row as header names or even skip the first x rows (requires version >= 3.32, as noted in this answer). If you have an older version of sqlite3, you may need to first create the table, then strip off the first row of the csv before importing. The --skip 1 argument will give an error prior to 3.32.
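
If you are unsure which CLI version is installed, a small hedged check that shells out to the same sqlite3 binary the import relies on:

import subprocess

# Prints the version of the sqlite3 command-line tool, e.g. "3.39.4";
# `.import --skip` needs 3.32 or later.
result = subprocess.run(['sqlite3', '--version'], capture_output=True, text=True)
print(result.stdout.split()[0])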

Explanation
From the command line, the command you're looking for is sqlite3 my.db -cmd ".mode csv" ".import file.csv table". subprocess.run() runs a command line process. The argument to subprocess.run() is a sequence of strings which are interpreted as a command followed by all of its arguments.

  • sqlite3 my.db opens the database
  • the -cmd flag after the database allows you to pass multiple follow-on commands to the sqlite program. In the shell, each command has to be in quotes, but here they just need to be their own element of the sequence
  • '.mode csv' does what you'd expect
  • '.import --skip 1 ' + str(csv_file).replace('\\','\\\\') + ' <table_name>' is the import command.
    Unfortunately, since subprocess passes all follow-ons to -cmd as quoted strings, you need to double up your backslashes if you have a Windows directory path.

Stripping Headers

Not really the main point of the question, but here's what I used. Again, I didn't want to read the whole file into memory at any point:

import shutil

with open(csv_file, "r") as source:
    source.readline()  # consume (and discard) the header row
    with open(str(csv_file) + "_nohead", "w") as target:
        shutil.copyfileobj(source, target)

Jake Stevens-Haas
  • Couldn't make `--skip 1` work with 3.32.3 and 3.36.0 – roman Jun 29 '21 at 08:31
  • @roman from the command line or `subprocess.run()` ? – Jake Stevens-Haas Jul 12 '21 at 17:49
  • Thanks for the solution! A remark... your solution requires to have already installed `sqlite` in the os. Do you know if it possible to make it independent from external installations? I mean, `sqlite` is part of the standard library so, I guess, it should be installed somewhere in the os, and being able to use that it could be great! – cards May 12 '22 at 19:00
  • @cards I don't think it is. The `sqlite` built-in library imports directly from `_sqlite`, which is written in C. In it, header files state: `#include "sqlite3.h"`. These are provided from having sqlite already installed on the system. I don't know much about `.configure` and `make`, but I didn't see anything that would build this header - it expects your OS and your compiler know where to find sqlite3.h. So I don't think any standard builds of CPython distribute a copy of sqlite3. – Jake Stevens-Haas May 28 '22 at 17:18
14

My 2 cents (more generic):

import csv, sqlite3
import logging

def _get_col_datatypes(fin):
    dr = csv.DictReader(fin) # comma is default delimiter
    fieldTypes = {}
    for entry in dr:
        feildslLeft = [f for f in dr.fieldnames if f not in fieldTypes.keys()]
        if not feildslLeft: break # We're done
        for field in feildslLeft:
            data = entry[field]

            # Need data to decide
            if len(data) == 0:
                continue

            if data.isdigit():
                fieldTypes[field] = "INTEGER"
            else:
                fieldTypes[field] = "TEXT"
        # TODO: Currently there's no support for DATE in SQLite

    if len(feildslLeft) > 0:
        raise Exception("Failed to find all the columns data types - Maybe some are empty?")

    return fieldTypes


def escapingGenerator(f):
    for line in f:
        yield line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def csvToDb(csvFile, outputToFile = False):
    # TODO: implement output to file

    with open(csvFile,mode='r', encoding="ISO-8859-1") as fin:
        dt = _get_col_datatypes(fin)

        fin.seek(0)

        reader = csv.DictReader(fin)

        # Keep the order of the columns name just as in the CSV
        fields = reader.fieldnames
        cols = []

        # Set field and type
        for f in fields:
            cols.append("%s %s" % (f, dt[f]))

        # Generate create table statement:
        stmt = "CREATE TABLE ads (%s)" % ",".join(cols)

        con = sqlite3.connect(":memory:")
        cur = con.cursor()
        cur.execute(stmt)

        fin.seek(0)


        reader = csv.reader(escapingGenerator(fin))

        # Generate insert statement:
        stmt = "INSERT INTO ads VALUES(%s);" % ','.join('?' * len(cols))

        cur.executemany(stmt, reader)
        con.commit()

    return con
Madhawa Priyashantha
Guy L
13

The .import command is a feature of the sqlite3 command-line tool. To do it in Python, you should simply load the data using whatever facilities Python has, such as the csv module, and insert the data as usual.

This way, you also have control over what types are inserted, rather than relying on sqlite3's seemingly undocumented behaviour.
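
As a concrete illustration of that approach, here is a minimal sketch using only the standard library; the file, table and column names are assumptions, not part of the original answer.

import csv
import sqlite3

con = sqlite3.connect("example.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (col1, col2);")

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if the file has one
    cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", reader)  # executemany accepts any iterable of rows

con.commit()
con.close()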

Marcelo Cantos
  • There is no need to prepare the insert. The source of SQL statements and compiled results are kept in a cache. – John Machin Jun 17 '10 at 04:13
  • @John Machin: Is there a link to how SQLite does this? – Marcelo Cantos Jun 17 '10 at 07:34
  • @Marcelo: If you are interested in HOW it's done (why?), look in the sqlite source or ask on the sqlite mailing list. – John Machin Jun 17 '10 at 07:54
  • @John Machin: I'm interested because in all the SQLite documentation that I've come across, there is not a single word about automatic caching of unprepared statements. I don't think it is reasonable to have to read source code or probe mailing lists to discover something as basic as whether I should prepare my SQL statements or not. What is your source of information on this? – Marcelo Cantos Jun 17 '10 at 08:40
  • @Marcelo: Actually it's done in the Python sqlite3 wrapper module. http://docs.python.org/library/sqlite3.html#module-functions-and-constants says """The sqlite3 module internally uses a statement cache to avoid SQL parsing overhead. If you want to explicitly set the number of statements that are cached for the connection, you can set the cached_statements parameter. The currently implemented default is to cache 100 statements.""" – John Machin Jun 17 '10 at 10:25
  • @John Machin: Thank you; that's very useful to know. In fact, after a peruse through the module docs, it seems that there is no way to explicitly prepare a SQLite statement, so the cache is the only way statements can be prepared. – Marcelo Cantos Jun 17 '10 at 11:56
10

Many thanks for bernie's answer! Had to tweak it a bit - here's what worked for me:

import csv, sqlite3
conn = sqlite3.connect("pcfc.sl3")
curs = conn.cursor()
curs.execute("CREATE TABLE PCFC (id INTEGER PRIMARY KEY, type INTEGER, term TEXT, definition TEXT);")
reader = csv.reader(open('PC.txt', 'r'), delimiter='|')
for row in reader:
    to_db = [unicode(row[0], "utf8"), unicode(row[1], "utf8"), unicode(row[2], "utf8")] # Python 2's unicode(); on Python 3 the csv rows are already str
    curs.execute("INSERT INTO PCFC (type, term, definition) VALUES (?, ?, ?);", to_db)
conn.commit()

My text file (PC.txt) looks like this:

1 | Term 1 | Definition 1
2 | Term 2 | Definition 2
3 | Term 3 | Definition 3
jiy
9
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys, csv, sqlite3

def main():
    con = sqlite3.connect(sys.argv[1]) # database file input
    cur = con.cursor()
    cur.executescript("""
        DROP TABLE IF EXISTS t;
        CREATE TABLE t (COL1 TEXT, COL2 TEXT);
        """) # checks to see if table exists and makes a fresh table.

    with open(sys.argv[2], "rb") as f: # CSV file input
        reader = csv.reader(f, delimiter=',') # no header information with delimiter
        for row in reader:
            to_db = [unicode(row[0], "utf8"), unicode(row[1], "utf8")] # Appends data from CSV file representing and handling of text
            cur.execute("INSERT INTO neto (COL1, COL2) VALUES(?, ?);", to_db)
            con.commit()
    con.close() # closes connection to database

if __name__=='__main__':
    main()
Christopher
7

Here are solutions that'll work if your CSV file is really big. Use to_sql as suggested by another answer, but set chunksize so it doesn't try to process the whole file at once.

import sqlite3
import pandas as pd

conn = sqlite3.connect('my_data.db')
c = conn.cursor()
users = pd.read_csv('users.csv')
users.to_sql('users', conn, if_exists='append', index = False, chunksize = 10000)
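
If even reading the CSV in one go is too much, pandas can also read it in chunks; a hedged sketch (the chunk size and file names are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('my_data.db')
# read_csv with chunksize returns an iterator of DataFrames rather than one big frame
for chunk in pd.read_csv('users.csv', chunksize=10000):
    chunk.to_sql('users', conn, if_exists='append', index=False)
conn.close()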

You can also use Dask, as described here, to write a lot of Pandas DataFrames in parallel:

import dask

# `ddf` is assumed to be a Dask DataFrame and `db_url` a SQLAlchemy-style
# database URL, e.g. 'sqlite:///my_data.db'
dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'table_name', db_url, if_exists='append', index=True)
       for d in ddf.to_delayed()]
dask.compute(*out)

See here for more details.

Powers
7
"""
cd Final_Codes
python csv_to_db.py
CSV to SQL DB
"""

import csv
import sqlite3
import os
import fnmatch

UP_FOLDER = os.path.dirname(os.getcwd())
DATABASE_FOLDER = os.path.join(UP_FOLDER, "Databases")
DBNAME = "allCompanies_database.db"


def getBaseNameNoExt(givenPath):
    """Returns the basename of the file without the extension"""
    filename = os.path.splitext(os.path.basename(givenPath))[0]
    return filename


def find(pattern, path):
    """Utility to find files wrt a regex search"""
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result


if __name__ == "__main__":
    Database_Path = os.path.join(DATABASE_FOLDER, DBNAME)  # full path of the output .db file
    csv_files = find('*.csv', DATABASE_FOLDER)

    con = sqlite3.connect(Database_Path)
    cur = con.cursor()
    for each in csv_files:
        with open(each, 'r') as fin:  # `with` statement available in 2.5+
            # csv.DictReader uses first line in file for column headings by default
            dr = csv.DictReader(fin)  # comma is default delimiter
            TABLE_NAME = getBaseNameNoExt(each)
            Cols = dr.fieldnames
            numCols = len(Cols)
            """
            for i in dr:
                print(i.values())
            """
            to_db = [tuple(i.values()) for i in dr]
            print(TABLE_NAME)
            # use your column names here
            ColString = ','.join(Cols)
            QuestionMarks = ["?"] * numCols
            ToAdd = ','.join(QuestionMarks)
            cur.execute(f"CREATE TABLE {TABLE_NAME} ({ColString});")
            cur.executemany(
                f"INSERT INTO {TABLE_NAME} ({ColString}) VALUES ({ToAdd});", to_db)
            con.commit()
    con.close()
    print("Execution Complete!")

This should come in handy when you have a lot of csv files in a folder which you wish to convert to a single .db file in one go!

Notice that you don't have to know the filenames, table names or field names (column names) beforehand!

snakecharmerb
Farhan Hai Khan
5

If the CSV file must be imported as part of a python program, then for simplicity and efficiency, you could use os.system along the lines suggested by the following:

import os

cmd = """sqlite3 database.db <<< ".import input.csv mytable" """

rc = os.system(cmd)

print(rc)

The point is that by specifying the filename of the database, the data will automatically be saved, assuming there are no errors reading it.
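
Note that <<< is a bash herestring and os.system goes through the default shell, so this may not work on Windows or with /bin/sh. A hedged alternative sketch that feeds the dot-commands to sqlite3 on standard input instead:

import subprocess

result = subprocess.run(
    ["sqlite3", "database.db"],
    input=".mode csv\n.import input.csv mytable\n",  # dot-commands passed on stdin
    capture_output=True, text=True,
)
print(result.returncode)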

peak
4

You can do this using blaze & odo efficiently

import blaze as bz
csv_path = 'data.csv'
bz.odo(csv_path, 'sqlite:///data.db::data')

Odo will store the csv file to data.db (sqlite database) under the table named data

Or you can use odo directly, without blaze. Either way is fine. Read this documentation
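
For the odo-only route, a minimal hedged sketch using the same URI as above (odo is no longer actively maintained, so treat this as illustrative):

from odo import odo

# same source file and target URI as the blaze example above
odo('data.csv', 'sqlite:///data.db::data')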

Jacob
Kathirmani Sukumar
4

Based on Guy L's solution (love it), but this version can also handle escaped fields.

import csv, sqlite3

def _get_col_datatypes(fin):
    dr = csv.DictReader(fin) # comma is default delimiter
    fieldTypes = {}
    for entry in dr:
        feildslLeft = [f for f in dr.fieldnames if f not in fieldTypes.keys()]        
        if not feildslLeft: break # We're done
        for field in feildslLeft:
            data = entry[field]

            # Need data to decide
            if len(data) == 0:
                continue

            if data.isdigit():
                fieldTypes[field] = "INTEGER"
            else:
                fieldTypes[field] = "TEXT"
        # TODO: Currently there's no support for DATE in SQLite

    if len(feildslLeft) > 0:
        raise Exception("Failed to find all the columns data types - Maybe some are empty?")

    return fieldTypes


def escapingGenerator(f):
    for line in f:
        yield line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def csvToDb(csvFile,dbFile,tablename, outputToFile = False):

    # TODO: implement output to file

    with open(csvFile,mode='r', encoding="ISO-8859-1") as fin:
        dt = _get_col_datatypes(fin)

        fin.seek(0)

        reader = csv.DictReader(fin)

        # Keep the order of the columns name just as in the CSV
        fields = reader.fieldnames
        cols = []

        # Set field and type
        for f in fields:
            cols.append("\"%s\" %s" % (f, dt[f]))

        # Generate create table statement:
        stmt = "create table if not exists \"" + tablename + "\" (%s)" % ",".join(cols)
        print(stmt)
        con = sqlite3.connect(dbFile)
        cur = con.cursor()
        cur.execute(stmt)

        fin.seek(0)


        reader = csv.reader(escapingGenerator(fin))

        # Generate insert statement:
        stmt = "INSERT INTO \"" + tablename + "\" VALUES(%s);" % ','.join('?' * len(cols))

        cur.executemany(stmt, reader)
        con.commit()
        con.close()
Jace
3

The following also derives the field names from the CSV header:

import sqlite3

def csv_sql(file_dir,table_name,database_name):
    con = sqlite3.connect(database_name)
    cur = con.cursor()
    # Drop the current table by: 
    # cur.execute("DROP TABLE IF EXISTS %s;" % table_name)

    with open(file_dir, 'r') as fl:
        hd = fl.readline()[:-1].split(',')
        ro = fl.readlines()
        db = [tuple(ro[i][:-1].split(',')) for i in range(len(ro))]

    header = ','.join(hd)
    cur.execute("CREATE TABLE IF NOT EXISTS %s (%s);" % (table_name,header))
    cur.executemany("INSERT INTO %s (%s) VALUES (%s);" % (table_name,header,('?,'*len(hd))[:-1]), db)
    con.commit()
    con.close()

# Example:
csv_sql('./surveys.csv','survey','eco.db')
Ashkan Mirzaee
2

In the interest of simplicity, you could use the sqlite3 command-line tool from the Makefile of your project.

%.sql3: %.csv
    rm -f $@
    sqlite3 $@ -echo -cmd ".mode csv" ".import $< $*"
%.dump: %.sql3
    sqlite3 $< "select * from $*"

Running make test.sql3 then creates the sqlite database from an existing test.csv file, with a single table "test". You can then make test.dump to verify the contents.

jcomeau_ictx
2

With this you can do joins on CSVs as well:

import sqlite3
import os
import pandas as pd
from typing import List

class CSVDriver:
    def __init__(self, table_dir_path: str):
        self.table_dir_path = table_dir_path  # where tables (ie. csv files) are located
        self._con = None

    @property
    def con(self) -> sqlite3.Connection:
        """Make a singleton connection to an in-memory SQLite database"""
        if not self._con:
            self._con = sqlite3.connect(":memory:")
        return self._con
    
    def _exists(self, table: str) -> bool:
        query = """
        SELECT name
        FROM sqlite_master 
        WHERE type ='table'
        AND name NOT LIKE 'sqlite_%';
        """
        tables = [name for (name,) in self.con.execute(query).fetchall()]  # fetchall returns 1-tuples
        return table in tables

    def _load_table_to_mem(self, table: str, sep: str = None) -> None:
        """
        Load a CSV into an in-memory SQLite database
        sep is set to None in order to force pandas to auto-detect the delimiter
        """
        if self._exists(table):
            return
        file_name = table + ".csv"
        path = os.path.join(self.table_dir_path, file_name)
        if not os.path.exists(path):
            raise ValueError(f"CSV table {table} does not exist in {self.table_dir_path}")
        df = pd.read_csv(path, sep=sep, engine="python")  # set engine to python to skip pandas' warning
        df.to_sql(table, self.con, if_exists='replace', index=False, chunksize=10000)

    def query(self, query: str) -> List[tuple]:
        """
        Run an SQL query on CSV file(s). 
        Tables are loaded from table_dir_path
        """
        tables = extract_tables(query)
        for table in tables:
            self._load_table_to_mem(table)
        cursor = self.con.cursor()
        cursor.execute(query)
        records = cursor.fetchall()
        return records

extract_tables():

import sqlparse
from sqlparse.sql import IdentifierList, Identifier,  Function
from sqlparse.tokens import Keyword, DML
from collections import namedtuple
import itertools

class Reference(namedtuple('Reference', ['schema', 'name', 'alias', 'is_function'])):
    __slots__ = ()

    def has_alias(self):
        return self.alias is not None

    @property
    def is_query_alias(self):
        return self.name is None and self.alias is not None

    @property
    def is_table_alias(self):
        return self.name is not None and self.alias is not None and not self.is_function

    @property
    def full_name(self):
        if self.schema is None:
            return self.name
        else:
            return self.schema + '.' + self.name

def _is_subselect(parsed):
    if not parsed.is_group:
        return False
    for item in parsed.tokens:
        if item.ttype is DML and item.value.upper() in ('SELECT', 'INSERT',
                                                        'UPDATE', 'CREATE', 'DELETE'):
            return True
    return False


def _identifier_is_function(identifier):
    return any(isinstance(t, Function) for t in identifier.tokens)


def _extract_from_part(parsed):
    tbl_prefix_seen = False
    for item in parsed.tokens:
        if item.is_group:
            for x in _extract_from_part(item):
                yield x
        if tbl_prefix_seen:
            if _is_subselect(item):
                for x in _extract_from_part(item):
                    yield x
            # An incomplete nested select won't be recognized correctly as a
            # sub-select. eg: 'SELECT * FROM (SELECT id FROM user'. This causes
            # the second FROM to trigger this elif condition resulting in a
            # StopIteration. So we need to ignore the keyword if the keyword
            # FROM.
            # Also 'SELECT * FROM abc JOIN def' will trigger this elif
            # condition. So we need to ignore the keyword JOIN and its variants
            # INNER JOIN, FULL OUTER JOIN, etc.
            elif item.ttype is Keyword and (
                    not item.value.upper() == 'FROM') and (
                    not item.value.upper().endswith('JOIN')):
                tbl_prefix_seen = False
            else:
                yield item
        elif item.ttype is Keyword or item.ttype is Keyword.DML:
            item_val = item.value.upper()
            if (item_val in ('COPY', 'FROM', 'INTO', 'UPDATE', 'TABLE') or
                    item_val.endswith('JOIN')):
                tbl_prefix_seen = True
        # 'SELECT a, FROM abc' will detect FROM as part of the column list.
        # So this check here is necessary.
        elif isinstance(item, IdentifierList):
            for identifier in item.get_identifiers():
                if (identifier.ttype is Keyword and
                        identifier.value.upper() == 'FROM'):
                    tbl_prefix_seen = True
                    break


def _extract_table_identifiers(token_stream):
    for item in token_stream:
        if isinstance(item, IdentifierList):
            for ident in item.get_identifiers():
                try:
                    alias = ident.get_alias()
                    schema_name = ident.get_parent_name()
                    real_name = ident.get_real_name()
                except AttributeError:
                    continue
                if real_name:
                    yield Reference(schema_name, real_name,
                                    alias, _identifier_is_function(ident))
        elif isinstance(item, Identifier):
            yield Reference(item.get_parent_name(), item.get_real_name(),
                            item.get_alias(), _identifier_is_function(item))
        elif isinstance(item, Function):
            yield Reference(item.get_parent_name(), item.get_real_name(),
                            item.get_alias(), _identifier_is_function(item))


def extract_tables(sql):
    # let's handle multiple statements in one sql string
    extracted_tables = []
    statements = list(sqlparse.parse(sql))
    for statement in statements:
        stream = _extract_from_part(statement)
        extracted_tables.append([ref.name for ref in _extract_table_identifiers(stream)])
    return list(itertools.chain(*extracted_tables))

Example (assuming account.csv and tojoin.csv exist in /path/to/files):

db_path = r"/path/to/files"
driver = CSVDriver(db_path)
query = """
SELECT tojoin.col_to_join 
FROM account
LEFT JOIN tojoin
ON account.a = tojoin.a
"""
driver.query(query)
Michał Zawadzki
1
import csv, sqlite3

def _get_col_datatypes(fin):
    dr = csv.DictReader(fin) # comma is default delimiter
    fieldTypes = {}
    for entry in dr:
        feildslLeft = [f for f in dr.fieldnames if f not in fieldTypes.keys()]
        if not feildslLeft: break # We're done
        for field in feildslLeft:
            data = entry[field]

            # Need data to decide
            if len(data) == 0:
                continue

            if data.isdigit():
                fieldTypes[field] = "INTEGER"
            else:
                fieldTypes[field] = "TEXT"
        # TODO: Currently there's no support for DATE in SQLite

    if len(feildslLeft) > 0:
        raise Exception("Failed to find all the columns data types - Maybe some are empty?")

    return fieldTypes


def escapingGenerator(f):
    for line in f:
        yield line.encode("ascii", "xmlcharrefreplace").decode("ascii")


def csvToDb(csvFile,dbFile,tablename, outputToFile = False):

    # TODO: implement output to file

    with open(csvFile,mode='r', encoding="ISO-8859-1") as fin:
        dt = _get_col_datatypes(fin)

        fin.seek(0)

        reader = csv.DictReader(fin)

        # Keep the order of the columns name just as in the CSV
        fields = reader.fieldnames
        cols = []

        # Set field and type
        for f in fields:
            cols.append("\"%s\" %s" % (f, dt[f]))

        # Generate create table statement:
        stmt = "create table if not exists \"" + tablename + "\" (%s)" % ",".join(cols)
        print(stmt)
        con = sqlite3.connect(dbFile)
        cur = con.cursor()
        cur.execute(stmt)

        fin.seek(0)


        reader = csv.reader(escapingGenerator(fin))

        # Generate insert statement:
        stmt = "INSERT INTO \"" + tablename + "\" VALUES(%s);" % ','.join('?' * len(cols))

        cur.executemany(stmt, reader)
        con.commit()
        con.close()
Bussller
Ramy Awad
1

I've found that it can be necessary to break up the transfer of data from the csv to the database in chunks so as not to run out of memory. This can be done like this:

import csv
import sqlite3
from operator import itemgetter

# Establish connection
conn = sqlite3.connect("mydb.db")

# Create the table 
conn.execute(
    """
    CREATE TABLE persons(
        person_id INTEGER,
        last_name TEXT, 
        first_name TEXT, 
        address TEXT
    )
    """
)

# These are the columns from the csv that we want
cols = ["person_id", "last_name", "first_name", "address"]

# If the csv file is huge, we instead add the data in chunks
chunksize = 10000

# Parse csv file and populate db in chunks
with conn, open("persons.csv") as f:
    reader = csv.DictReader(f)

    chunk = []
    for i, row in enumerate(reader):

        if i % chunksize == 0 and i > 0:
            conn.executemany(
                """
                INSERT INTO persons
                    VALUES(?, ?, ?, ?)
                """, chunk
            )
            chunk = []

        items = itemgetter(*cols)(row)
        chunk.append(items)

    # Insert whatever is left over in the final, partial chunk
    if chunk:
        conn.executemany(
            """
            INSERT INTO persons
                VALUES(?, ?, ?, ?)
            """, chunk
        )

Peter H.
1

Here is my version; it works by asking you to select the '.csv' file you want to convert.

import pandas as pd
import sqlite3 
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename
from pathlib import Path

def csv_to_db(csv_filedir):

    if not Path(csv_filedir).is_file():                         # if needed ask for user input of CSV file
        current_path = os.getcwd()
        Tk().withdraw()                                     
        csv_filedir = askopenfilename(initialdir=current_path) 

    try:
        data = pd.read_csv(csv_filedir)                             # load CSV file
    except:
        print("Something went wrong when opening to the file")
        print(csv_filedir)

    csv_df = pd.DataFrame(data)
    csv_df = csv_df.fillna('NULL')                              # make NaN = to 'NULL' for SQL format

    [path,filename] = os.path.split(csv_filedir)                # define path and filename 
    [filename,_] = os.path.splitext(filename)
    database_filedir = os.path.join(path, filename + '.db')

    conn = sqlite3.connect(database_filedir)                    # connect to SQL server

    [fields_sql, header_sql_string] = create_sql_fields(csv_df)

    # CREATE EMPTY DATABASE
    create_sql = ''.join(['CREATE TABLE IF NOT EXISTS ' + filename + ' (' + fields_sql + ')'])
    cursor = conn.cursor()
    cursor.execute(create_sql)
    
    # INSERT EACH ROW IN THE SQL DATABASE
    for irow in csv_df.itertuples():
        insert_values_string = ''.join(['INSERT INTO ', filename, header_sql_string, ' VALUES ('])
        insert_sql = f"{insert_values_string} {irow[1]}, '{irow[2]}','{irow[3]}', {irow[4]}, '{irow[5]}' )"
        print(insert_sql)
        cursor.execute(insert_sql)

    # COMMIT CHANGES TO DATABASE AND CLOSE CONNECTION
    conn.commit()
    conn.close()

    print('\n' + csv_filedir + ' \n converted to \n' + database_filedir)

    return database_filedir


def create_sql_fields(df):                                          # gather the headers of the CSV and create two strings 
    fields_sql = []                                                 # str1 = var1 TYPE, va2, TYPE ...
    header_names = []                                               # str2 = var1, var2, var3, var4
    for col in range(0,len(df.columns)):
        fields_sql.append(df.columns[col])
        fields_sql.append(str(df.dtypes[col]))

        header_names.append(df.columns[col])
        if col != len(df.columns)-1:
            fields_sql.append(',')
            header_names.append(',')

    fields_sql = ' '.join(fields_sql)
    fields_sql = fields_sql.replace('int64','integer')
    fields_sql = fields_sql.replace('float64','real')  # map pandas floats to SQLite REAL
    fields_sql = fields_sql.replace('object','text')

    header_sql_string = '(' + ''.join(header_names) + ')'
    
    return fields_sql, header_sql_string


csv_to_db('')
Bas
-1

To send CSV data to an sqlite3 database you can use the following method:

import csv
import sqlite3
import re

def csv_to_db(csv_filepath,db_filepath):
    data = []
    with open(csv_filepath) as file:
        reader = csv.reader(file)
        for row in reader:
            data.append(tuple(row))  # insert operation in sqlite3 takes tuples as input


    connection = sqlite3.connect(db_filepath)
    cursor = connection.cursor()


    # Insert new rows
    rows=data
    
    for row in rows:
        # `table` is a placeholder: set it to your table name, and adjust the
        # number of ?s to match the number of columns (see the author's comment below)
        cursor.execute(f"INSERT INTO {table} VALUES(?,?,?,?,?,?)", row)
    

    connection.commit()
    connection.close()
  • There are already a lot of answers here. Please explain your code. – ChrisGPT was on strike Aug 25 '23 at 23:40
  • This method takes your csv filepath and database (SQLite) filepath as input and copies your csv data to your sqlite database table. ( Note: the table name should be specified in the cursor.execute line and the amount of VALUES (?) to be inserted should be adjusted according to the number of fields in your table) – SHUBHAM MAHAJAN Aug 27 '23 at 10:47