
I'm currently learning how to code and I have run into this challenge that I have been trying to solve for the last couple of days.

I have over 2000 CSV files that I would like to import into a particular PostgreSQL table at once, instead of using the Import Data function in pgAdmin 4, which only allows me to import one CSV file at a time. How should I go about doing this? I'm using Windows OS.

2 Answers


A simple way is to use Cygwin or an embedded Ubuntu shell (WSL) to run this script:

all_files=("file_1.csv" "file_2.csv") # or glob every CSV in the directory, e.g. all_files=(*.csv) when run from that directory

dir_name=<path_to_files>

export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>

echo "write db"
for file in "${all_files[@]}"; do
  # \copy loads each CSV into the target table; replace <table_name_here> with your table
  psql -d "$db_name" -c "\copy <table_name_here> FROM '$dir_name/$file' WITH (FORMAT csv, HEADER)" >/dev/null
done
RuS
  • I have installed Cygwin, but with dir_name set to the path E:\eventssmap I'm getting a "command not found" error – TryingtoCode Nov 13 '19 at 17:49

If you want to do this purely in Python, then I have given an approach below. It's possible that you wouldn't need to chunk the list at all (that you could hold all of the files in memory at once rather than processing them in batches). It's also possible that the files are radically different sizes and you'd need something more sophisticated than fixed-size batches to prevent you creating an in-memory file object that exceeds your RAM. Or you might choose to do it in 2000 separate transactions, but I suspect some kind of batching will be faster (untested).

import csv
import io
import os
import psycopg2

CSV_DIR = 'the_csv_folder/' # Relative path here, might need to be an absolute path

def chunks(l, n):
    """ 
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]


# Get a list of all the CSV files in the directory
all_files = [f for f in os.listdir(CSV_DIR) if f.endswith('.csv')]

# Chunk the list of files. Let's go with 100 files per chunk, can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:

    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)

    for file in chunk:
        # newline='' is the recommended way to open files for the csv module
        with open(CSV_DIR + file, newline='') as infile:
            reader = csv.reader(infile)
            data = list(reader)  # csv.reader has no readlines(); consume the rows into a list

        # Transfer the read data to the aggregated file
        csv_writer.writerows(data)

    # Now we have aggregated the chunk, copy the file to Postgres
    with psycopg2.connect(dbname='the_database_name', 
                          user='the_user_name',
                          password='the_password', 
                          host='the_host') as conn:
        c = conn.cursor()

        # Headers need to be the table field names, in the order they appear
        # in the csv
        headers = ['first_name', 'last_name', ...]

        # Rewind the buffer, then upload the data as though it was a file
        string_buffer.seek(0)
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()
roganjosh
  • Thanks @roganjosh, I have edited the code a bit and it now runs, but when I call the function "chunks (1,2000)" I'm getting this message "" and no data is being copied to PostgreSQL. Any guidance from here? – TryingtoCode Nov 15 '19 at 02:17
  • @TryingtoCode you edited it to create a generator. I don't know why you've done that, but I can't debug what I can't see, and my answer doesn't cause this issue, so I'm not sure what I'm supposed to answer – roganjosh Nov 16 '19 at 09:15
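
For reference on the comment exchange above, here is a minimal sketch, assuming the edit rewrote chunks with yield, of why a generator version can appear to do nothing until it is iterated. The function names below are illustrative and are not the code from the answer.

def chunks_list(l, n):
    # Returns a concrete list of chunks, like the chunks() in the answer above
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

def chunks_gen(l, n):
    # Rewritten with yield: calling it only creates a generator object,
    # and no work happens until something iterates over it
    n = max(1, n)
    for i in range(0, len(l), n):
        yield l[i:i + n]

files = ['a.csv', 'b.csv', 'c.csv']
print(chunks_list(files, 2))        # [['a.csv', 'b.csv'], ['c.csv']]
print(chunks_gen(files, 2))         # <generator object chunks_gen at 0x...>
print(list(chunks_gen(files, 2)))   # materialised: [['a.csv', 'b.csv'], ['c.csv']]

The for loop in the answer would iterate either version just the same, so if the chunked result was printed or inspected directly rather than looped over, that could explain seeing a generator repr instead of data.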