1

I have a 10GB csv file (34 million rows) with data(without column description/header) that needs to be populated in a Postgres database. The row data has columns that need to go in different Models.

I have the following DB schema:

enter image description here

Currently what I do is:

  1. Loop through rows:
    1. Create instance B with specific columns from row and append to an array_b
    2. Create instance C with specific columns from row and append to an array_c
    3. Create instance A with specific columns from row and relation to B and C, and append to an array_a
  2. Bulk create in order: B, C and A

This works perfectly fine, however, it takes 4 hours to populate the DB. I wanted to optimize the population and came across the psql COPY FROM command. So I thought I would be able to do something like:

  1. Create instance A with specific columns from the file
    1. for foreign key to C
      • create instance C with specific columns from the row
    2. for foreign key to B
      • create instance B with specific columns from the row
  2. Go to 1.

After a short research on how to do that, I found out that it does not allow table manipulations while copying data (such as looking up to another table for fetching proper foreign keys to insert)

Anyone can guide me in what to look for, any other method or 'hack' on how to optimize the data population?

Thanks in advance.

Management command:

with open(settings.DATA_PATH, 'r') as file:
    csvreader = csv.reader(file)
    next(csvreader)

    b_array = []
    c_array = []
    a_array = []

    for row in csvreader:
        some_data = row[0]
        some_data = row[1]
        ....

        b = B(
          some_data=some_data            
        )
        b_array.append(b)

        c = C(
          some_data=some_data
        )
        c_array.append(c)


        a = A(
          some_data=_some_data,
          b=b,
          c=c          
        )
        a_array.append(property)

    B.objects.bulk_create(b_array)
    C.objects.bulk_create(c_array)
    A.objects.bulk_create(a_array)

References:

antpngl92
  • 494
  • 4
  • 12

0 Answers0