
In a PostgreSQL database:

from django.db import models

class Persons(models.Model):
    # max_length must fit the longest name; 'Name-1000000' is 12 characters
    person_name = models.CharField(max_length=12, unique=True)

The persons.csv file contains 1 million names.

$ cat persons.csv
Name-1
Name-2
...
Name-1000000

I want to:

  1. Create the names that do not already exist
  2. Query the database and fetch the id for each name contained in the csv file.

My approach:

  1. Use the COPY command, or the django-postgres-copy application that implements it.

    Also take advantage of the new PostgreSQL 9.5+ upsert feature, INSERT ... ON CONFLICT (a sketch of this step appears below the list).

  2. Now all the names in the csv file are also in the database.

    I need to fetch their ids from the database, either in memory or in another csv file, in an efficient way:

    • Use Q objects, OR-ed together:

      from functools import reduce
      import operator

      list_of_million_q = <iterate csv and append Q(person_name=...) objects>
      million_names = Persons.objects.filter(reduce(operator.or_, list_of_million_q))
      

      or

    • Use __in to filter based on a list of names:

      list_of_million_names = <iterate csv and append strings>
      million_names = Persons.objects.filter(
          person_name__in=list_of_million_names
      )
      

      or

      ?

I do not feel that any of the above approaches for fetching the ids is efficient.
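For reference, step 1 could look like the following minimal sketch of the staging-table idea. Assumptions (not from the original): psycopg2 as the driver, a placeholder connection string, a table named persons as in the plain-SQL examples below, and a made-up staging table tmp_persons.

    import psycopg2

    # Sketch only: "dbname=mydb" and tmp_persons are placeholders.
    conn = psycopg2.connect("dbname=mydb")

    with conn, conn.cursor() as cur:
        # Stage the csv rows in a temp table that disappears at commit.
        cur.execute(
            "CREATE TEMP TABLE tmp_persons (person_name varchar(200)) "
            "ON COMMIT DROP"
        )
        with open("persons.csv") as f:
            cur.copy_expert("COPY tmp_persons (person_name) FROM STDIN", f)
        # Upsert: insert the new names, silently skip the existing ones.
        cur.execute(
            "INSERT INTO persons (person_name) "
            "SELECT person_name FROM tmp_persons "
            "ON CONFLICT (person_name) DO NOTHING"
        )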

Update

There is a third option, along the lines of this post, which should be a great solution combining all of the above.


2 Answers


Something like:

SELECT * FROM persons;

Make a `name: id` dictionary out of the names received from the database:

db_dict = {'Harry': 1, 'Bob': 2, ...}

Query the dictionary:

ids = []
for name in list_of_million_names:
    if name in db_dict:
        ids.append(db_dict[name])

This way you're using quick O(1) dictionary lookups as opposed to the slower O(n) `if x in list` approach.
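In Django terms, the same idea could look like this sketch, using the question's Persons model; values_list() avoids building full model instances and iterator() streams rows instead of caching the whole queryset at once:

    # Build the name -> id mapping straight from the database.
    db_dict = dict(
        Persons.objects.values_list("person_name", "id").iterator()
    )
    ids = [db_dict[name] for name in list_of_million_names if name in db_dict]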

But the only way to really know for sure is to benchmark these 3 approaches.

  • If the table `Persons` has millions of entries, `SELECT * FROM persons;` will bloat the memory, won't it? – raratiru Apr 15 '18 at 16:27
  • Even 10 million entries with names 10 characters long will only take up about half a gig of memory, so that's not really a bottleneck. – Joost Apr 15 '18 at 16:36
  • I understand your point, but I am running on a budget server with 1 GB of memory for all operations! Probably something along the lines of [this answer](https://stackoverflow.com/a/46953777/2996101) could do the job. – raratiru Apr 15 '18 at 16:39
  • Is this something that you need to do often? Because if so, you should probably save the output somewhere. – Joost Apr 15 '18 at 16:43
  • Yes, this will be done often. It would be ideal to `insert` or `do nothing` and save the returned ids to a csv file. – raratiru Apr 15 '18 at 16:46

This post describes how to use RETURNING with ON CONFLICT: while the contents of the csv file are inserted into the database, the ids are saved in another table, both when an insertion succeeds and when, due to the unique constraint, it is skipped.

I have tested it in sqlfiddle, using a setup that resembles the one needed for the COPY command, which inserts into the database straight from a csv file while respecting the unique constraint.

The schema:

CREATE TABLE IF NOT EXISTS label (
  id serial PRIMARY KEY,
  label_name varchar(200) NOT NULL UNIQUE
);
INSERT INTO label (label_name) VALUES
  ('Name-1'),
  ('Name-2');

CREATE TABLE IF NOT EXISTS ids (
  id serial PRIMARY KEY,
  label_ids varchar(12) NOT NULL
);

The script:

CREATE TEMP TABLE tmp_table
(LIKE label INCLUDING DEFAULTS)
ON COMMIT DROP;

INSERT INTO tmp_table (label_name) VALUES
  ('Name-2'),
  ('Name-3');

WITH ins AS (
  INSERT INTO label
  SELECT *
  FROM tmp_table
  ON CONFLICT (label_name) DO NOTHING
  RETURNING id               -- ids of the rows actually inserted
)
INSERT INTO ids (label_ids)
SELECT id FROM ins           -- the newly inserted names
UNION ALL
SELECT l.id                  -- the names that already existed
FROM tmp_table
JOIN label l USING (label_name);

The output: `ids` receives the id of the freshly inserted Name-3 (via RETURNING) and the pre-existing id of Name-2 (via the join, since the outer SELECT does not see rows inserted by the CTE itself):

SELECT * FROM ids;
SELECT * FROM label;
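And since the goal in the comments above was to land the ids in a csv file, COPY also works in the outbound direction; a sketch, again assuming psycopg2 and the `ids` table from the schema:

    # Dump the collected ids into a csv file via COPY ... TO STDOUT.
    with conn.cursor() as cur, open("ids.csv", "w") as f:
        cur.copy_expert("COPY ids (label_ids) TO STDOUT (FORMAT csv)", f)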