0

I have MySQL database with table into which I insert from multiple files using LOAD DATA LOCAL INFILE ... statement. I have PRIMARY KEY ID set to auto_increment. The problem is, when I want to update only part of the table.

Say I've inserted file_1, file_2, file_3 in the past and now I want to update only file_2. I imagine the process in pseudo workflow

  1. delete old data related to file_2
  2. insert new data from file_2

However, it is hard to determine, which data are originally from file_2. In order to find out, I've come up with this idea:

When I insert the data, I will note the ids of the rows, which I've inserted, since I am using auto_increment I can note something like from_id, to_id for each of the file. Then, when I want to update only file_x I will delete only the data with from_id <= id <= to_id (where from_id, to_id relates to the file_x).

After little bit of searching, I've found out about @@identity and last_insert_id() (see), however, when I use select last_insert_id() after LOAD DATA LOCAL INFILE I get only one id, and not the maximal id corresponding to the data, but the last added (as it is defined). I am connecting to the database from Python using mysql.connnector using

cur.execute("select last_insert_id();")
print(cur.fetchall())
# gives
# [(<some_number>,)]

So, is there a way, how to retrieve all (or at least the minimal and maximal) ids which were assigned to the data imported using the LOAD DATA LOCAL INFILE... statement as mentioned above?

Community
  • 1
  • 1
quapka
  • 2,799
  • 4
  • 21
  • 35
  • 1
    `LOAD DATA INFILE` into a separate table then `UPDATE destination_table d JOIN separate_table s ON d.field = s.field SET d.other_field = s.other_field` – axiac May 30 '16 at 12:45
  • @axiac Care to explaine more? – quapka May 30 '16 at 12:52
  • I rushed with a suggestion based on the sentence *"I want to update only part of the table."* from the question. I see now you don't want to update but to replace some rows. However, I wrote an answer. Option #2 worked fine for me in the past (but my goal was to update some record, not to replace them). – axiac May 30 '16 at 19:15
  • Thanks @axiac for your help. Adding new column holding the data source is probably the easiest and simplest idea. Thanks for your suggestions. I am going to try that and accept your answer if I manage to implement it (otherwise ask you for details). Anyway, thanks. – quapka May 31 '16 at 09:12

1 Answers1

1

If you need to remember the source of each record from the table then you better store the information in a field.

I would add a new field (src) of type TINYINT to the table and store the ID of the source (1 for file_1, 2 for file_2 a.s.o.). I assume there won't be more than 255 sources; otherwise use SHORTINT for its type.

Then, when you need to update the records imported from file_2 you have two options:

  1. delete all the records having src = 2 then load the new records from file into the table; this is not quite an update, it is a replacement;
  2. load the new records from file into a new table then copy from it the values you need to update the existing records.

Option #1

Deletion is an easy job:

DELETE FROM table_1 WHERE src = 2

Loading the new data and setting the value of src to 2 is also easy (it is explained in the documentation):

LOAD DATA INFILE 'file.txt'
INTO TABLE table_1
  (column1, column2, column42)    # Put all the columns names here
                                  # in the same order the values appear in the file
SET src = 2                       # Set values for other columns too

If there are columns in the file that you don't need then load their values into variables and simply ignore the variables. For example, if the third column from the file doesn't contain useful information you can use:

INTO TABLE table_1 (column1, column2, @unused, column42, ...)

A single variable (I called it @unused but it can have any name) can be used to load data from all the columns you want to ignore.

Option #2

The second option requires the creation of a working table but it's more flexible. It allows updating only some of the rows, based on usual WHERE conditions. However, it can be used only if the records can be identified using the values loaded from the file (with or without the src column).

The working table (let's name it table_w) has the columns you want to load from the file and is created in advance.

When it's the time to update the rows imported from file_2 you do something like this:

  1. truncate the working table (just to be sure it doesn't contain any leftovers from a previous import);
  2. load the data from file into the working table;
  3. join the working table and table_1 and update the records of table_1 as needed;
  4. truncate the working table (cleanup of the current import).

The code:

# 1
TRUNCATE table_w;

# 2
LOAD DATA INFILE 'file.txt'
INTO TABLE table_w
    (column_1, column_2, column 42);       # etc

# 3
UPDATE table_1 t
INNER JOIN table_w w
    ON t.column_1 = w.column_1
    # AND t.src = 2                      # only if column_1 is not enough
SET t.column_2  = w.column_2,
    t.column_42 = w.column_42
# WHERE ... you can add extra conditions here, if needed

# 4
TRUNCATE TABLE table_w
Community
  • 1
  • 1
axiac
  • 68,258
  • 9
  • 99
  • 134
  • Thanks, adding separate column, which identifies the data source is easy and readable solution throughout my code! How simple and clever. – quapka Jun 01 '16 at 09:38