
I have a DataTable with 500K+ records, and I would like to update a field in the database for each of the records contained in the DataTable.

At the moment, I am getting a list of all the values in the ID column of the DataTable and creating an UPDATE statement with a SQL IN clause over those IDs, like so:

string sql = "UPDATE my_table SET my_field = @timestamp WHERE id IN (" + String.Join(", ", myDataTable.AsEnumerable().Select(r => r.Field<Int64>("id")).ToList().ToArray()) + ")";

This works for a few records; however, due to the large number of IDs, I am now getting an error saying that the MySQL packet size is not large enough for this query.

What can be done, besides increasing the packet size, which seems to be a bad approach?

Horace
  • How do you obtain the list of IDs for the IN clause? Is it possible to get the same result using a query? – ScaisEdge Mar 28 '17 at 07:23
  • I'd start by not building up your query in this fashion. Since you've tagged this C#, I would assume you have access to ADO.NET, which means you can parameterize your query. – Tieson T. Mar 28 '17 at 07:24
  • I second Tieson T. comment. Do you want to update all records in that table? Is the list of IDs to update in another table? If so, use a join to select which records to update. – Sloan Thrasher Mar 28 '17 at 07:26
  • I do not want to update all records in the table, but only the ones contained in the datatable. – Horace Mar 28 '17 at 07:27
  • @TiesonT. - As far as I know, MySQL does not support "IN" statements with parameterized queries, which is why I am using concatenation. – Horace Mar 28 '17 at 07:28
  • It is not that MySQL doesn't support parameters in an IN clause; rather, you would have to create a parameter for each IN value. However, that is not the root of your problem. – Steve Mar 28 '17 at 07:32
  • How did you load those IDs? Are they already somewhere in the database, and could you create a temporary table on the database for them instead of loading them into memory? – Steve Mar 28 '17 at 07:34
  • Have you tried increasing the packet size in MySQL? http://stackoverflow.com/questions/8062496/how-to-change-max-allowed-packet-size – S.N Mar 28 '17 at 07:34
  • I can create a temporary table, yes. However, I have never done that before, as I do not like the idea very much. – Horace Mar 28 '17 at 07:35
  • @Nair I have not, as I do not think increasing the size is a reliable solution, since I cannot predict how much data I will have in the data table – Horace Mar 28 '17 at 07:36
  • @Horace In that scenario, you should look at supplying the data in chunks that fit the current packet size. Of course, you would then have to send each chunk individually rather than as a single join. You may also look at how you could partition your enumerable collection. – S.N Mar 28 '17 at 07:58
  • @Nair could you give a tangible example as an answer? – Horace Mar 28 '17 at 08:32
  • @Horace I am not going to code it for you, but a definite head start would be to partition your 'Select(r => r.Field("id")).ToList().ToArray())' expression. Google it; you will find plenty of examples of partitioning data or building chunks. – S.N Mar 28 '17 at 09:48

3 Answers


For a similar task I once used the following approach:

  1. Create an additional table with columns session_id (GUID) and record_id (bigint).
  2. Before starting the update operation, generate a unique identifier (session_id) for that run.
  3. Insert all the IDs you have to update into this table, along with the generated session id.
  4. UPDATE the primary table using an INNER JOIN to this table, filtering on that particular session id.
  5. DELETE all records from the session table for that session id.
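
A minimal C# sketch of those steps, assuming MySql.Data (MySqlConnection/MySqlCommand), a pre-created helper table update_session(session_id CHAR(36), record_id BIGINT), a List<long> named ids pulled from the DataTable, and existing connectionString and timestamp variables; all names here are illustrative:

// Sketch only: table and column names are illustrative, error handling omitted.
string sessionId = Guid.NewGuid().ToString();   // step 2: unique id for this run

using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();

    // Step 3: insert the ids tagged with the session id, in batches of 1000
    // so each statement stays well under max_allowed_packet.
    for (int i = 0; i < ids.Count; i += 1000)
    {
        string values = String.Join(", ",
            ids.Skip(i).Take(1000).Select(id => "('" + sessionId + "', " + id + ")"));
        using (var insert = new MySqlCommand(
            "INSERT INTO update_session (session_id, record_id) VALUES " + values, conn))
        {
            insert.ExecuteNonQuery();
        }
    }

    // Step 4: one UPDATE that joins the primary table to this session's rows.
    using (var update = new MySqlCommand(
        "UPDATE my_table m " +
        "JOIN update_session s ON s.record_id = m.id AND s.session_id = @session " +
        "SET m.my_field = @timestamp", conn))
    {
        update.Parameters.AddWithValue("@session", sessionId);
        update.Parameters.AddWithValue("@timestamp", timestamp);
        update.ExecuteNonQuery();
    }

    // Step 5: clean up this session's rows.
    using (var delete = new MySqlCommand(
        "DELETE FROM update_session WHERE session_id = @session", conn))
    {
        delete.Parameters.AddWithValue("@session", sessionId);
        delete.ExecuteNonQuery();
    }
}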

Once that is working, you can start tuning performance:

  • turn logging off for the session table, since it doesn't contain any essential data;
  • experiment with creating a temp table each time you need one instead of using a static table (in theory this simplifies clean-up, since DROP TABLE should be much faster than DELETE FROM; in my case, though, creating the table was too slow, so I preferred a permanent table);
  • use bigint instead of GUIDs to identify sessions, as they compare faster;
  • use a COPY FROM (PostgreSQL) or BULK INSERT (MS SQL) statement instead of a series of INSERT INTOs; I didn't find anything similar for MySQL, though.
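
For MySQL, the closest counterpart appears to be LOAD DATA INFILE, which Connector/NET exposes as MySqlBulkLoader. A minimal sketch, assuming the "session_id,record_id" rows have already been written to a local CSV file and conn is an open MySqlConnection (the file and table names are hypothetical):

// Sketch: bulk-load "session_id,record_id" rows from a local CSV into the
// session table via LOAD DATA LOCAL INFILE (MySqlBulkLoader ships with MySql.Data).
var loader = new MySqlBulkLoader(conn)
{
    TableName = "update_session",
    FileName = "ids.csv",        // hypothetical file, one "session_id,record_id" row per line
    FieldTerminator = ",",
    LineTerminator = "\n",
    Local = true                 // requires local_infile to be enabled on the server
};
int inserted = loader.Load();    // returns the number of rows loaded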

I don't know what will suit you best )

Aleksei

I don't know whether it is possible to increase the packet size in MySQL, but there is a simple solution to your problem: divide your single UPDATE query into multiple UPDATE queries, each covering one group of IDs.
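
A rough sketch of that idea, assuming MySql.Data and reusing the myDataTable, connectionString, and timestamp names from the question; the batch size of 10,000 is an arbitrary choice that should stay well under max_allowed_packet (each id could also be turned into its own parameter, as Steve suggested in the comments):

var ids = myDataTable.AsEnumerable().Select(r => r.Field<Int64>("id")).ToList();

using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    for (int i = 0; i < ids.Count; i += 10000)
    {
        // Build one bounded IN list per batch instead of a single huge one.
        string inList = String.Join(", ", ids.Skip(i).Take(10000));
        using (var cmd = new MySqlCommand(
            "UPDATE my_table SET my_field = @timestamp WHERE id IN (" + inList + ")", conn))
        {
            cmd.Parameters.AddWithValue("@timestamp", timestamp);
            cmd.ExecuteNonQuery();
        }
    }
}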

Abdullah Dibas
  • Is there a clean way to do this? Other than explicitly iterating for a fixed number of times? – Horace Mar 28 '17 at 07:34
  • Mmm, I don't think there is another way, but you could check the maximum number of IDs at which your query still passes, then divide your IDs into groups based on that number. – Abdullah Dibas Mar 28 '17 at 07:40

Assuming id_list_table contains a list of IDs to update in my_table:

UPDATE my_table
JOIN id_list_table
ON my_table.id = id_list_table.my_table_id
SET my_field = @timestamp
Sloan Thrasher
  • It is a data table rather than a database table though – Horace Mar 28 '17 at 07:34
  • Then I'd suggest an approach like Abdullah Dibas mentions. Use a smaller number of record IDs and multiple update statements. Just loop through with X number of IDs per loop until you get through the whole list. – Sloan Thrasher Mar 28 '17 at 07:36