CSV and ID problems

Question

I have a database with employees in it.

Since my employer finds it easy to input the data in a CSV file, I wrote a program that truncates my database and inserts the CSV data in my DB.

Employee: [ID, LAST_NAME, NAME, EMAIL, REMARKS, ...]

I use the field ID, (which is an auto_increment value) to make all my employee's unique. This works fine, however recently my employer has asked me too to include a functionality to mark favorites.

The only thing which makes my employees unique is the ID key thus when I update the new CSV file the ID's go all broke and are shifted since I had to truncate my database and the favorites don't match up any more.

An example of what I mean (CSV file):

0, Carlton, John,  john@gmail.com,    "Great worker",
1, Awsome, Dude,  awsomeDud@aol.com, "Not so great",
2, Random, Randy, rr@hotmail.com,    "idk"

Suppose somebody deletes the record with ID 1.
And my favorite was 1, the csv file however will now look like this:

0, Carlton, John,  john@gmail.com,    "Great worker",
1, Random, Randy, rr@hotmail.com,    "idk"

It points to the wrong person.
Keep in mind that the ID's I wrote are not part of the csv file itself
they are the auto_increment value.

I have given this problem a lot of thought and I cannot seem to find a simple way to accomplish this.

Any help would be appreciated.

Notes:

Emails are not unique, nor required.
The only real unique field is the ID field.

Can you post short example files ? I don't understand what you mean by 'IDs go all broke'. — Manuel Arwed Schmidt, Jun 22 '14 at 20:17
"Keep in mind that the ID's I wrote are not part of the csv file itself" --- so do them then — zerkms, Jun 22 '14 at 20:31
Well, you need to make the ID column persistent somehow. I suspect you don't export the sql table id's to the csv but just count them up in your php script when doing an export? MySQL auto-incremented id's never show up twice unless you explicitly tell mysql to do so by doing a full table flush. — Manuel Arwed Schmidt, Jun 22 '14 at 20:31
Yes the problem lies in the ID binding. And yes I flush the entire table, if you remove 1 row from the csv it would think the rest are just edits. If I knew how to fix that I woulden't be asking. — user3152069, Jun 22 '14 at 20:34
I see, the csv is the input data. Then you need to build up a hash on the data or something, make it an extra column and use that as an id. I wouldn't use the email alone as a unique feature, because it might be empty for multiple entries someday. — Manuel Arwed Schmidt, Jun 22 '14 at 20:37
I don't see the ID field being unique in the setup you described. Maybe there's a way to make Excel or whatever is used to make the csv give some unique id. You can't point to favorites without having atleast some consistent reference base (most likely an ID). — Manuel Arwed Schmidt, Jun 22 '14 at 20:40
Well, if there's no way of uniquely identifying an employee, then there's no way to solve this. If two employees are called "John Smith" and there's no other field that could be guaranteed unique for them, then there's no solution. (And using names as a unique ID sucks for other reasons, too, because they change over time, e.g. when people get married.) This is why companies have an employee ID for people. The only real solution is to assign a unique ID to each employee in the source CSV and make sure it's kept consistent by whatever's generating it (be it manual or automatic.) — Matt Gibson, Jun 25 '14 at 16:04
To draw smarter people in then myself I would prob need to make the bounty bigger xD. — user3152069, Jun 25 '14 at 17:16
Do you have access to change the employee table? Can you add another column? — Jenn, Jun 25 '14 at 21:22
This isn't about the bounty. You're assuming your lack of data can be solved by adding more code, whereas actually your lack of data can only truly be solved by adding more data. — Matt Gibson, Jun 26 '14 at 06:38
Why you are not making unique column by concatenating name with email. — Syed Daniyal Asif, Jun 26 '14 at 08:36
Why do you not create a GUI so you have never have to clear the data anymore and make your boss happy? It's much easier and faster than a CSV. — GuyT, Jun 26 '14 at 12:48
You could try changing the way your delete works by adding another column to the table called 'Active'. Rather than deleting the entry, you can set active = 0. Then, you will have to change all of your queries to ignore anything where active = 0. However, you will not have the issue with the ids changing because you will never actually be removing entries from the table. This looks the same as a delete to the user because they will never see any of the inactive records. — Jenn, Jun 26 '14 at 14:28
i think there is a query to update table to match to another table base on ID'S http://stackoverflow.com/questions/3742893/mysql-update-all-rows-in-a-table-matching-results-of-another-query http://stackoverflow.com/questions/5727827/update-one-mysql-table-with-values-from-another — rm_beginners, Jul 02 '14 at 03:25
if there are 3 records with ids 1,2,3 in DB(id field is AUTO_INCREMENT) and we delete 3 and add new it will assign 4 to new record where is the issue of changing the id or data shift here? — Maulik Shah, Jul 02 '14 at 08:59
Questions: Which DB engine are you using? When you say truncate do you mean delete all content? Do you implement all updates to the database table by deleting the table and importing a new updated CSV? Are you tracking favorites in a separate table? The answers are varied because your question is not well defined. — Doug Coburn, Jul 02 '14 at 16:07

score 1 · Answer 1 · answered Jul 02 '14 at 14:35

Solution 1 (easiest)

Have an int is_favorite column in your database containing 1 or 0, with a default value of 0 (meaning is not a favorite). Then ask your client to slightly change the format of the csv file as follows:

Employee: [ID, LAST_NAME, NAME, EMAIL, REMARKS, FAVORITE, ...]

Example CSV:

0, Carlton, John,  john@gmail.com,   "Great worker", 1
1, Awsome, Dude,  awsomeDud@aol.com, "Not so great", 0 
2, Random, Randy, rr@hotmail.com,    "idk"

When you process the CSV file, depending on the FAVORITE column just set the same value in the database. This will eliminate the problem with the mismatched favorites. Unfortunately, if in the near feature, the client requires new features which depend on the favorites, you might have the same issue again.

Solution 2 (best)

Discuss a more mature solution with your client pointing out the current CSV solution is no longer a valid option due to the issue with matching the CSV users with the appropriate sub features (i.e. favorites)

score 0 · Answer 2 · edited May 23 '17 at 12:10

0

A possible solution would be to never truncate your table. Ever. Find out what makes the employees unique. E.g. EMAIL.

Then when you parse the next CSV's, you don't simply INSERT the employees. You update the current ones and insert the new ones.

This way, your IDs always stay the same (which they should). I would have used something like this:

IF EXISTS (SELECT 1 FROM [User] WHERE [Email] = @UsersEmail)
BEGIN
    UPDATE [User]
    SET [Name] = @NewName
    WHERE [Email] = @UsersEmail
END
ELSE
BEGIN
    INSERT INTO [User] ([Email], [Name]) VALUES
    (@UsersEmail, @NewName)
END

But since you've tagged it PHP, I'm guessing you're using MySQL. Which can do it differently (from here):

INSERT INTO subs
  (subs_name, subs_email, subs_birthday)
VALUES
  (?, ?, ?)
ON DUPLICATE KEY UPDATE
  subs_name     = VALUES(subs_name),
  subs_birthday = VALUES(subs_birthday)

edited May 23 '17 at 12:10

Community

1
1

answered Jun 22 '14 at 20:31

Trafz

636
3
13

Unfortunately, you're looking for something to make those rows unique. That's where the solution lies. Either see if the one who gives you the CSV can add more data to it (hopefully something there will be unique, e.g. CPR number or something), or you'll simply need to assume that e.g. a combination of `NAME`+`LAST_NAME`+`EMAIL` is unique. That's called a composite ID. – Trafz Jun 22 '14 at 20:44
2

Even so, that's the only way to solve it. If that's not possible, you need say that the requested feature isn't possible until THEY provide some unique IDs for their employees. IDs which don't ever change for each person. – Trafz Jun 22 '14 at 20:48

score 0 · Answer 3 · answered Jun 27 '14 at 09:59

I would not truncate the table. I would then upload the csv into a temporary table. If the same ID is in both tables, do an update. If it is only in the old version, delete it (deleting out favorites as well for that ID), or, perhaps better, have a flag on the employees table that deactivates the row. If it is only in the new version, insert everything except the ID (which will probably be an empty string anyway). Then you can delete the temporary table.

If you want to be paranoid, you can double check names or emails and if you find a mismatch, flag them without updating. That would cause a manual operation if someone changed their name, but it would also save you the trouble if someone messed up your id numbers.

score 0 · Answer 4 · answered Jun 27 '14 at 15:07

The simple and clean way to solve this would be to find a way to recognise unique employees on the flat data.

Is there no other unique identifier that could be added to the csv file? For example, a windows login name? A company employee No? Something that would be static.

That way it's simple:

1, Don't truncate.

2, If Windows LoginID / EmpNo exists, update.

3, If not, add.

Also I'm concerned that your "favourites" table is clearly not using referential integrity. It should have a FK pointing to your Employee.ID; preventing you accidentally deleting an employee that was marked as a favourite, amongst other things.

A messier, much less bullet proof way, would be to mark your favourites based on your employee names rather than IDs. There are obvious draw backs to this approach, so use as last resort.

score 0 · Answer 5 · answered Jun 28 '14 at 16:25

You should never use the ID to identify a given user for the reasons you described in the question.

You could create a new reference ID field based on what you already have and create a unique identifier by chaining the required fields as a single string and then calculating the MD5 hash for example.

I have a question (sorry but I can't comment - rep): your employer adds only new employees via CSV file or even edit existing ones?

If only new employees are added you don't need to rebuild the table from scratch and you can make sure that your program generates a unique reference ID (that will remain unchanged) before inserting data into the db. Also your program can handle the editing of the employee, instead of changing data from CSV, leaving reference ID untouched.

This way all the fields like name, email, etc. can be edited and the link to favourites will stay correct. In that case the reference ID can also be calculated using not only data on the CSV but other like creation timestamp.

score 0 · Answer 6 · answered Jun 28 '14 at 16:40

0

You could create MD5 hash from name , email and the comment , save and use that as unique identifier .

Make sure you store MD5 hash as binary

answered Jun 28 '14 at 16:40

Golu

414
5
16

score 0 · Answer 7 · answered Jun 30 '14 at 12:05

Can you modify database? If you can, add another field that you can call favourite. Set it to simple enum (1,0) and set 1 for favourites, 0 for others. So when, you truncate database, you'll still have your favourites by those fields. Of course if you have multi-level favourites, don't set field to enum, set it to something else, more suitable for you.

score 0 · Answer 8 · answered Jul 01 '14 at 16:03

One solution is that the database becomes the defacto 'source' for IDs.

After the initial import, the next time your boss wants to update the file, create a CSV FROM the database (with the ID's intact) and ask your boss to update that and return it.

You could ask him to add new rows to the bottom of the file and leave out the ID.

Any row in the new spreadsheet without an ID is a new record. An extra field at the end of the row could be used by the boss to indicate rows to be deleted.

Repeat this process the next time the boss wants to update the file.

score 0 · Answer 9 · answered Jul 02 '14 at 05:12

0

Add an extra field to your database table as well as to the CSV file named something like "EmployeeID" which should be unique for all employees.

answered Jul 02 '14 at 05:12

Raouf Athar

1,803
2
16
30

CSV and ID problems

9 Answers9