Excluding 'near' duplicates from a mysql query

Question

We have an iPhone app that each of our employees uses to send invoice data several times per day. When they are in areas with low cell signal, tickets can come in as duplicates; however, each one is assigned a unique 'job id' in the MySQL database, so they're treated as unique. I could exclude the job id and make the rest of the columns DISTINCT, which gives me the filtered rows I'm looking for (since literally every data point is identical except for the job id). However, I need the job ID since it's the primary reference point for each invoice and is what I point to for approvals, edits, etc.

So my question is, how can I filter out 'near' duplicate rows in my query, while still pulling in the job id for each ticket?

The current query is below:

SELECT * FROM jobs, users
WHERE jobs.job_csuper = users.user_id
AND users.user_email = '".$login."'
AND jobs.job_approverid1 = '0'

Thanks for looking into it!

Edit (examples provided): This is what I meant by 'near duplicate'

Job_ID - Job_title - Job_user - Job_time - Job_date
2345 - Worked on circuits - John Smith - 1.50 - 2013-01-01
2344 - Worked on circuits - John Smith - 1.50 - 2013-01-01
2343 - Worked on circuits - John Smith - 1.50 - 2013-01-01

So everything is identical except for the Job_ID column.

How do you know if they are near duplicates? Can you give sample records in tabular format, and also your desired result? — John Woo, Feb 06 '13 at 16:24
Step 1 would be to specify exactly what you mean by "near duplicate". — Dan Bracuk, Feb 06 '13 at 16:25
Something similar to this: http://stackoverflow.com/a/1895149/128217 — zimdanen, Feb 06 '13 at 16:25
Also, at the commenters - OP specified that "near duplicate" means everything matches other than job id. — zimdanen, Feb 06 '13 at 16:26
Looks more reliable to prevent double submission in the first place. I know nothing about iPhones but Googling for **Nonce** might give you some ideas. — Álvaro González, Feb 06 '13 at 16:27

score 1 · Accepted Answer · answered Feb 06 '13 at 16:35

You want a group by:

SELECT *
FROM jobs, users
WHERE jobs.job_csuper = users.user_id
AND users.user_email = '".$login."'
AND jobs.job_approverid1 = '0'
group by <all fields from jobs except jobid>

I think the final query should look something like this:

select min(Job_ID) as JobId, Job_title, users.name as Job_user, Job_time, Job_date
FROM jobs join users
     on jobs.job_csuper = users.user_id
WHERE users.user_email = '".$login."' AND jobs.job_approverid1 = '0'
group by Job_title, users.name, Job_time, Job_date

(This uses ANSI syntax for joins and is explicit about the fields coming back.)
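
To see the effect of the GROUP BY on the sample rows from the question, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for MySQL, and it assumes a simplified `jobs` table that holds the user name directly (no join to `users`). The fourth row is a made-up, non-duplicate ticket added only for contrast.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table layout is an assumption
# simplified from the example rows in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, job_title TEXT, "
    "job_user TEXT, job_time REAL, job_date TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        (2343, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2344, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2345, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2346, "Replaced breaker", "John Smith", 0.75, "2013-01-01"),  # hypothetical extra row
    ],
)

# One row per distinct ticket, keeping the lowest job_id in each group.
deduped = conn.execute(
    "SELECT MIN(job_id) AS job_id, job_title, job_user, job_time, job_date "
    "FROM jobs "
    "GROUP BY job_title, job_user, job_time, job_date "
    "ORDER BY MIN(job_id)"
).fetchall()

for row in deduped:
    print(row)
```

The three identical tickets collapse into a single row that carries job id 2343, so there is still one stable id to reference for approvals and edits.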

mdahlman · Answer 2 · 2013-02-08T06:43:05.157

It's better to prevent the double submission.
Given that you cannot prevent the double submission...

I would query like this:

select
   min(Job_ID)          as real_job_id
  ,count(Job_ID)        as num_dup_job_ids
  ,group_concat(Job_ID) as all_dup_job_ids
  ,j.Job_title, j.Job_user, j.Job_time, j.Job_date
from
  jobs j
  inner join users u on u.user_id = j.job_csuper
where
  whatever_else
group by
  j.Job_title, j.Job_user, j.Job_time, j.Job_date

That includes more than you explicitly asked for. But it's probably good to be reminded of how many dups you have, and it gives you easy access to the duplicate id info when you need it.
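
As a rough illustration of how the extra columns can be used, the sketch below runs the same kind of query against an in-memory SQLite database (a stand-in for MySQL; SQLite also provides `GROUP_CONCAT`) and splits the concatenated ids into the surplus ones. The simplified `jobs` table, the fourth sample row, and the `surplus_by_job` name are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, job_title TEXT, "
    "job_user TEXT, job_time REAL, job_date TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        (2343, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2344, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2345, "Worked on circuits", "John Smith", 1.50, "2013-01-01"),
        (2346, "Replaced breaker", "John Smith", 0.75, "2013-01-01"),  # hypothetical extra row
    ],
)

rows = conn.execute(
    "SELECT MIN(job_id) AS real_job_id, "
    "       COUNT(job_id) AS num_dup_job_ids, "
    "       GROUP_CONCAT(job_id) AS all_dup_job_ids "
    "FROM jobs "
    "GROUP BY job_title, job_user, job_time, job_date"
).fetchall()

# Map each kept id to the ids of its duplicates (empty list if none).
surplus_by_job = {}
for real_id, num_ids, all_ids in rows:
    ids = [int(i) for i in all_ids.split(",")]
    surplus_by_job[real_id] = sorted(i for i in ids if i != real_id)

print(surplus_by_job)
```

The surplus ids are the rows that could later be flagged or cleaned up, while the kept id remains the reference for the invoice.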

score 0 · Answer 3 · answered Feb 06 '13 at 16:30

How about creating a hash for each row and comparing them:

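`SHA1(CONCAT_WS('|', field1, field2, field3, ...)) AS jobhash`

(The first argument to `CONCAT_WS` is the separator, so it has to be a literal such as `'|'` rather than one of the fields.)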
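
The same idea can be sketched outside the database. The Python snippet below is only an illustration: `job_hash` is a hypothetical helper, the rows are the sample data from the question plus one made-up ticket, and the digests are not guaranteed to match what MySQL would produce for the same columns. It also shows why a separator between the fields matters.

```python
import hashlib

def job_hash(*fields, sep="|"):
    """SHA-1 hex digest of the fields joined by a separator (hypothetical helper)."""
    joined = sep.join(str(f) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

rows = [
    (2343, "Worked on circuits", "John Smith", "1.50", "2013-01-01"),
    (2344, "Worked on circuits", "John Smith", "1.50", "2013-01-01"),
    (2345, "Worked on circuits", "John Smith", "1.50", "2013-01-01"),
    (2346, "Replaced breaker", "John Smith", "0.75", "2013-01-01"),  # hypothetical extra row
]

# Keep the first job id seen for each fingerprint of the non-id columns.
first_by_hash = {}
for job_id, *rest in rows:
    first_by_hash.setdefault(job_hash(*rest), job_id)

kept_ids = sorted(first_by_hash.values())
print(kept_ids)

# Without a separator, different field splits produce the same string.
print(job_hash("ab", "c") == job_hash("a", "bc"))                  # separator keeps them apart
print(job_hash("ab", "c", sep="") == job_hash("a", "bc", sep=""))  # no separator: they collide
```

Rows with the same hash are treated as duplicates of each other, and only the first job id per hash is kept.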

paul

21,653
1
53
54

Excluding 'near' duplicates from a mysql query

3 Answers3