3

We have an iPhone app that sends invoice data by each of our employees several times per day. When they are in low cell signal areas tickets can come in as duplicates, however they are assigned a unique 'job id' in the mysql database, so they're viewed as unique. I could exclude the job id and make the rest of the columns DISTINCT, which gives me the filtered rows I'm looking for (since literally every data point is identical except for the job id), however I need the job ID since it's the primary reference point for each invoice and is what I point to for: approvals, edits, etc.

So my question is, how can I filter out 'near' duplicate rows in my query, while still pulling in the job id for each ticket?

The current query is below:

SELECT * FROM jobs, users
WHERE jobs.job_csuper = users.user_id
AND users.user_email = '".$login."'
AND jobs.job_approverid1 = '0'

Thanks for looking into it!

Edit (examples provided): This is what I meant by 'near duplicate'

Job_ID - Job_title - Job_user - Job_time - Job_date
2345 - Worked on circuits - John Smith - 1.50 - 2013-01-01
2344 - Worked on circuits - John Smith - 1.50 - 2013-01-01
2343 - Worked on circuits - John Smith - 1.50 - 2013-01-01

So everything is identical except for the Job_ID column.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786

3 Answers3

1

You want a group by:

SELECT *
FROM jobs, users
WHERE jobs.job_csuper = users.user_id
AND users.user_email = '".$login."'
AND jobs.job_approverid1 = '0'
group by <all fields from jobs except jobid>

I think the final query should look something like this:

select min(Job_ID) as JobId, Job_title, user.name as Job_user, Job_time, Job_date
FROM jobs join users
     on jobs.job_csuper = users.user_id
WHERE jusers.user_email = '".$login."' AND jobs.job_approverid1 = '0'
group by Job_title, user.name, Job_time, Job_date

(This uses ANSI syntax for joins and is explicit about the fields coming back.)

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1
  1. It's better to prevent the double submission.
  2. Given that you cannot prevent the double submission...

I would query like this:

select
   min(Job_ID)          as real_job_id
  ,count(Job_ID)        as num_dup_job_ids
  ,group_concat(Job_ID) as all_dup_job_ids
  ,j.Job_title, j.Job_user, j.Job_time, j.Job_date
from
  jobs j
  inner join users u on u.user_id = j.job_csuper
where
  whatever_else
group by
  j.Job_title, j.Job_user, j.Job_time, j.Job_date

That includes more than you explicitly asked for. But it's probably good to be reminded of how many dups you have, and it gives you easy access to the duplicate id info when you need it.

mdahlman
  • 9,204
  • 4
  • 44
  • 72
0

How about creating a hash for each row and comparing them:

`SHA1(concat_ws(field1, field2, field3, ...)) AS jobhash`
paul
  • 21,653
  • 1
  • 53
  • 54