1

In a previous question the awesome amdixon was able to come up with a query to calculate the level of repetition of IPs.

I've adapted this to look at specific accounts using WHERE earning_account_id = ?:

SELECT MAX(repeat_factor)
FROM
(
SELECT earning_ip, count(*) / rc.row_count AS repeat_factor
FROM earnings
CROSS JOIN (SELECT count(*) AS row_count FROM earnings WHERE earning_account_id = ?) rc
WHERE earning_account_id = ?
GROUP BY earning_ip
) q

However, now I want to add an additional level of security.

I want to apply the same sort of query. But instead of limiting it to an earning_account_id, I want to limit it to any account grouping that has a sighting of a specific ip address.

This is so I can have better global detection of proxy spam, if they use multiple alt accounts.

Please note that I will no longer be limiting the query using WHERE earning_account_id = ?

So in other words if the ip_address was "45.55.80.86"

+--------------------+-------------+---------------------------+
| earning_account_id | earning_ip  | select row for repeat_factor query?|
+--------------------+-------------+---------------------------+
|                  1 | 45.55.80.86 | YES                       |
|                  1 | 45.55.80.86 | YES                       |
|                  2 | 1.22.83.65  | NO                        |
|                  2 | 91.15.76.37 | NO                        |
|                  3 | 45.55.80.86 | YES                       |
|                  4 | 61.25.76.37 | YES                       |
|                  4 | 1.22.83.65  | YES                       |
|                  4 | 45.55.80.86 | YES                       |
|                  5 | 61.25.76.37 | NO                        |
+--------------------+-------------+---------------------------+

The value to be returned would be the repeat_factor for all earnings of this ip, but ignoring all accounts that have never included this ip address.

In other words, what I'm trying to find out is:

"how repetitive is this IP address within all accounts, but looking only at the accounts where this IP address has been sighted?"

Community
  • 1
  • 1
Amy Neville
  • 10,067
  • 13
  • 58
  • 94
  • What does sighted mean? – Strawberry Jan 09 '16 at 09:18
  • Sighted means that the IP was used in a row containing that earning_account_id – Amy Neville Jan 09 '16 at 09:20
  • Basically I have a group of people cooperating to form a botnet of sorts. I want to identify them using my reapeat_factor algorithm. But if I apply the query to my entire table then they get lost in the haystack. If I target my query just to accounts where the IP has been sighted I get a much better measure of how repetitively this IP has been used.... – Amy Neville Jan 09 '16 at 09:25
  • Some douchebag downvoted me because I didn't accept his answer :( – Amy Neville Jan 09 '16 at 11:29
  • What would the desired result set look like? – Strawberry Jan 09 '16 at 12:57
  • @Strawberry I seem to remember your name from helping me in the past! Sorry but this time Zsolt seems to have beaten you to it :) – Amy Neville Jan 09 '16 at 14:38

2 Answers2

2

The rows to be selected can be obtained simply enough:

select e.*
from example e
join 
(select distinct earning_account_id
 from example
 where ip = '45.55.80.86') subq
on e.earning_account_id = subq.earning_account_id;

At this point, if it was a SQL Server database you could simply bundle this into a Common Table Expression (CTE) and use its alias instead of the two references to the table name in amdixon's query. Unfortunately MySQL doesn't provide such a luxury so we are confined to subqueries, each of which must have a unique alias - so a bit uglier but this does the job:

select max(repeat_factor)
from
(
select ip, count(*) / rc.row_count as repeat_factor
from
(select e.*
 from example e
 join 
 (select distinct earning_account_id
  from example
  where ip = '45.55.80.86') subq
 on e.earning_account_id = subq.earning_account_id) cte1
cross join ( select count(*) as row_count from 
(select e.*
 from example e
 join 
 (select distinct earning_account_id
  from example
  where ip = '45.55.80.86') subq
 on e.earning_account_id = subq.earning_account_id) cte2
) rc
group by ip
) q;

See SQL Fiddle Demo

Community
  • 1
  • 1
Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
1

Update

With an idea taken from How to get multiple counts with one SQL query? and the answer from @SteveChambers, we can simplify this even further.

SELECT sum(CASE WHEN earning_ip = ? THEN 1 ELSE 0 END) / count(*)
FROM earnings WHERE earning_account_id IN (
    SELECT DISTINCT earning_account_id FROM earnings WHERE earning_ip = ?
)

This also gives 0.6667 using the example IP 45.55.80.86.

I leave the original answer here as parts of that might be useful for another queries.

Original Answer

By modifying the sub-queries and working my way up, the following would return the IDs for a given IP.

SELECT earning_account_id
FROM earnings WHERE earning_ip = ?
GROUP BY earning_account_id

In case of the example where the IP is 45.55.80.86, the query would return 1, 3, 4.

Then count the occurrences of the given IP by IDs.

SELECT earning_account_id, count(earning_ip) AS occurrence
FROM earnings
WHERE earning_account_id IN (
    SELECT earning_account_id
    FROM earnings WHERE earning_ip = ?
    GROUP BY earning_account_id
) AND earning_ip = ?
GROUP BY earning_account_id

In case of the example, this would return 1 => 2, 3 => 1, 4 => 1

Then also count the number of all IPs for these IDs and join this with the previous result.

SELECT e.earning_account_id, count(e.earning_account_id) AS ip_count, o.occurrence
FROM earnings e
CROSS JOIN (
    SELECT earning_account_id, count(earning_ip) AS occurrence FROM earnings
    WHERE earning_account_id IN (
        SELECT earning_account_id FROM earnings WHERE earning_ip = ?
        GROUP BY earning_account_id
    ) AND earning_ip = ?
    GROUP BY earning_account_id
) o
WHERE e.earning_account_id = o.earning_account_id
GROUP BY e.earning_account_id

In case of the example, all IPs by account are 1 => 2, 3 => 1, 4 => 3.

Finally, divide the sum of all occurrences with the sum of all IPs in this subset of rows.

SELECT sum(q.occurrence) / sum(q.ip_count) FROM (
    SELECT e.earning_account_id, count(e.earning_account_id) AS ip_count, o.occurrence
    FROM earnings e
    CROSS JOIN (
        SELECT earning_account_id, count(earning_ip) AS occurrence FROM earnings
        WHERE earning_account_id IN (
            SELECT earning_account_id FROM earnings WHERE earning_ip = ?
            GROUP BY earning_account_id
        ) AND earning_ip = ?
        GROUP BY earning_account_id
    ) o
    WHERE e.earning_account_id = o.earning_account_id
    GROUP BY e.earning_account_id
) q

In case of the example, this would return 0.6667, which corresponds to 4 occurrences in the 6 rows marked as YES above.

Community
  • 1
  • 1
zsltg
  • 725
  • 5
  • 14
  • 23