How can I deal with large data for this complicated scenario ? Recursive CTE and Pandas are not working?

Question

My scenario:

User A is (fraudster).
User B is not (fraudster). However, the system will not allow user B to do any action. Because B and A are using the same Phone Number(Shared attribute with Fraud User). (1 layer).
User D is not (fraudster). But D is using the same Deviceid with B and B is sharing attribute with fraud User. Then block User D as well. In this case, there are 2 layers. D compare with B, B compares with A.

Recursive CTE (Error when I increased the data to 1000 rows)

Code:

with recursive cte as (
      select ID, Email, MobileNo, DeviceId, IPAddress, id as tracking
      from tableuser
      where isfraudsterstatus = 1
      union all
      select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress , concat_ws(',', cte.tracking, u.id)
      from cte join
           tableuser u
           on u.email = cte.email or
              u.mobileno = cte.mobileno or
              u.deviceid = cte.deviceid or 
              u.ipaddress = cte.ipaddress
      where find_in_set(u.id, cte.tracking) = 0
     )
select *
from cte;

Error:

Using Pandas:(error when i increased the data from 1000rows to 500000rows)

Code:

import mysql.connector
from mysql.connector import Error
import pandas as pd
#DATABASE CONNECTION
##
try:
    connection = mysql.connector.connect(host='localhost',
                                         database='database',
                                         user='root',
                                         password='')
    cursor = connection.cursor()
    #Create Dataframe (temporary data)
    #df = pd.read_sql("select * from MOCK_DATA",con=connection)
    df = pd.read_sql("select * from tableuser",con=connection)
##
    def expand_fraud(no_fraud, fraud, col_name):
        t = pd.merge(no_fraud, fraud, on=col_name)
        if len(t):
            df.loc[df.ID.isin(t.ID_x), "IsFraudsterStatus"] = 1
            return True
        return False

    while True:
        added_fraud = False
        fraud = df[df.IsFraudsterStatus == 1]
        no_fraud = df[df.IsFraudsterStatus == 0]
        added_fraud |= expand_fraud(no_fraud, fraud, "DeviceId")
        added_fraud |= expand_fraud(no_fraud, fraud, "Email")
        added_fraud |= expand_fraud(no_fraud, fraud, "MobileNo")
        if not added_fraud:
            break
    print(df)

    Id_list = df.values.tolist()

except Error as e:
    print("Error reading data from MySQL table", e)
finally:
    if (connection.is_connected()):
        connection.close()
        cursor.close()
        print("MySQL connection is closed")

Error

How to deal with this ? Is there any alternative way to do that ?

Your panda code and your mysql code have different problems (and solutions) and are basically only related because they are trying to solve the same problem (and both don't work), so you should split them into two different question. Also, 500k rows is not "bigdata", so you may want to remove that tag. — Solarflare, Jul 17 '20 at 10:57
@Solarflare Noted. I will split it in another question. For bigdata's tag has been removed. Thanks for the correction. — ABDULAZIZ NOREDIN QADMOR, Jul 17 '20 at 12:13

Solarflare · Answer 1 · 2020-07-17T13:09:39.233

The problem here (for the MySQL part) seems to be your stop condition. You keep track of the list of ids to prevent infinite loops (e.g. A,B,C,D). Unfortunately, that column will have the datatype of "id", maybe varchar(10), which effectively means your track list has a limited length.

If you reach that depth, you should get an error message:

Error Code: 1406. Data too long for column 'tracking' at row 1

Also unfortunately, you probably suppressed that error by disabling strict mode, e.g. by using sql_mode='', which is a common method to get rid of some problems (most infamously group by-errors) instead of fixing the code, but has has the side effect that you may get invalid data.

In your case, this has the effect that the tracking value stops tracking (without throwing an error), e.g. with varchar(10) might end up as A,B,C,D,E,, unable to add F to the list, so it keeps adding F to the result set, resulting in an infinite loop.

MySQL actually has a protection against infinite loops, so you may get

Error Code: 3636. Recursive query aborted after 1001 iterations. 
Try increasing @@cte_max_recursion_depth to a larger value.

but it only protects you in specific situations, as if you add more than one row per iteration, which each then add more than one row per iteration, you'll reach your resource limits (or a timeout) well before 2^1000 rows in your result set.

How to fix it?

If you do not actually need the information from the tracker (and since your panda code doesn't do it, it seems you only added it to prevent loops), you could let MySQL handle duplicates by using union distinct:

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress
  from tableuser
  where isfraudsterstatus = 1
  union distinct  -- distinct!
  select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress
  from cte join tableuser u
  on u.email = cte.email or u.mobileno = cte.mobileno 
     or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select * from cte;

If you want, you could also extend that to track the "original fraudster". If you have several fraudsters per chain (e.g. A and B both are marked as fraudster, while A has the same MobileNo as B), this may result in duplicates, but you can get rid of those again with a group by:

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress, id as original_fraudster
  from tableuser
  where isfraudsterstatus = 1
  union distinct
  select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress,
     cte.original_fraudster
  from cte join tableuser u
  on u.email = cte.email or u.mobileno = cte.mobileno 
     or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select ID, Email, MobileNo, DeviceId, IPAddress, 
   min(original_fraudster) as original_fraudster
from cte
group by ID, Email, MobileNo, DeviceId, IPAddress;

Technically, you can also avoid the original problem (of the limited lenght from the "id" column) by explicitly defining your own length, e.g.

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress, 
     cast(id as char(1000)) as tracking

although this just shifts the problem to a time in the future when this lenght may not be long enough either, but you may be able to judge if that is a potential problem or not.

Thank you so much, I got it now. I will test and let you know if it works. — ABDULAZIZ NOREDIN QADMOR, Jul 17 '20 at 12:14

How can I deal with large data for this complicated scenario ? Recursive CTE and Pandas are not working?

1 Answers1