Count Duplicates Values within a time interval in PySpark

Question

I am working on a PySpark job with a large data in below format.

ID-1234567  iplong  agent   partner client  country timestamp   category    reference

I need to find the average amount of duplicate records based on columns 2(iplong), 3(agent), 5(client), 6(country), 9(reference) within one minute time intervals for each partner.

I understand that I need to

Divide records into one-minute intervals.
Map everything by partner
Group everything by partner
Reduce each interval by count of total records and by count of distinct records and take difference to get amount of duplicate records (Also need to define a function to compare two records only with the values of 2(iplong), 3(agent), 5(client), 6(country), 9(reference) columns.)
Add all the partner from all the intervals and their duplicate counts together. And divide by the count of their appearance.

I understand this process but not the exact implementation in pyspark.

Can someone please help me with the implementation of any of the above steps in pyspark.

Sample Data:

9794474 1000460030  Samsung_S5233   dv4gs   dswae   in  2012-03-08 00:00:00 mg  riflql2a0yv8xoa9sq0recx4x
9794471 3386480130  Nokia_C3-00 duq7h   dr75h   py  2012-03-08 00:00:00 co  
9794468 1907980030  Nokia_5233  dv6i3   ds3xq   vn  2012-03-08 00:00:00 es  gp53lqr9njqd6z2ap5d364sip
9794467 1791990020  MAUI    duxto   dvb8g   in  2012-03-08 00:00:00 ad  
9794466 1791000060  Nokia_3110c dusg4   dvb8g   in  2012-03-08 00:00:00 ad  
9794477 1353590020  Blackberry_9300 du6dt   dtr0u   es  2012-03-08 00:00:00 es  h5njsswvxorsau9u8fxh0e9se
9794478 1402290050  NokiaC6-01.3    dusnc   dsgcn   ru  2012-03-08 00:00:00 mc  
9794481 1848749950  Nokia_C3-00 dvry3   dr6sg   th  2012-03-08 00:00:01 mc  oj0rekb51pvirnjuqjt10zn4b

Update:

So far I've tried putting the whole data into MySQL and reading from it. But it takes too much time in read operations.

For mapreduce approach, I've tried different smaller things. But don't understand how am I going to approach it further in code. Hence, not able to move forward with one approach.

clicks_rdd = sc.parallelize(list(clicks_reader)[1:]) 
minwise_clicks = clicks_rdd.groupby(clicks_rdd.index.map(lambda t: t.minute)) # Didn't work
clicks_mapped_publishers = clicks_rdd.map(lambda x : (x.pop(3), x)) # Works fine but need the records divided into minute intervals first.

Have also tried some other things here and there. But nothing solid.

Following is the first 25 records of my original dataset file.

id,iplong,agent,partnerid,cid,cntr,timeat,category,referer
9794476,1071324855,SonyEricsson_K70,dv3va,dsfag,us,2012-03-08 00:00:00.0,ad,
9794474,1000461055,Samsung_S5233,dv4gs,dswae,in,2012-03-08 00:00:00.0,mg,riflql2a0yv8xoa9sq0recx4x
9794471,3386484265,Nokia_C3-00,duq7h,dr75h,py,2012-03-08 00:00:00.0,co,
9794468,1907981997,Nokia_5233,dv6i3,ds3xq,vn,2012-03-08 00:00:00.0,es,gp53lqr9njqd6z2ap5d364sip
9794467,1791989091,MAUI,duxto,dvb8g,in,2012-03-08 00:00:00.0,ad,
9794466,1791002478,Nokia_3110c,dusg4,dvb8g,in,2012-03-08 00:00:00.0,ad,
9794477,1353590316,Blackberry_9300,du6dt,dtr0u,es,2012-03-08 00:00:00.0,es,h5njsswvxorsau9u8fxh0e9se
9794478,1402285217,NokiaC6-01.3,dusnc,dsgcn,ru,2012-03-08 00:00:00.0,mc,
9794481,1848747204,Nokia_C3-00,dvry3,dr6sg,th,2012-03-08 00:00:01.0,mc,oj0rekb51pvirnjuqjt10zn4b
9794482,1893182670,NokiaC2-03,du77a,dr6x2,id,2012-03-08 00:00:01.0,co,r63f8uhijvr2irvka3glwyb38
9794483,1912930086,MAUI,dvwdj,dvb8g,id,2012-03-08 00:00:01.0,ad,
9794485,2098816838,GT-S5360B,dvjtq,dr72e,th,2012-03-08 00:00:01.0,co,
9794486,3309473440,MAUI,dv6i3,ds3k0,za,2012-03-08 00:00:01.0,es,
9794492,702295934,Nokia_9300,dv6i3,dtqrw,ng,2012-03-08 00:00:01.0,es,onbw7na2mi8a62g4p6y3av2qt
9794493,694135362,Nokia_N95,dupgf,dvb8g,sd,2012-03-08 00:00:01.0,ad,hoq05psulkszxm4izlql4g962
9794495,1791428359,Samsung_S8300,dvpo7,dvb8g,in,2012-03-08 00:00:02.0,co,im387req0zp1ucygamhgadgtm
9794496,1783607271,GT-S5570,du56s,dsgq2,in,2012-03-08 00:00:02.0,mc,immfap8948rebeym8ri0vf5cr
9794498,1860189232,Samsung_GT-B3313,du56s,ds22r,in,2012-03-08 00:00:02.0,mc,r81nrzjemr5jrfvjjeoxmdm4y
9794499,1868310973,Nokia_2730c,dv3va,drvnr,au,2012-03-08 00:00:02.0,ad,
9794500,1893182511,Nokia_5233,dv6i7,dr6tn,id,2012-03-08 00:00:02.0,co,tq09jycwii12iul7hzalucue3
9794501,1884230403,Samsung_GT-S3653,dvjil,ds92x,in,2012-03-08 00:00:02.0,mc,h0z1j3bwiverubvwg851e9eon
9794503,1945382244,GT-S5360,dvijt,dsgq2,in,2012-03-08 00:00:02.0,mc,fbbenjzmoe0oc7x4e2080nj8x
9794508,2928534854,Samsung_R310,dunsq,dsg3q,us,2012-03-08 00:00:02.0,ad,kl9j183hop90uwq2p82iidjsb
9794510,3063717709,Samsung_GT-S3653,dvjjf,dr751,in,2012-03-08 00:00:02.0,ad,rpdt9h4kpooxiedeuuxvk6gi5
9794511,3557769762,Samsung_C3050,du53k,dr71b,hr,2012-03-08 00:00:02.0,se,

Update 2

Sample output. This is a Tab Separated Values format. You can copy and paste it in excel to view properly. Here avg_spiky_ReAgCnIpCi is the average count of reference, Agent, Country, IP, Client combination repeating each second. Which I am interested in. And then I can make changes to derive other features.

partnerid   status  avg_spiky_ReAgCnIpCi    std_spiky_ReAgCnIpCi    night_avg_spiky_ReAgCnIpCi  night_std_spiky_ReAgCnIpCi  morning_avg_spiky_ReAgCnIpCi    morning_std_spiky_ReAgCnIpCi    afternoon_avg_spiky_ReAgCnIpCi  afternoon_std_spiky_ReAgCnIpCi  evening_avg_spiky_ReAgCnIpCi    evening_std_spiky_ReAgCnIpCi    avg_spiky_ReAgCnIp  std_spiky_ReAgCnIp  avg_spiky_ReAgCn    std_spiky_ReAgCn    avg_spiky_iplong    std_spiky_iplong    avg_spiky_agent std_spiky_agent night_avg_spiky_agent   night_std_spiky_agent   morning_avg_spiky_agent morning_std_spiky_agent afternoon_avg_spiky_agent   afternoon_std_spiky_agent   evening_avg_spiky_agent evening_std_spiky_agent avg_spiky_cid   std_spiky_cid   avg_spiky_cntr  std_spiky_cntr  avg_spiky_referer   std_spiky_referer   night_avg_spiky_referer night_std_spiky_referer morning_avg_spiky_referer   morning_std_spiky_referer   afternoon_avg_spiky_referer afternoon_std_spiky_referer evening_avg_spiky_referer   evening_std_spiky_referer   category_es category_mc category_ad category_co category_se category_mg category_pp category_in category_gd category_ow total_clicks    distinct_iplong distinct_agent  distinct_cid    distinct_cntr   distinct_referer    night_click_percent morning_click_percent   afternoon_click_percent evening_click_percent   night_referer_percent   morning_referer_percent afternoon_referer_percent   evening_referer_percent night_agent_percent morning_agent_percent   afternoon_agent_percent evening_agent_percent   avg_total_clicks    std_total_clicks    avg_distinct_iplong std_distinct_iplong avg_distinct_agent  std_distinct_agent  avg_distinct_cid    std_distinct_cid    avg_distinct_cntr   std_distinct_cntr   avg_distinct_referer    std_distinct_referer    avg_null_agent  std_null_agent  avg_null_referer    std_null_referer    night_avg_null_referer  night_std_null_referer  morning_avg_null_referer    morning_std_null_referer    afternoon_avg_null_referer  afternoon_std_null_referer  evening_avg_null_referer    evening_std_null_referer    first_15_minute_percent second_15_minute_percent    third_15_minute_percent last_15_minute_percent  brand_MAUI_percent  brand_Nokia_percent brand_Generic_percent   brand_Apple_percent brand_Blackberry_percent    brand_Samsung_percent   brand_SonyEricsson_percent  brand_LG_percent    brand_other_percent avg_per_hour_density    std_per_hour_density    cntr_az_percent cntr_id_percent cntr_in_percent cntr_us_percent cntr_ng_percent cntr_tr_percent cntr_ru_percent cntr_th_percent cntr_sg_percent cntr_uk_percent cntr_other_percent
du3nk   0   1.23    8.47    0   0   0   0   0   0   1.23    8.47    1.24    8.48    1.27    8.61    4.14    11.73   8.73    16.06   0   0   0   0   0   0   8.73    16.06   38.18   240.99  60  248 1.8 10.35   0   0   0   0   0   0   1.8 10.35   0   1   0   0   0   0   0   0   0   0   3360    644 250 61  31  1696    0   0   0   1   0   0   0   1   0   0   0   1   3360    0   644 0   250 0   61  0   31  0   1696    0   0   0   598 0   0   0   0   0   0   0   598 0   0.16    0.17    0.33    0.35    0.01    0   0.05    0   0   0   0   0   0   2   0   0   0   0.13    0   0   0   0   0   0.01    0   0.04
du3nq   1   8.38    5.83    0   0   0   0   0   0   8.38    5.83    25.13   9.27    25.13   9.27    188.5   49.5    188.5   49.5    0   0   0   0   0   0   188.5   49.5    53.86   39.03   188.5   49.5    25.13   9.27    0   0   0   0   0   0   25.13   9.27    1   0   0   0   0   0   0   0   0   0   377 1   1   5   1   8   0   0   0   1   0   0   0   1   0   0   0   1   377 0   1   0   1   0   5   0   1   0   8   0   0   0   0   0   0   0   0   0   0   0   0   0   0.09    0.14    0.33    0.44    0   0   0   1   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3op   0   30.43   46.87   0   0   0   0   44.67   59.63   19.75   30.19   35.5    48.84   35.5    48.84   71  52.27   71  52.27   0   0   0   0   134 0   39.5    33.5    13.31   8.24    71  52.27   35.5    48.84   0   0   0   0   67  62  19.75   30.19   0   0   1   0   0   0   0   0   0   0   213 1   1   6   1   1   0   0   0.63    0.37    0   0   1   1   0   0   1   1   213 0   1   0   1   0   6   0   1   0   1   0   0   0   205 0   0   0   0   0   129 0   76  0   0   0.09    0.25    0.66    0   1   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0   1
du3or   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   0   1   0   0   0   0   0   0   0   0   2   2   1   1   1   1   0   0   0.5 0.5 0   0   1   1   0   0   1   1   2   0   2   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0.5 0   0   0.5 0   0   0   0   0   0   1   0   0   2   0   0   1   0   0   0   0   0   0   0   0   0
du3ov   0   1.01    0.11    0   0   0   0   0   0   1.01    0.11    1.01    0.11    1.01    0.11    44  30  29.33   31.63   0   0   0   0   0   0   29.33   31.63   6.29    5.59    44  30  1.02    0.21    0   0   0   0   0   0   1.02    0.21    0   0   0   0   1   0   0   0   0   0   88  1   2   10  1   86  0   0   0   1   0   0   0   1   0   0   0   1   88  0   1   0   2   0   10  0   1   0   86  0   0   0   0   0   0   0   0   0   0   0   0   0   0.84    0   0   0.16    0   0.94    0   0.06    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3ox   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   1   1   1   1   1   1   0   0   0   1   0   0   0   1   0   0   0   1   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1
du3oy   0   1.02    0.12    0   0   0   0   0   0   1.02    0.12    1.02    0.15    1.02    0.15    64.5    31.5    32.25   35.55   0   0   0   0   0   0   32.25   35.55   7.59    6.03    64.5    31.5    1.03    0.28    0   0   0   0   0   0   1.03    0.28    0   0   0   0   1   0   0   0   0   0   129 1   3   12  1   124 0   0   0   1   0   0   0   1   0   0   0   1   129 0   1   0   3   0   12  0   1   0   124 0   0   0   0   0   0   0   0   0   0   0   0   0   0.26    0.58    0.16    0   0   0.95    0   0.04    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3oz   1   1   0   0   0   0   0   1   0   0   0   1   0   33  3.35    1.01    0.08    165 0   0   0   0   0   165 0   0   0   27.5    8.18    165 0   33  3.35    0   0   0   0   33  3.35    0   0   1   0   0   0   0   0   0   0   0   0   165 164 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   165 0   164 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0
du3p1   0   1   0   0   0   0   0   1   0   0   0   1   0   18.2    16.11   1.01    0.07    91  80  0   0   0   0   91  80  0   0   15.17   14.82   91  80  18.2    16.11   0   0   0   0   18.2    16.11   0   0   1   0   0   0   0   0   0   0   0   0   182 181 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   182 0   181 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0.06    0   0   0.94    0   1   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   1   0   0
du3r7   0   3.63    1.32    0   0   0   0   0   0   3.63    1.32    29  0   29  0   29  0   29  0   0   0   0   0   0   0   29  0   3.63    1.32    29  0   29  0   0   0   0   0   0   0   29  0   0   0   0   0   1   0   0   0   0   0   29  1   1   8   1   1   0   0   0   1   0   0   0   1   0   0   0   1   29  0   1   0   1   0   8   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0

can you please share your data in reproducible format? what have you tried? — mtoto, Jan 22 '17 at 19:07
Updating the question with what I've tried and first 25 records of original dataset. — Keyur Golani, Jan 22 '17 at 19:25
We need more information and clearer information. When you say "within one minute time intervals" do you mean for every one minute time interval or do you mean within one minute of every record. (ie: all duplicates within 0min-1min, 1min-2min, etc ..vs.. record 0 at 59s and record 1 at 1min1s are within a minute of each other). May I suggest providing a simple example of input and output you're expecting. — gnicholas, Jan 22 '17 at 22:22
Ok. 1. Ideally according to my application I think case 2 would be ideal. But however, I suppose it would become too complicated. Hence, I am looking for Case 1 right now. 2. I do have a sample output but it would not be relevant because to relate output, you will need whole input file which is 600MB csv file. However I am updating question with sample output records. — Keyur Golani, Jan 22 '17 at 23:57

Zhang Tong · Accepted Answer · 2017-01-23T06:33:58.667

Init:

from pyspark import *
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql import functions as f

This equal to your 'first 25 records of my original dataset file'.

df = spark.read.load(path="file:///home/zht/PycharmProjects/test/disk_file", format='csv', sep=',', header=True)

Just to get a remarkable result, this step can be ignore

df = df.withColumn('iplong', f.substring('iplong', pos=0, len=1)) \
    .withColumn('agent', f.substring('agent', pos=0, len=1)) \
    .withColumn('client', f.substring('client', pos=0, len=2)) \
    .withColumn('partner', f.substring('partner', pos=0, len=2)) \
    .withColumn('timestamp',f.when(f.substring('id', pos=6, len=1) % 2 == 1, '2012-03-08 00:01:00.0').otherwise(df['timestamp']))
df.show()

+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|     id|iplong|agent|partner|client|country|           timestamp|category|           reference|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|9794476|     1|    S|     dv|    ds|     us|2012-03-08 00:01:...|      ad|                null|
|9794474|     1|    S|     dv|    ds|     in|2012-03-08 00:01:...|      mg|riflql2a0yv8xoa9s...|
|9794471|     3|    N|     du|    dr|     py|2012-03-08 00:01:...|      co|                null|
|9794468|     1|    N|     dv|    ds|     vn|2012-03-08 00:00:...|      es|gp53lqr9njqd6z2ap...|
|9794467|     1|    M|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794466|     1|    N|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794477|     1|    B|     du|    dt|     es|2012-03-08 00:01:...|      es|h5njsswvxorsau9u8...|
|9794478|     1|    N|     du|    ds|     ru|2012-03-08 00:01:...|      mc|                null|
|9794481|     1|    N|     dv|    dr|     th|2012-03-08 00:00:...|      mc|oj0rekb51pvirnjuq...|
|9794482|     1|    N|     du|    dr|     id|2012-03-08 00:00:...|      co|r63f8uhijvr2irvka...|
|9794483|     1|    M|     dv|    dv|     id|2012-03-08 00:00:...|      ad|                null|
|9794485|     2|    G|     dv|    dr|     th|2012-03-08 00:00:...|      co|                null|
|9794486|     3|    M|     dv|    ds|     za|2012-03-08 00:00:...|      es|                null|
|9794492|     7|    N|     dv|    dt|     ng|2012-03-08 00:01:...|      es|onbw7na2mi8a62g4p...|
|9794493|     6|    N|     du|    dv|     sd|2012-03-08 00:01:...|      ad|hoq05psulkszxm4iz...|
|9794495|     1|    S|     dv|    dv|     in|2012-03-08 00:01:...|      co|im387req0zp1ucyga...|
|9794496|     1|    G|     du|    ds|     in|2012-03-08 00:01:...|      mc|immfap8948rebeym8...|
|9794498|     1|    S|     du|    ds|     in|2012-03-08 00:01:...|      mc|r81nrzjemr5jrfvjj...|
|9794499|     1|    N|     dv|    dr|     au|2012-03-08 00:01:...|      ad|                null|
|9794500|     1|    N|     dv|    dr|     id|2012-03-08 00:00:...|      co|tq09jycwii12iul7h...|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+

And the key operation:

res = df.groupBy([f.window('timestamp', windowDuration='1 minutes'),'partner', 'iplong', 'agent']).count()
res = res.withColumn('total',f.sum('count').over(Window.partitionBy(["window", "partner"])))
res.show(n=30, truncate=False)

+---------------------------------------------+-------+------+-----+-----+-----+
|window                                       |partner|iplong|agent|count|total|
+---------------------------------------------+-------+------+-----+-----+-----+
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |S    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |6     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |B    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |G    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |S    |1    |7    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |3     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |N    |3    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |2     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |S    |1    |8    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |3     |S    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |7     |N    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |S    |3    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |N    |1    |6    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |2     |S    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |M    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |N    |2    |4    |
+---------------------------------------------+-------+------+-----+-----+-----+

Column count means the amount of records per every 1 min & partner & iplong & agent

Column total means the amount of records per every 1 min & partner

Do you mean this ?

The key operation output (The last table printed) is each record in it corresponding to a partnerid? If so, then this is really a big help and this is what I needed to go in the right direction. It is not the exact output I need. But it serves the purpose as to point me in the right direction. — Keyur Golani, Jan 23 '17 at 06:21
Yes. I wanted various different combinations of columns than "per every 1 min & partner & iplong & agent" but this is more than what I could expect. This would be helpful to me in getting the desired results. Thanks a lot man!!! I really appreciate the effort. — Keyur Golani, Jan 23 '17 at 08:13
I noticed that you are using pyspark in pycharm. I recently started working on it. And I am able to access pyspark module into pycharm by setting some variables from code. But I am not able to do the same by setting the environment variables directly into the OS. Can you please take a look at my question to see if you can help? http://stackoverflow.com/questions/41735502/spark-installation-and-configuration-on-macos-importerror-no-module-named-pyspa — Keyur Golani, Jan 23 '17 at 18:14

Count Duplicates Values within a time interval in PySpark

1 Answers1