convert sparql query to sql query in pyspark-dataframe

Question

I have dataset that consist of three columns subject, predicate, and object

subject      predicate    object
   c1            B          V3
   c1            A          V3
   c1            T          V2
   c2            A          V2
   c2            A          V3
   c2            T          V1
   c2            B          V3
   c3            B          V3
   c3            A          V3
   c3            T          V1
   c4            A          V3
   c4            T          V1
   c5            B          V3
   c5            T          V2
   c6            B          V3
   c6            T          V1

I want to apply association mining rules on this data by using sql queries. I take this idea from this paper Association Rule Mining on Semantic data by using sparql(SAG algorithm)

first, the user has to specify T (target predicate) and minimum support,then query if this T is frequent or not:

SELECT ?pt ?ot (COUNT(*) AS ?Yent)
WHERE {?s ?pt ?ot.
FILTER (regex (str(?pt), 'T', 'i'».}
GROUP BY ?pt ?ot
HAVING (?Yent >= 2)

I tried following code and I got same result:

q=mtcars1.select('s','p','o').where(mtcars1['p']=='T')
q1=q.groupBy('p','o').count()
q1.filter(q1['count']>=2).show()

result

+---+---+-----+
|  p|  o|count|
+---+---+-----+
|  T| V2|    2|
|  T| V1|    4|
+---+---+-----+

second query to calculate other predicates and objects if they are frequent:

q2=mtcars1.select('s','p','o').where(mtcars1['p']!='T')
q3=q2.groupBy('p','o').count()
q3.filter(q3['count']>=2).show()

result

+---+---+-----+
|  p|  o|count|
+---+---+-----+
|  A| V3|    4|
|  B| V3|    5|
+---+---+-----+

in order to find rules between two above queries, we will scan dataset again and find if they are repeated together greater than or equal minimum support

SELECT ?pe ?oe ?pt ?ot (count(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
FILTER (regex (str(?pt), 'T','i'».
?s ?pe ?oe .
FILTER (!regex (str(?pe), 'T','i'».}
GROUP BY ?pe ?oe ?pt ?ot
HAVING (?supCNT >= I)
ORDER BY ?pt ?ot

I tried to store subject in list then join between items ,but this took long time, and this will take very long time if data is very large.

w=mtcars1.select('s','p' ,'o').where(mtcars1['p']=='T')
w1=w.groupBy('p','o').agg(collect_list('s')).show()

result

+---+---+----------------+
|  p|  o| collect_list(s)|
+---+---+----------------+
|  T| V2|        [c1, c5]|
|  T| V1|[c2, c3, c4, c6]|
+---+---+----------------+
w2=mtcars1.select('s','p' ,'o').where(mtcars1['p']!='T')
w3=w2.groupBy('p','o').agg(collect_list('s')).show()

result

+---+---+--------------------+
|  p|  o|     collect_list(s)|
+---+---+--------------------+
|  A| V3|    [c1, c2, c3, c4]|
|  B| V3|[c1, c2, c3, c5, c6]|
|  A| V2|                [c2]|
+---+---+--------------------+

join code

    from pyspark.sql.functions import *
w44=w1.alias("l")\
    .crossJoin(w3.alias("r"))\
    .select(
        f.col('l.p').alias('lp'),
        f.col('l.o').alias('lo'),
        f.col('r.p').alias('rp'),
        f.col('r.o').alias('ro'),
        intersection_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('TID'),
        intersection_length_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('len')
    )\
    .where(f.col('len') > 1)\
    .select(
        f.struct(f.struct('lp', 'lo'), f.struct('rp', 'ro')).alias('2-Itemset'), 
        'TID'
    )\
    .show()

result

+---------------+------------+
|      2-Itemset|         TID|
+---------------+------------+
|[[T,V2],[B,V3]]|    [c1, c5]|
|[[T,V1],[A,V3]]|[c3, c2, c4]|
|[[T,V1],[B,V3]]|[c3, c2, c6]|
+---------------+------------+

so,I have to re scan dataset again and find association rules between items, and re scan again to find again rules. following query is used to construct 3-factor set:

SELECT ?pel ?oel ?pe2 ?oe2 ?pt ?ot (eount(*) AS
?supCNT)
WHERE { ?s ?pt ?ot .
FILTER (regex (str(?pt), 'T','i'».
?s ?pel ?oel .
FILTER (!regex (str(?pel), 'T','i'».
FILTER (!regex (str(?pc2), 'T','i')&& !regex
(str(?pc2), str(?pcl),'i') ).}
GROUP BY ?pcl ?ocl ?pc2 ?oc2 ?pt ?ot
HAVING (?supCNT >=2)
ORDER BY ?pt ?ot

result for this query should be

{[(A, V3) (B, V3) (T, V1), 2]}

and we will repeat queries until no other rules between items can anyone help me how can make association rules by sql queries,where subject is used as ID ,predicate + object=items

Please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide a [mcve]. Tell us why your current code isn't working and what you'd like the desired output to be. — pault, Feb 14 '18 at 21:24
Providing the .explain() plan for the query would be helpful. Screen shots from the spark UI is also helpful for tuning. Apply your filters as early as possible. Cross join is going to be expensive operation. — Douglas M, Nov 03 '22 at 18:20

convert sparql query to sql query in pyspark-dataframe

0 Answers0

Linked