
I'm trying to process URLs in a PySpark DataFrame using a class that I've written and a UDF. I'm aware of urllib and other URL-parsing libraries, but in this case I need to use my own code.

In order to get the TLD of a URL, I cross-check it against the IANA public suffix list.

Here's a simplification of my code:

class Parser:

    # list of available public suffixes for extracting top level domains
    file = open("public_suffix_list.txt", 'r')

    data = []
    for line in file:
        if line.startswith("//") or line == '\n':
            pass
        else:
            data.append(line.strip('\n'))

    def __init__(self, url):
        self.url = url

        #the code here extracts port,protocol,query etc.

        #I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname] 

        #extra functionality in my actual class

        i = matches.index(self.string)

        try:
            self.tld = matches[i]

        # logic to find tld if no match

The class works in pure Python, so for example I can run

import Parser

x = Parser("www.google.com")
x.tld #returns ".com"

However, when I try to do

import Parser
from pyspark.sql.functions import udf

parse = udf(lambda x: Parser(x).url)

df = sqlContext.table("tablename").select(parse("column"))

When I call an action, I get:

  File "<stdin>", line 3, in <lambda>
  File "<stdin>", line 27, in __init__
TypeError: 'in <string>' requires string as left operand

So my guess is that it's failing to interpret the data as a list of strings?
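
For reference, here's how that message can be reproduced in plain Python when something that isn't a string ends up on the left-hand side of in (the values below are made up purely to show the error):

hostname = "www.google.com"
suffixes = ["com", None]  # a non-string item sneaking into the suffix list
matches = [s for s in suffixes if s in hostname]
# TypeError: 'in <string>' requires string as left operand, not NoneType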

I've also tried to use

file = sc.textFile("my_file.txt")\
         .filter(lambda x: not x.startswith("//") and x != "")\
         .collect()

data = sc.broadcast(file)

to open my file instead, but that causes:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Any ideas?

Thanks in advance

EDIT: Apologies, I didn't have my code to hand, so my test code didn't explain the problems I was having very well. The error I initially reported was a result of the test data I was using.

I've updated my question to be more reflective of the challenge I'm facing.

– johnaphun

1 Answer


Why do you need a class in this case? (The code defining your class is incorrect: you never declared self.data before using it in the __init__ method.) The only relevant line that affects the output you want is self.url = url, so you are basically passing the identity function as a UDF.

The UnicodeDecodeError you initially reported is due to an encoding issue in your file; it has nothing to do with your definition of the class.
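
If it is an encoding problem, one workaround (just a sketch, assuming the file is UTF-8, which is how the public suffix list is published) is to open it with an explicit encoding rather than relying on the platform default:

import io

# read the suffix list with an explicit encoding
with io.open("public_suffix_list.txt", "r", encoding="utf-8") as f:
    data = [line.strip("\n") for line in f
            if not line.startswith("//") and line != "\n"]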

The second error is in the line sc.broadcast(file); details can be found here: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
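
The short version: build the list on the driver, broadcast it once, and only touch the broadcast variable's .value inside the UDF; never reference sc itself in code that runs on the workers. A rough sketch (the tld selection is just a placeholder for your own logic):

suffixes = sc.textFile("public_suffix_list.txt") \
             .filter(lambda x: not x.startswith("//") and x != "") \
             .collect()                    # plain Python list on the driver

bc_suffixes = sc.broadcast(suffixes)       # shipped to the executors once

def extract_tld(url):
    # only bc_suffixes.value is used here, never sc
    matches = [s for s in bc_suffixes.value if s in url]
    return max(matches, key=len) if matches else None  # placeholder logic

parse = udf(extract_tld)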

EDIT 1

I would redefine your class structure as follows. You basically need to create the instance attribute self.data by assigning self.data = data before you can use it. Also, anything you write at class level before the __init__ method is executed as soon as the class is defined, irrespective of whether you ever instantiate that class, so moving the file-parsing part out of the class will not have any effect on when it runs.

# list of available public suffixes for extracting top level domains
file = open("public_suffix_list.txt", 'r')

data = []
for line in file:
    if line.startswith("//") or line == '\n':
        pass
    else:
        data.append(line.strip('\n'))

class Parser:
    def __init__(self, url):
        self.url = url
        self.data = data 

        #the code here extracts port,protocol,query etc.

        #I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname] 

        #extra functionality in my actual class

        i = matches.index(self.string)

        try:
            self.tld = matches[i]

        # logic to find tld if no match
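
With the class defined like this, a UDF that actually returns the tld (rather than the url, which just echoes the input back) would look roughly like the sketch below; the explicit return type is optional, since udf defaults to StringType. If Parser lives in its own module, that module (and the suffix file) also has to be available on the executors, e.g. via --py-files.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# return the parsed tld instead of echoing the url back
parse_tld = udf(lambda u: Parser(u).tld, StringType())

df = sqlContext.table("tablename").select(parse_tld("column"))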
– Gaurav Dhama
  • I've added some clarification to my original question. I'm using a class because I'm parsing a URL; I thought it made sense to treat the URL as an object and its features as instance variables. – johnaphun Feb 17 '17 at 10:06
  • No, that still produces the same bug. I did have some luck in Zeppelin by using sc.textFile to read in my data and then broadcasting. However, when submitting my application I run into trouble importing my class, because I can't import a SparkContext where one is already defined. @gaurav-dhama – johnaphun Feb 20 '17 at 11:34