
I have a problem where I need to search for the contents of one RDD in another RDD.

This question is different from Efficient string matching in Apache Spark, as I am searching for an exact match and I don't need the overhead of using the ML stack.

I am new to Spark and I want to know which of these methods is more efficient, or if there is another way. I have a keywords file like the sample below (in production it might reach up to 200 lines):

Sample keywords file

0.47uF 25V X7R 10% -TDK C2012X7R1E474K125AA
20pF-50V NPO/COG - AVX- 08055A200JAT2A

and I have another file (tab separated) from which I need to find matches (in production it has up to 80 million lines):

C2012X7R1E474K125AA Conn M12 Circular PIN 5 POS Screw ST Cable Mount 5 Terminal 1 Port

First method: I defined a UDF and looped through the keywords for each line.

import re

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

keywords = sc.textFile("keys")
part_description = sc.textFile("part_description")


def build_regex(keywords):
    # (?<!\S) / (?!\S) require each keyword to be a whole
    # whitespace-delimited token, not part of a larger one
    res = '('
    for key in keywords:
        res += r'(?<!\S)%s(?!\S)|' % re.escape(key)
    return res[:-1] + ')'


def get_matching_string(line, regex):
    matches = re.findall(regex, line, re.IGNORECASE)
    return list(set(matches)) if matches else None


def find_matching_regex(line):
    result = []
    # read the regexes from the broadcast variable on the executors
    for keyGroup in keys_broadcast.value:
        matches = get_matching_string(line, keyGroup)
        if matches:
            result.append(str(keyGroup) + '~~' + str(matches) + '~~' + str(len(matches)))
    return result if result else None


def split_row(fields):
    try:
        return Row(fields[0], fields[1])
    except IndexError:
        # skip malformed rows that don't have both columns
        return None


keys_rdd = keywords.map(lambda line: build_regex(line.replace(',', ' ').replace('-', ' ').split()))
keys = keys_rdd.collect()

# keep a handle on the broadcast variable and read it via .value in the UDF;
# calling sc.broadcast(keys) without storing the result broadcasts nothing useful
keys_broadcast = sc.broadcast(keys)

part_description = part_description.map(lambda item: item.split('\t'))
df = part_description.map(split_row).filter(lambda x: x is not None).toDF(
    ["part_number", "description"])

find_regex = udf(find_matching_regex, ArrayType(StringType()))

df = df.withColumn('matched', find_regex(df['part_number']))

df = df.filter(df.matched.isNotNull())

df.write.save(path=job_id, format='csv', mode='append', sep='\t')

Second method: I thought I could get more parallelism (instead of looping through the keys as above), so I did a cartesian product between the keys and the lines, split and exploded the keys, then compared each key to the part column.

from pyspark.sql import functions as F

df = part_description.cartesian(keywords)

df = df.map(lambda pair: (pair[0].split('\t'), pair[1])).map(
    lambda fields: (fields[0][0], fields[0][1], fields[1]))

df = df.toDF(['part_number', 'description', 'keywords'])

df = df.withColumn('single_keyword', F.explode(F.split(F.col('keywords'), r'\s+'))).where('keywords != ""')

df = df.withColumn('matched_part_number', (df['part_number'] == df['single_keyword']))

df = df.filter(df['matched_part_number'] == F.lit(True))

df.write.save(path='part_number_search', format='csv', mode='append', sep='\t')

Are these the correct ways to do this? Is there anything I can do to process this data faster?

  • Possible duplicate of [Efficient string matching in Apache Spark](https://stackoverflow.com/questions/43938672/efficient-string-matching-in-apache-spark) – user10938362 Jun 23 '19 at 10:44
  • @user10938362, the suggested question is about OCR and pyspark ML, used to get similar strings; I'm not sure how it's applicable to my problem. – Exorcismus Jun 23 '19 at 10:49

1 Answer


These are both valid solutions, and I have used both in different circumstances.

The broadcast approach communicates far less data: you ship only the ~200 keyword lines to each executor, instead of replicating every line of your 80-million-line file 200 times, so it will most likely be the faster of the two for you.

I have used the cartesian approach when the lookup set is too large to feasibly broadcast (much, much larger than 200 lines).

In your situation, I would use broadcast.
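
For reference, here is a minimal sketch of that broadcast pattern, with the regexes compiled once on the driver so the executors only pay for the search. The names (build_pattern, bc_patterns, find_matches) are illustrative, and it assumes an active SparkContext sc and the same keyword normalization as in your question:

import re

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


def build_pattern(line):
    # one compiled, case-insensitive pattern per keyword line;
    # (?<!\S) / (?!\S) require whole whitespace-delimited tokens
    tokens = line.replace(',', ' ').replace('-', ' ').split()
    alternation = '|'.join(r'(?<!\S)%s(?!\S)' % re.escape(t) for t in tokens)
    return re.compile(alternation, re.IGNORECASE)


# ~200 keyword lines are tiny, so collecting to the driver is cheap;
# broadcast the compiled patterns once and reuse them for every row
keyword_lines = sc.textFile("keys").collect()
bc_patterns = sc.broadcast([(line, build_pattern(line)) for line in keyword_lines])


@udf(returnType=ArrayType(StringType()))
def find_matches(text):
    hits = [key for key, pattern in bc_patterns.value if pattern.search(text)]
    return hits or None

You would then apply it exactly like your first method, e.g. df.withColumn('matched', find_matches(df['part_number'])).filter('matched is not null').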

Jonathan Myers