
I have a Spark data frame with 10 million rows, where each row is an alphanumeric string representing a user ID, for example: 602d38c9-7077-4ea1-bc8d-af5c965b4e85. My objective is to check whether another ID, like aaad38c9-7087-4ef1-bc8d-af5c965b4e85, is present among the 10 million rows.

I want to do this efficiently, without searching all 10 million records every single time a search happens. For example, can I sort my records alphabetically and ask SparkR to search only within records that begin with "a", instead of the whole universe, to speed up the search and make it computationally efficient?
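A minimal sketch of that prefix idea in Scala Spark, assuming a SparkSession `spark`, a DataFrame `df` with the IDs in a string column `id`, and a scratch output path (none of these names come from the question):

```scala
import org.apache.spark.sql.functions.{col, substring}

// One-time preparation: write the IDs out partitioned by their first
// character, so each later lookup only reads the matching partition.
df.withColumn("prefix", substring(col("id"), 1, 1))
  .write
  .partitionBy("prefix")
  .parquet("/tmp/ids_by_prefix")

// Lookup: partition pruning restricts the scan to records that begin
// with the target's first character.
val target = "aaad38c9-7087-4ef1-bc8d-af5c965b4e85"
val found = spark.read.parquet("/tmp/ids_by_prefix")
  .filter(col("prefix") === target.substring(0, 1))
  .filter(col("id") === target)
  .limit(1)
  .count() > 0
```

Partitioning by the first character lets Spark's partition pruning skip the other directories entirely, which is the "search only within records that begin with a" behavior described above.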

A solution using SparkR would be ideal; failing that, any Spark solution would be helpful.

Anurag H

1 Answer


You can use `rlike`, which performs a regex search within a DataFrame column:

import spark.implicits._ // needed for the $"colname" column syntax
df.filter($"foo".rlike("regex"))
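For the exact-ID lookup described in the question, a sketch of how `rlike` might be applied (this relies on the `import spark.implicits._` above and assumes the IDs sit in a column named `id`, which is my naming, not the answer's):

```scala
import java.util.regex.Pattern

val target = "aaad38c9-7087-4ef1-bc8d-af5c965b4e85"

// Quote the target so any regex metacharacters are taken literally, and
// anchor the pattern so only whole-string matches count.
val found = df
  .filter($"id".rlike("^" + Pattern.quote(target) + "$"))
  .limit(1)
  .count() > 0
```

For an exact match, `df.filter($"id" === target)` is simpler and avoids regex overhead, though either form still scans the full column unless the storage layout supports pruning.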

Or you can index the Spark DataFrame into Solr, which will definitely find your string within a few milliseconds: https://github.com/lucidworks/spark-solr
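A sketch of what indexing and querying through spark-solr might look like, following the conventions in that project's README; the zkhost, collection name, and field name here are placeholders, not values from the answer:

```scala
import org.apache.spark.sql.SaveMode

// Index the DataFrame into a Solr collection (placeholder zkhost/collection).
df.write
  .format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "user_ids")
  .mode(SaveMode.Overwrite)
  .save()

// Query the indexed collection back through Spark; the "query" option
// takes a standard Solr query string.
val hits = spark.read
  .format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "user_ids")
  .option("query", "id:\"aaad38c9-7087-4ef1-bc8d-af5c965b4e85\"")
  .load()
```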

Aniket Rangrej
  • `filter` doesn't really satisfy the condition to _do it efficiently and not search all 10 million records_, does it? – zero323 Jun 04 '18 at 09:22