
I have a Spark data frame with 10 million rows, where each row is an alphanumeric string representing a user ID, for example: 602d38c9-7077-4ea1-bc8d-af5c965b4e85. My objective is to check whether another ID, like aaad38c9-7087-4ef1-bc8d-af5c965b4e85, is present among the 10 million rows.

I want to do this efficiently, without searching all 10 million records every single time a search happens. For example, can I sort my records alphabetically and ask SparkR to search only within records that begin with "a", instead of the whole universe, to speed up the search and make it computationally efficient?
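A minimal sketch of that prefix idea in Scala Spark, assuming a SparkSession `spark`, a DataFrame `df` with the IDs in a string column `id`, and a scratch output path (none of these names come from the question):

```scala
import org.apache.spark.sql.functions.{col, substring}

// One-time preparation: write the IDs out partitioned by their first
// character, so each later lookup only reads the matching partition.
df.withColumn("prefix", substring(col("id"), 1, 1))
  .write
  .partitionBy("prefix")
  .parquet("/tmp/ids_by_prefix")

// Lookup: partition pruning restricts the scan to records that begin
// with the target's first character.
val target = "aaad38c9-7087-4ef1-bc8d-af5c965b4e85"
val found = spark.read.parquet("/tmp/ids_by_prefix")
  .filter(col("prefix") === target.substring(0, 1))
  .filter(col("id") === target)
  .limit(1)
  .count() > 0
```

Partitioning by the first character lets Spark's partition pruning skip the other directories entirely, which is the "search only within records that begin with a" behavior described above.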

A solution using SparkR would be ideal; failing that, any Spark solution would be helpful.

Anurag H

1 Answer


You can use `rlike`, which performs a regex search within a DataFrame column:

import spark.implicits._ // needed for the $"colname" column syntax
df.filter($"foo".rlike("regex"))
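For the exact-ID lookup described in the question, a sketch of how `rlike` might be applied (this relies on the `import spark.implicits._` above and assumes the IDs sit in a column named `id`, which is my naming, not the answer's):

```scala
import java.util.regex.Pattern

val target = "aaad38c9-7087-4ef1-bc8d-af5c965b4e85"

// Quote the target so any regex metacharacters are taken literally, and
// anchor the pattern so only whole-string matches count.
val found = df
  .filter($"id".rlike("^" + Pattern.quote(target) + "$"))
  .limit(1)
  .count() > 0
```

For an exact match, `df.filter($"id" === target)` is simpler and avoids regex overhead, though either form still scans the full column unless the storage layout supports pruning.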

Or you can index the Spark DataFrame into Solr, which will definitely find your string within a few milliseconds: https://github.com/lucidworks/spark-solr
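A sketch of what indexing and querying through spark-solr might look like, following the conventions in that project's README; the zkhost, collection name, and field name here are placeholders, not values from the answer:

```scala
import org.apache.spark.sql.SaveMode

// Index the DataFrame into a Solr collection (placeholder zkhost/collection).
df.write
  .format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "user_ids")
  .mode(SaveMode.Overwrite)
  .save()

// Query the indexed collection back through Spark; the "query" option
// takes a standard Solr query string.
val hits = spark.read
  .format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "user_ids")
  .option("query", "id:\"aaad38c9-7087-4ef1-bc8d-af5c965b4e85\"")
  .load()
```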

Aniket Rangrej
  • `filter` doesn't really satisfy the condition to _do it efficiently and not search all 10 million records_, does it? – zero323 Jun 04 '18 at 09:22