I have in a Spark data frame with 10 million rows, where each row represents an alpha numeric string indicating id of a user, example:
602d38c9-7077-4ea1-bc8d-af5c965b4e85
my objective is to check if another id like aaad38c9-7087-4ef1-bc8d-af5c965b4e85
is present in the 10 million list.
I would want to do it efficiently and not search all 10 million records, every single time a search happens. Example can I sort my records alphabetically and ask SparkR to search only within records that begin with a
instead of the universe to speed up search and make it computationally efficient?
Any solutions primarily using SparkR if not then any Spark solution would be helpful