I have to exclude 1M records (let's say each record is a phone number) from a code flow.

Currently I am thinking of storing all these records in an array and, for each incoming record, checking whether it starts with (via regex) any of the stored values in the array.
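
For illustration, a rough sketch of what I mean (Python assumed; `excluded_prefixes` stands in for the stored records):

```python
# Rough sketch of the current idea (Python assumed);
# excluded_prefixes stands in for the ~1M stored records.
excluded_prefixes = ["1800", "1900"]  # placeholder values

def should_exclude(record):
    # Linear scan over every stored value for each incoming record.
    return any(record.startswith(prefix) for prefix in excluded_prefixes)

print(should_exclude("18005551234"))  # True
```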

The time complexity of this method is O(n). Can I do this in a more efficient way?

1 Answer

Complexity

Your task sounds like it is at least O(N), where N is the number of records you have to process. If you are filtering out N of the M total records (e.g. 1M records out of 1B total), then it will be O(M) unless the M records are already indexed or stored in an appropriate data structure (e.g. a trie).
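
For example, a minimal prefix-trie sketch (Python assumed; the names `build_trie` and `is_excluded` are illustrative): building it costs roughly the total number of characters across the excluded prefixes, and each lookup then costs only the length of the record being checked.

```python
# Minimal prefix-trie sketch (Python assumed; names are illustrative).

class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False  # an excluded prefix ends at this node


def build_trie(prefixes):
    # O(total characters across all prefixes) to build.
    root = TrieNode()
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def is_excluded(record, root):
    # O(len(record)) per lookup, independent of how many prefixes are stored.
    node = root
    for ch in record:
        node = node.children.get(ch)
        if node is None:
            return False
        if node.terminal:
            return True
    return False


# Example:
root = build_trie(["1800", "1900"])
print(is_excluded("18005551234", root))  # True
print(is_excluded("12125551234", root))  # False
```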

Wall clock time

The good news is that processing 1M records using string operations should take relatively little wall clock time (seconds at worst).

I recommend you extract 10K records and time your analysis on those in order to get a rough estimate of the total processing cost.
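
For instance, a rough timing sketch (Python assumed; `sample_10k.txt` and the `build_trie`/`is_excluded` helpers from the sketch above are illustrative):

```python
# Rough timing sketch (Python assumed). sample_10k.txt is a hypothetical
# file with one record per line; is_excluded and root come from the
# trie sketch above.
import time

with open("sample_10k.txt") as f:
    sample = [line.strip() for line in f]

start = time.perf_counter()
kept = [r for r in sample if not is_excluded(r, root)]
elapsed = time.perf_counter() - start

# 10K -> 1M is a factor of 100, so scale the measurement accordingly.
print(f"{len(sample)} records in {elapsed:.3f}s; "
      f"~{elapsed * 100:.1f}s projected for 1M records")
```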

Caveats

A regex is a good filtering mechanism if the "language" of valid (or invalid) records is well defined.

If you are applying a regex to unsanitized client data, beware of Regular Expression Denial of Service. See my post here for details.
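
For example, the classic nested-quantifier shape that triggers catastrophic backtracking (illustrative only, not taken from that post):

```python
# Classic catastrophic-backtracking example (illustrative only).
import re

evil = re.compile(r"^(\d+)+$")   # nested quantifiers
payload = "1" * 30 + "x"         # almost matches, then fails at the end

# Uncomment with care: on a backtracking engine such as CPython's `re`,
# this match can take an extremely long time.
# evil.match(payload)
```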
