from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string avoids invalid-escape warnings
tokenizer.tokenize(s)

Would this code be considered O(n)?

Based on what I read in the NLTK documentation, "a RegexpTokenizer splits a string into substrings using a regular expression". I'm assuming that matching a regular expression against a string is O(1), and that splitting the string into substrings with tokenizer.tokenize(s) is O(n), where n is the number of characters in the input. Thank you for any clarification.

smith1453
  • Possible duplicate of [What is the complexity of regular expression?](https://stackoverflow.com/questions/4378455/what-is-the-complexity-of-regular-expression) – Johan Oct 21 '18 at 21:22
  • Depending on the regular expression and the text to match, the matching process can have a different time complexity. There was a case where a simple mistake in a regular expression here on SO brought the system down when someone posted code with many spaces: the matching process became too complex for the system to handle (a demonstration follows these comments). – Klaus D. Oct 21 '18 at 21:26
  • @KlausD. Assuming the text is very simple, such as "Hello World", and not overly complicated, what would you consider the time complexity to be? And thank you for the clarification that it depends on the regular expression and the text; I did not take that into consideration. – smith1453 Oct 21 '18 at 21:29
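
To make Klaus D.'s point concrete, here is a minimal sketch (not from the original thread) of catastrophic backtracking in Python's re module. The pattern (a+)+$ is a deliberately pathological, hypothetical example, not the regex from the incident he mentions; the trailing 'b' forces every match attempt to fail, so the running time roughly doubles with each extra character.

import re
import time

# Nested quantifiers such as (a+)+ can backtrack exponentially when the
# overall match is forced to fail (hypothetical example).
pattern = re.compile(r'(a+)+$')

for n in (18, 21, 24):
    s = 'a' * n + 'b'  # the trailing 'b' guarantees a failed match
    start = time.perf_counter()
    pattern.match(s)
    print(n, round(time.perf_counter() - start, 3), 'seconds')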

1 Answer


I would argue that this code is O(n), i.e., linear in the length of the input.

The dominant factor in the running time is the length of the string the regex scans; the other lines are constants, O(1). Since the pattern here is fixed and contains no nested quantifiers, the matcher does a bounded amount of work per character, so tokenizing grows linearly with the input. (As the comments note, a pathological pattern could backtrack and behave far worse, but that does not apply to this one.)

If the regex were searching a piece of text a hundred times longer, that text would be the overwhelming factor in the time complexity.
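
As a rough empirical check (my own sketch, not part of the original answer; it assumes NLTK is installed and reuses the question's pattern), timing tokenize on inputs of increasing length should show roughly proportional growth, which is consistent with O(n):

import timeit
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
base = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.\n"

# If time grows roughly 10x when the input grows 10x, that is
# consistent with linear, O(n), behaviour for this pattern.
for k in (1000, 10000, 100000):
    text = base * k
    t = timeit.timeit(lambda: tokenizer.tokenize(text), number=1)
    print(len(text), 'chars:', round(t, 3), 's')

A micro-benchmark like this cannot prove a complexity bound, but a super-linear blow-up (as in the backtracking example above) would show up immediately.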

James