
I want to find all the elements in a list that match a regex. To reduce the number of regex-matching calls, I built a single string by joining the elements with a space delimiter, as shown below:

import re

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23"]
regex_num = r"\d+"
string_a = " ".join(list_a)
num_matches = re.findall(regex_num, string_a)

The list and the matches are as given below:

list_a:  ['4123', '7648', 'afjsdn', 'ujaf', 'huh23']
matches:  ['4123', '7648', '23']

Now that I have all my matches, I want to know whether each match was part of an element/token or was an entire token. One way I can do this is by comparing the match with the actual token/element:

"23" == "huh23"
False

But to do this, I would need the token's serial number, which isn't available directly. The only positional information regex matching provides is the span of the match, which is at the character level.
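That said, the character-level span can be mapped back to a token index if the delimiter is known. A minimal sketch (assuming a single-space delimiter, as in the question): precompute each token's start offset in the joined string, then use `bisect` to locate which token contains each match.

```python
import re
from bisect import bisect_right

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23"]
string_a = " ".join(list_a)

# start offset of each token inside the joined string
starts = []
pos = 0
for tok in list_a:
    starts.append(pos)
    pos += len(tok) + 1  # +1 for the single-space delimiter

results = []
for m in re.finditer(r"\d+", string_a):
    # map the match's character offset back to the token containing it
    idx = bisect_right(starts, m.start()) - 1
    results.append((idx, m.group(), m.group() == list_a[idx]))

print(results)
# [(0, '4123', True), (1, '7648', True), (4, '23', False)]
```

This keeps the single `finditer` pass over the joined string while still recovering the token's serial number and a full-token flag.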

The other path I could take is to loop through the list, apply the regex to each element, and compare the element with its match when there is one.

I would like to keep the time complexity of this operation as low as possible.

Is there a more Pythonic way to determine whether a match is only part of a token, or to find the serial number of the matched token so that the original list can be used for string comparison?

Any help would be appreciated. Thanks in advance!

Edit 1:

If my list is something like the one suggested by @Artyom Vancyan in the comments:

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"]

The output I would like is:

matches_with_slno = [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
Suneha K S
  • Hi! Do you really need the `join` part? Couldn't it be possible to apply the search pattern to the list directly, term by term? – cards Jun 17 '22 at 08:34
  • Yes, that is what I plan to do if there isn't another quicker and more convenient way. – Suneha K S Jun 17 '22 at 08:36
  • @cards suggestion sounds like the quicker more convenient way to me – Anentropic Jun 17 '22 at 08:40
  • @SunehaKS, what should match for a string like `n23kl3l24`? – Artyom Vancyan Jun 17 '22 at 08:43
  • @ArtyomVancyan 23, 3 and 24 should match. And for each of these, I would like to know that it is a part of the token and the serial number of that token, which in this case will be the same. – Suneha K S Jun 17 '22 at 08:44
  • Like separate items of a list? I mean if your list is `['4123', 'ujaf', 'n23kl3l24']` the output should be `['4123', '23', '3', '24']`? – Artyom Vancyan Jun 17 '22 at 08:47

2 Answers


Using yield from

The most Pythonic solution I would recommend combines enumerate with a generator.

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']


def process(array):
    for index, item in enumerate(array):
        yield from [[index, match] for match in re.findall(r"\d+", item)]


print(list(process(arr)))  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]

One of the uses of yield from is list flattening. Note that yield from cannot be used inside a list comprehension; otherwise this could be a one-liner. We use enumerate to get each element's serial index (number). Because yield is used, the process function becomes a generator.

NOTE: In the generator implementation, we use a loop and list comprehension as well.
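As a minimal illustration of the flattening behavior mentioned above, yield from delegates iteration to each inner iterable in turn:

```python
def flatten(nested):
    # delegate to each inner iterable, yielding its items one by one
    for sub in nested:
        yield from sub

print(list(flatten([[1, 2], [3], [4, 5]])))  # [1, 2, 3, 4, 5]
```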

Using list comprehension

import re

arr = ['4123', '7648', 'afjsdn', 'ujaf', 'huh23', 'n23kl3l24']

print([[index, match] for index, item in enumerate(arr) for match in re.findall(r"\d+", item)])  # [[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]
Artyom Vancyan
  • Thanks for the clarification :) I wanted to understand the purpose of the generator here. It is also possible to use the for loop inside the generator directly, without making the calls like: `output = [[index, match] for match in re.findall(r"\d+", item) for index, item in enumerate(arr)]`, right? Am I missing something here? Please let me know. – Suneha K S Jun 17 '22 at 10:52
  • Yes, we have to use a generator for using `yield from` that allows us to flatten the list which is more pythonic than having a list in a function, append something on every iteration of the loop and return the list. – Artyom Vancyan Jun 17 '22 at 10:58
  • Can't we just use two for loops here? as a part of the list comprehension itself? – Suneha K S Jun 17 '22 at 11:03
  • Sure we can, but as I said in my previous comment, it needs an additional list to hold results. – Artyom Vancyan Jun 17 '22 at 11:05
  • `[[index, match] for index, item in enumerate(arr) for match in re.findall(r"\d+", item)]` This code is giving me the same results: `[[0, '4123'], [1, '7648'], [4, '23'], [5, '23'], [5, '3'], [5, '24']]` – Suneha K S Jun 17 '22 at 11:11
  • I just wanted to share that with you :) – Artyom Vancyan Jun 17 '22 at 11:13
  • Yes, it is also a good solution. – Artyom Vancyan Jun 17 '22 at 11:14

A functional programming approach using itertools:

import re
from itertools import chain, zip_longest

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"]

regex = r'\d+'
# pre-compilation of the regex
pattern = re.compile(regex)

# group the matches
grps = ((i, m) for i, m in enumerate(chain(map(pattern.findall, list_a))) if m)

# pair each index with its matches, then flatten
out = list(chain.from_iterable(tuple(zip_longest((i,), matches, fillvalue=i)) for i, matches in grps))
print(out)
#[(0, '4123'), (1, '7648'), (4, '23'), (5, '23'), (5, '3'), (5, '24')]
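The question also asks whether a match covers the whole token or only part of it. Since each pair above already carries the token's index, a direct equality check against the original element answers that. A sketch extending the same idea (the `is_full_token` flag is my addition, not part of the answers above):

```python
import re

list_a = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"]
pattern = re.compile(r"\d+")

# (index, match, is_full_token) triples
flagged = [(i, m, m == item)
           for i, item in enumerate(list_a)
           for m in pattern.findall(item)]

print(flagged)
# [(0, '4123', True), (1, '7648', True), (4, '23', False),
#  (5, '23', False), (5, '3', False), (5, '24', False)]
```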

Performance test

I tried to make a fair test: each answer has two implementations:

  • v1: without precompilation, using re.findall
  • v2: with regex precompilation, using pattern.findall

The test's parameters are the number of repetitions (to get a statistically stable result) and a length factor that stretches the size of the data list.

In all cases it is evident that precompilation makes the code more performant. The list comprehension seems to be the best solution, then the generator, and the itertools version last.

To one significant figure, for small data, all of them perform the same.

Time measurement done with timeit.

from timeit import timeit
from itertools import chain, zip_longest
import re

def with_itertools_v1(lst):
    grps = ((i, m) for i, m in enumerate(chain(map(lambda x: re.findall(r'\d+', x), lst))) if m)
    return list(chain.from_iterable(list(zip_longest((i,), matches, fillvalue=i)) for i, matches in grps))
    
def with_itertools_v2(lst):
    pattern = re.compile(r'\d+')
    grps = ((i, m) for i, m in enumerate(chain(map(pattern.findall, lst))) if m)
    return list(chain.from_iterable(list(zip_longest((i,), matches, fillvalue=i)) for i, matches in grps))


def with_generators_v1(lst):
    def gen():
        for index, item in enumerate(lst):
            yield from [[index, match] for match in re.findall(r"\d+", item)]
    return list(gen())


def with_generators_v2(lst):
    def gen():
        pattern = re.compile(r'\d+')
        for index, item in enumerate(lst):
            yield from [[index, match] for match in pattern.findall(item)]
    return list(gen())

def with_list_comprehension_v1(lst):
    return [[index, match] for index, item in enumerate(lst) for match in re.findall(r"\d+", item)]


def with_list_comprehension_v2(lst):
    pattern = re.compile(r'\d+')
    return [[index, match] for index, item in enumerate(lst) for match in pattern.findall(item)]


funcs = (with_itertools_v1, with_itertools_v2, with_generators_v1, with_generators_v2, with_list_comprehension_v1, with_list_comprehension_v2)


def tester(repetition=5, length_factor=1):
    data = ["4123", "7648", "afjsdn", "ujaf", "huh23", "n23kl3l24"] * length_factor
    # len(data) already includes the length factor
    print(f'Test repeated {repetition} times with data of {len(data)} terms')
    for func in funcs:
        t = timeit(lambda: func(data), number=repetition)
        print(func.__name__, f'ms: {t / repetition * 1e3:3f}')
    print()


tester(5, 1)
tester(5, 10)
tester(5, 100)
tester(5, 1000)

Results

Test repeated 5 times with data of 6 terms
with_itertools_v1 ms: 0.060872
with_itertools_v2 ms: 0.019029
with_generators_v1 ms: 0.023321
with_generators_v2 ms: 0.015797
with_list_comprehension_v1 ms: 0.017898
with_list_comprehension_v2 ms: 0.010917

Test repeated 5 times with data of 60 terms
with_itertools_v1 ms: 0.248608
with_itertools_v2 ms: 0.142064
with_generators_v1 ms: 0.201945
with_generators_v2 ms: 0.116305
with_list_comprehension_v1 ms: 0.160765
with_list_comprehension_v2 ms: 0.076942

Test repeated 5 times with data of 600 terms
with_itertools_v1 ms: 2.165053
with_itertools_v2 ms: 1.612787
with_generators_v1 ms: 1.484747
with_generators_v2 ms: 0.639824
with_list_comprehension_v1 ms: 11.219491
with_list_comprehension_v2 ms: 0.809310

Test repeated 5 times with data of 6000 terms
with_itertools_v1 ms: 46.531748
with_itertools_v2 ms: 35.922768
with_generators_v1 ms: 77.271880
with_generators_v2 ms: 20.042708
with_list_comprehension_v1 ms: 22.927475
with_list_comprehension_v2 ms: 16.760901
cards
  • I notice a problem with the indices, I will fix it! Sorry – cards Jun 17 '22 at 11:05
  • thanks for the answer, I also wanted to understand how this approach is superior compared to nested for loops in list comprehension, like `output = [[index, match] for match in re.findall(r"\d+", item) for index, item in enumerate(arr)]`. In terms of time complexity – Suneha K S Jun 17 '22 at 11:06
  • @Suneha K S now edited with the right output. I don't know about time complexity but `itertools` provides _"fast, memory efficient tools"_. One should make a time test of the solutions (as well as the nested comprehension) and check the results – cards Jun 17 '22 at 12:32
  • Thanks for the solution. Would have loved to have the results of such tests. – Suneha K S Jun 17 '22 at 12:41
  • @Suneha K S I am working on the time test. At the moment I can say that the list comprehension is better. BUT to improve performance you have to do the precompilation trick with the regex, see my answer; this drastically decreases the execution time!! – cards Jun 17 '22 at 13:14
  • that is very interesting. Appreciate the extra effort. Thanks a lot for getting back to me on this! If you could post the comparative time-test results, your answer will be more valuable irrespective of the solution with the best time complexity. – Suneha K S Jun 17 '22 at 13:22
  • I will retry the test using another approach just to be sure. I will also add some new functions – cards Jun 17 '22 at 14:03