I have a text corpus of 11 files, each with about 190000 lines. I have 10 strings, one or more of which may appear in each line of this corpus.
For each line, I need to record which of the 10 strings appear in it. The brute-force approach of looping through the regexes for every line and marking the matches is taking a long time. Is there a more efficient way of doing this?
I found a post (Match a line with multiple regex using Python) whose solution only gives a True or False result. But how do I record which regexes matched the line:
any(regex.match(line) for regex in [regex1, regex2, regex3])
Edit: adding example
regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # I need to record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # I should record quick and brown
line3 = "fox was quick an rabit was slow" # I should record quick and fox
Looping through the regexes and recording the ones that match is one solution, but at this scale (11 * 190000 * 10 checks) my script has been running for a while now. I need to repeat this quite often in my work, so I am looking for a more efficient way.
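For concreteness, here is a minimal sketch of the brute-force loop described above, next to one possible single-pass variant that joins the strings into one alternation. The names `strings`, `matches_by_loop` and `matches_by_alternation` are hypothetical and only for illustration. The sketch assumes the 10 strings are plain literals (hence `re.escape`) and uses `search` rather than `match`, because `match` only tests the start of the line. The alternation variant is a guess at something faster, not a measured result, and it can miss a string that overlaps another one in the same line.

```python
import re

# Hypothetical names; the three strings come from the example above.
strings = ['quick', 'brown', 'fox']

# Brute force: one compiled pattern per string, each tried on every line.
patterns = [re.compile(re.escape(s)) for s in strings]

def matches_by_loop(line):
    """Return the strings found in the line, in the order of `strings`."""
    return [s for s, p in zip(strings, patterns) if p.search(line)]

# Possible alternative: a single alternation compiled once,
# so each line is scanned once instead of once per string.
combined = re.compile('|'.join(re.escape(s) for s in strings))

def matches_by_alternation(line):
    """Return the distinct strings found in the line as a set."""
    # findall reports non-overlapping matches only, so strings that
    # overlap each other in the text may not all be reported.
    return set(combined.findall(line))

line1 = "quick brown fox jumps on the lazy dog"
line2 = "quick dog and brown rabbit ran together"
line3 = "fox was quick an rabit was slow"

for line in (line1, line2, line3):
    print(matches_by_loop(line), sorted(matches_by_alternation(line)))
```

Both functions should report the same strings for the three example lines; whether the second is actually faster on the full corpus would need to be timed.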