
I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.

text = "test : match this."

At the moment, I am using:

import re
re.match('(?<=test :).*',text)

The above code doesn't match anything. I need match this as my output.

Wiktor Stribiżew
Amith

4 Answers


Everything after test, including test

test.*

Everything after test, without test

(?<=test).*

Example here on regexr.com
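
To see how the two patterns differ on the question's string, here is a minimal sketch. It assumes re.search is used to apply them (the answer only lists the patterns), and the variable names are illustrative.

```python
import re

text = "test : match this."

# Pattern 1: the match includes the literal "test".
with_test = re.search(r'test.*', text).group()

# Pattern 2: the lookbehind asserts that "test" precedes the match
# but does not consume it, so "test" is left out of the result.
without_test = re.search(r'(?<=test).*', text).group()

print(repr(with_test))     # 'test : match this.'
print(repr(without_test))  # ' : match this.'
```

Note that the second result still contains the space and colon, because the lookbehind only covers test itself.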

Alex
Punnerud

You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered:

re.search(r'(?<=test :)[^.\s]*',text)

To match all the chars until a period is encountered,

re.search(r'(?<=test :)[^.]*',text)
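
A short sketch of how these calls behave on the question's string may help. One thing to watch: the string has a space right after the colon, so the first pattern stops immediately and yields an empty match. The variant with the space inside the lookbehind is an assumption about what was intended, not part of the original answer.

```python
import re

text = "test : match this."

# re.match anchors at position 0, where the lookbehind cannot be satisfied.
at_start = re.match(r'(?<=test :).*', text)

# re.search scans the whole string. The space right after the colon is
# excluded by [^.\s], so this pattern matches an empty string here.
empty = re.search(r'(?<=test :)[^.\s]*', text).group()

# Assumption: include the space in the lookbehind to reach the next word.
word = re.search(r'(?<=test : )[^.\s]*', text).group()

# Everything up to the first period (the leading space is kept).
until_period = re.search(r'(?<=test :)[^.]*', text).group()

print(at_start)             # None
print(repr(empty))          # ''
print(repr(word))           # 'match'
print(repr(until_period))   # ' match this'
```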
Avinash Raj

In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:

import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s)           # Run a regex search anywhere inside a string
if m:                     # If there is a match
    print(m.group(1))     # Print Group 1 value

If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):

p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
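
As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without re.DOTALL on a two-line string. The sample string is made up for the example.

```python
import re

s = "test : first line\nsecond line"  # hypothetical two-line input

# Default: . stops at the newline, so only the first line is captured.
single = re.search(r'test\s*:\s*(.*)', s).group(1)

# With re.DOTALL, . also matches newlines, so the capture runs to the end.
multi = re.search(r'test\s*:\s*(.*)', s, re.DOTALL).group(1)

print(repr(single))  # 'first line'
print(repr(multi))   # 'first line\nsecond line'
```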

However, these patterns will return match this. with the trailing period included. See also a regex demo.

You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:

test\s*:\s*(.*)\.

Watch out for re.match(), since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)

See the regex demo and a sample Python code snippet:

import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s)           # Run a regex search anywhere inside a string
if m:                     # If there is a match
    print(m.group(1))     # Print Group 1 value

If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
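
A small sketch of the word-boundary behavior, using a made-up second input ("contest : ...") to show a case that \b rejects. It also checks what '\b' means in a non-raw string literal.

```python
import re

p = re.compile(r'\btest\s*:\s*(.*)\.')

# "test" starts at a word boundary here, so the pattern matches.
whole_word = p.search("test : match this.")

# In "contest", the "test" part is preceded by a word character,
# so \b fails and there is no match.
inside_word = p.search("contest : match this.")

# Without the r prefix, '\b' is a one-character string holding a backspace.
is_backspace = '\b' == '\x08'

print(whole_word.group(1))  # match this
print(inside_word)          # None
print(is_backspace)         # True
```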

Wiktor Stribiżew

I don't see why you want to use regex if you're just getting a subset from a string.

This works the same way:

if line.startswith('test:'):
    print(line[5:line.find('.')])

example:

>>> line = "test: match this."
>>> print(line[5:line.find('.')])
 match this

Regex is slow, it is awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.

See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions

For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
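
One way this generalization might look is sketched below. The helper name text_after and its stop parameter are hypothetical, and it uses str.partition rather than find so that a line without a period is returned whole instead of losing its last character (find returns -1 in that case). It also strips surrounding whitespace, which the snippet above does not.

```python
def text_after(line, prefix, stop="."):
    """Return the text between prefix and the first stop character.

    Hypothetical helper: returns None when line does not start with prefix.
    """
    if not line.startswith(prefix):
        return None
    rest = line[len(prefix):]               # index past the prefix, whatever its length
    head, _sep, _tail = rest.partition(stop)  # head is all of rest if stop is absent
    return head.strip()


print(text_after("test: match this.", "test:"))  # match this
print(text_after("test: no period", "test:"))    # no period
print(text_after("other: match this.", "test:")) # None
```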

Community
NDevox
  • I completely agree. I normally try to avoid regex as much as possible. But I need to match a lot of other strings in a large number of web pages. – Amith May 19 '15 at 13:27
  • Are they different? Can they not be seen in a similar way? – NDevox May 19 '15 at 13:28
  • Just went through the link you provided. Very interesting. You have made me reconsider my approach!! – Amith May 19 '15 at 13:32
  • @Amith I agree, no strings attached :) I myself like to provide 2 solutions, regex-based and non-regex based. No need this time. – Wiktor Stribiżew May 19 '15 at 13:34
  • @Amith, no problem. Regex is one of those things that is really easy to abuse. – NDevox May 19 '15 at 13:35
  • Hate to revive an old thread. But why is this marked as the answer? The question asks for REGEX. When people search for this specific topic and this question shows up with an unrelated answer it wastes their time. – Luigi Jan 11 '21 at 02:42
  • Because the person who asked the question decided it best fit their problem. Whilst I appreciate the comment and agree, for some the regex by avinash may be better suited, that doesn't make it an invalid or worse answer for many others. I would also ask why anyone thinks regex would be better suited for this problem? (find the text appearing after another specific piece of text) – NDevox Jan 12 '21 at 09:56