Counterpart to PHP’s preg_match in Python

Question

I am planning to move one of my scrapers to Python. I am comfortable using preg_match and preg_match_all in PHP. I am not finding a suitable function in Python similar to preg_match. Could anyone please help me in doing so?

For example, if I want to get the content between <a class="title" and </a>, I use the following function in PHP:

preg_match_all('/a class="title"(.*?)<\/a>/si',$input,$output);

Whereas in Python I am not able to figure out a similar function.

Here's the python regex docs: http://docs.python.org/howto/regex.html — Ben Lee, Jan 30 '12 at 09:39
In Python we don't use regular expressions for parsing HTML, we use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). See http://stackoverflow.com/a/1732454/78845 — johnsyweb, Jan 30 '12 at 09:44

score 14 · Accepted Answer · edited May 23 '17 at 12:30

You are looking for Python's re module.

Take a look at re.findall and re.search.
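
As a rough sketch of how these two functions line up with the PHP call in the question: re.findall plays the role of preg_match_all and re.search the role of preg_match. The sample HTML and the variable names below are made up for illustration; the flags re.S and re.I correspond to the /s and /i modifiers.

```python
import re

# Made-up sample input; the second link checks case-insensitive matching.
html = '''<div>
<a class="title" href="/first">First
post</a>
<A CLASS="title" href="/second">Second post</A>
</div>'''

# re.S (DOTALL) mirrors PHP's /s modifier, re.I (IGNORECASE) mirrors /i.
pattern = re.compile(r'a class="title"(.*?)</a>', re.S | re.I)

# Like preg_match_all: every non-overlapping match. With a single capture
# group, findall returns a list of the captured strings.
all_titles = pattern.findall(html)

# Like preg_match: only the first match, as a match object (or None).
first = pattern.search(html)
first_title = first.group(1) if first else None

print(all_titles)
print(first_title)
```

Note that the captured text still contains the rest of the opening tag (the href attribute), exactly as it would with the PHP pattern.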

And since you mention that you are trying to parse HTML, use an HTML parser for that. There are a couple of options available in Python, such as lxml or BeautifulSoup.

Take a look at this: Why you should not parse HTML with regex.
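
To give an idea of what the parser approach looks like, here is a minimal sketch. It deliberately uses the standard library's html.parser instead of lxml or BeautifulSoup so that it runs without installing anything; the class name TitleLinkParser and the sample markup are hypothetical, and a real scraper would more likely use one of the libraries named above.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collects the text inside every <a class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []      # finished link texts
        self._buffer = None   # text pieces of the link being read, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching link.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


parser = TitleLinkParser()
parser.feed('<a class="title" href="/first">First post</a>'
            '<a href="/other">skip me</a>'
            '<A CLASS="title big" href="/second">Second <b>post</b></A>')
parser.close()
titles = parser.titles
print(titles)
```

Unlike the regex, this copes with extra classes, nested tags inside the link, and upper-case markup without any special handling.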

answered Jan 30 '12 at 09:39 · RanRag

Thanks, gentlemen, for your replies. I have started using BeautifulSoup and I am facing a problem with it. I passed the HTML data to BeautifulSoup and got this error:

    soup = BeautifulSoup(data)
    print soup.prettify()

    line 52, in soup = BeautifulSoup(data)
    File "/home/infoken-user/Desktop/lin/BeautifulSoup.py", line 1519, in __init__
      BeautifulStoneSoup.__init__(self, *args, **kwargs)
    File "/home/infoken-user/Desktop/lin/BeautifulSoup.py", line 1144, ..
      '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
    TypeError: expected string or buffer

– funnyguy, Jan 30 '12 at 12:54

Vasin Yuriy · Answer 2 · 2016-07-22T07:19:55.447

I think you need something like this:

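import re

# `input` is the HTML string (named as in the question; it shadows the builtin).
output = re.search(r'a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)  # whole match; group(1) is only the captured part
    print(output)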

You can add (?s) at the start of the regex to enable dot-all mode, so that . also matches newlines (the equivalent of PHP's /s modifier):

import re

output = re.search(r'(?s)a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
    print(output)
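
A small self-contained check of what (?s) changes, and of the difference between group(0) and group(1), might look like this. The sample string is made up, and flags=re.DOTALL would behave the same as the inline (?s).

```python
import re

# Made-up link whose text spans two lines.
text = '<a class="title" href="/x">Line one\nline two</a>'
pattern = r'a class="title"(.*?)</a>'

# Without dot-all mode, "." stops at the newline, so nothing matches.
no_dotall = re.search(pattern, text, flags=re.IGNORECASE)

# With (?s), "." also matches "\n" and the search succeeds.
with_dotall = re.search('(?s)' + pattern, text, flags=re.IGNORECASE)

whole_match = with_dotall.group(0)  # everything the pattern matched
captured = with_dotall.group(1)     # only what (.*?) matched

print(no_dotall)
print(repr(whole_match))
print(repr(captured))
```

group(1) is usually what a preg_match user wants, since it corresponds to $output[1] in PHP.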

score 2 · Answer 3 · answered Jan 30 '12 at 09:40

You might be interested in reading about Python Regular Expression Operations, the documentation for the re module.

Tudor Constantin
