466

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))
Vukašin Manojlović
user1442957

4 Answers

778

No. Regular expressions in Python are handled by the re module.

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)
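A minimal sketch of the difference, using a made-up sample string (not the asker's data). It passes the flags as a keyword argument rather than inline as (?is), which should behave the same way here:

```python
import re

# Hypothetical sample input: junk after the closing tag, spread over several lines.
article = "<html>\n<body>Hello</body>\n</HTML>\ntracking junk\nmore junk\n"

# str.replace treats its first argument as literal text, so nothing matches.
unchanged = article.replace("</html>.+", "</html>")

# re.sub interprets the pattern as a regex. DOTALL lets '.' match newlines,
# IGNORECASE lets '</html>' match '</HTML>' too.
trimmed = re.sub(r"</html>.+", "</html>", article,
                 flags=re.DOTALL | re.IGNORECASE)

print(unchanged == article)
print(repr(trimmed))
```

Without DOTALL, `.+` stops at the first newline, so only the rest of the line holding the tag would be removed.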
Flame
Ignacio Vazquez-Abrams
  • 2
    How would I apply the re module to my 'article' variable? – user1442957 Jul 13 '12 at 18:05
  • I tried the following to no avail `z.write(re.sub(r' – user1442957 Jul 13 '12 at 18:17
  • 4
    Is the tag not lowercase, or is it followed by a `'\n'`? You can make it case-insensitive (`(?i)` flag) and make `.` match newlines (`(?s)` flag) with `r'(?is) – MRAB Jul 13 '12 at 18:32
  • 3
    Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern. – parvus Jul 08 '21 at 05:14
101

To replace text using a regular expression, use the re.sub function:

sub(pattern, repl, string[, count, flags])

It replaces non-overlapping occurrences of pattern in string with repl. If you need to inspect each match, for instance to extract specific captured groups, you can pass a function as the repl argument; it is called with each match object and must return the replacement string. See the re module documentation for details.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
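As a rough illustration of the function form of repl and of the count argument, here is a small sketch. The helper name double_id is made up for this example:

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return "/" + str(int(match.group(1)) * 2)

# A function passed as repl is called once per match.
doubled = re.sub(r"/(\d+)", double_id, "/andre/23/abobora/43435")

# count limits how many matches are replaced (here: only the first one).
first_only = re.sub(r"a", "b", "banana", count=1)

print(doubled)
print(first_only)
```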
Andre Pena
8

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex based solution.
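One possible variation, sketched here with str.partition, avoids hard-coding the tag length and also copes with input that has no closing tag (article.index raises ValueError in that case). The function name truncate_after is hypothetical, and the match is case-sensitive:

```python
def truncate_after(text, tag="</html>"):
    # partition splits at the first occurrence of tag and returns
    # (text before, the tag itself, text after).
    head, sep, _tail = text.partition(tag)
    # If tag is absent, sep is '' and head is the whole text,
    # so the input comes back unchanged.
    return head + sep

print(truncate_after("<html>body</html>trailing junk"))
print(truncate_after("no closing tag here"))
```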

Julian
    Not so clean; you have to hard-code the length of " – Daniel Griscom Feb 28 '16 at 20:44
  • @DanielGriscom : what about `len(str(' – Ole Aldric Mar 03 '18 at 13:35
  • @OleAnders Better, but then you're duplicating that string, which opens another possibility for error. – Daniel Griscom Mar 03 '18 at 14:30
  • @OleAnders ... and just realized; no need for the `str()`; just use `len(' – Daniel Griscom Mar 03 '18 at 16:00
  • 4
    I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish. – Julian Mar 03 '18 at 18:42
6

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method:

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
se='</html>'
# the with block closes the file, so the output is flushed to disk
with open('out.txt','w') as z:
    z.write(article.split(se)[0]+se)

outputs out.txt as

<html>Larala
Ponta Monta 
</html>
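To make the split versus rsplit choice concrete, here is a small sketch on made-up strings. It assumes the tag may occur more than once, or not at all:

```python
se = "</html>"
article = "<html>a</html>junk</html>more"

# split cuts at the first occurrence of the tag ...
up_to_first = article.split(se, 1)[0] + se
# ... while rsplit with maxsplit=1 cuts at the last one.
up_to_last = article.rsplit(se, 1)[0] + se

# Caveat: when the tag is missing, this approach still appends it.
no_tag = "plain text".split(se, 1)[0] + se

print(up_to_first)
print(up_to_last)
print(no_tag)
```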
norio