How to remove sub-string starting and ending with something?

Question

How can I remove every sub-string that starts and ends with a certain character combination from a string? For example, given:

' bla <span class=""latex""> ... This can be different1 ... </span> blub <span class=""latex""> ... This can be different2 ... </span> bleb'

I want this as the result:

'bla blub bleb'

I tried something like this:

string.replace('<span class=""latex"">' * '</span>', '')

but this does not work.

Is there a way to implement this?

Justin, have you tried Python's `re` package, and specifically the `re.sub()` function? You will need regular expressions to do this. There are plenty of answered questions on the topic of regex and HTML. — mayosten, Oct 18 '19 at 18:11

Alex K. · Answer 1 · 2019-10-18T18:20:26.227

3

Read about the `re.sub` function.

A simple example:

import re

s = ' cvbcx cvbcx <span class=""latex""> ... This can be different ... </span>vcvbcxbvxc'
re.sub(r'<span class=""latex"">.+</span>', '<span class=""latex""></span>', s)

>> ' cvbcx cvbcx <span class=""latex""></span>vcvbcxbvxc'

edited Oct 18 '19 at 18:20

answered Oct 18 '19 at 18:11

Alex K.


This will replace the whole `<span>` element with an empty string, not just the data in between the `<span>` tags. – slider Oct 18 '19 at 18:13
@slider Yep, edited the answer so that a found pattern is replaced with the tag instead of an empty string – Alex K. Oct 18 '19 at 18:22
I updated the question, since my problem is that I have multiple `<span>` elements after each other, and the sub-strings in the middle (here 'blub') get dropped. – Justus Erker Oct 18 '19 at 23:51

jrook · Accepted Answer · 2019-10-19T06:21:25.430

3

This could work:

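>>> import re
>>> s = ' bla <span class=""latex""> ... This can be different1 ... </span> blub <span class=""latex""> ... This can be different2 ... </span> bleb'
>>> x = re.sub(r"""<span class=""latex"">.+?</span>""", "", s)

>>> x
' bla  blub  bleb'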
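To make the difference between the greedy `.+` and the lazy `.+?` concrete, here is a small sketch using the string from the question. The helper name `remove_latex_spans` is made up for illustration, and the `re.DOTALL` flag is an assumption on my part in case a span contains line breaks. The last step collapses the leftover double spaces to get the exact result the OP asked for.

```python
import re

s = (' bla <span class=""latex""> ... This can be different1 ... </span>'
     ' blub <span class=""latex""> ... This can be different2 ... </span> bleb')

# Greedy: .+ runs to the LAST </span>, so ' blub ' is swallowed as well.
greedy = re.sub(r'<span class=""latex"">.+</span>', '', s)

# Lazy: .+? stops at the FIRST </span> after each opening tag.
lazy = re.sub(r'<span class=""latex"">.+?</span>', '', s)


def remove_latex_spans(text):
    """Hypothetical helper: drop every latex span, then tidy the whitespace."""
    # re.DOTALL lets '.' also match newlines (assumption: spans may be multi-line).
    stripped = re.sub(r'<span class=""latex"">.+?</span>', '', text, flags=re.DOTALL)
    # split() with no arguments splits on any run of whitespace.
    return ' '.join(stripped.split())


print(repr(greedy))                 # ' bla  bleb'
print(repr(lazy))                   # ' bla  blub  bleb'
print(repr(remove_latex_spans(s)))  # 'bla blub bleb'
```

Note that the whitespace step also strips the leading and trailing spaces, so skip it if those matter.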

Regex101

EDIT: After clarification by the OP, I changed the answer to use a lazy quantifier instead of a capturing group. While this works, it does not scale to more complex cases (nested spans, for example). For those, the proper solution would be to parse the string and extract what is needed.
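A rough sketch of what "parse the string" could look like, using only the standard library's `html.parser`. The class and function names are made up for illustration, and it assumes the attribute is written with normal quoting (`class="latex"`) rather than the doubled quotes shown in the question. It keeps only the text outside the latex spans, so any other tags are dropped too.

```python
from html.parser import HTMLParser


class LatexSpanStripper(HTMLParser):
    """Hypothetical helper: collect text that is not inside <span class="latex">."""

    def __init__(self):
        super().__init__()
        self.kept = []   # text chunks found outside latex spans
        self.depth = 0   # > 0 while we are inside a span being dropped

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested spans so the right closing tag ends the skip.
            if tag == "span":
                self.depth += 1
        elif tag == "span" and dict(attrs).get("class") == "latex":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.kept.append(data)


def strip_latex_spans(text):
    parser = LatexSpanStripper()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.kept)


nested = 'bla <span class="latex">a <span>b</span> c</span> blub'
print(repr(strip_latex_spans(nested)))  # 'bla  blub'
```

Unlike the lazy regex, this handles a span nested inside a latex span, because the depth counter waits for the matching closing tag.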

edited Oct 19 '19 at 06:21

answered Oct 18 '19 at 18:21

jrook


score 1 · Answer 3 · answered Oct 18 '19 at 18:22

You will need to use groups if you want some parts and not others.

import re

s = ' cvbcx cvbcx <span class=""latex""> ... This can be different ... </span>vcvbcxbvxc'
r = re.search( r'(<span class=""latex"">)(.+)(</span>)', s)

print(s)
# cvbcx cvbcx <span class=""latex""> ... This can be different ... </span>vcvbcxbvxc

# print(r)
# <re.Match object; span=(13, 73), match='<span class=""latex""> ... This can be different >

print(r.group(1), r.group(3))
# <span class=""latex""> </span>

answered Oct 18 '19 at 18:22

shanecandoit


Groups are added by wrapping parts in parentheses. Group indexing starts at 1. – shanecandoit Oct 18 '19 at 18:23
Interesting approach. How can you use it if you have multiple `<span>` elements after each other? (See the updated question.) I want to avoid dropping the parts in the middle, so that text which is not inside a `<span>` is not removed. I tried this: `group = re.search(r"""()(.+)^((?!).)*$(.+)()""", string)` but it didn't work. Do you have an idea? – Justus Erker Oct 19 '19 at 12:09
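For the multiple-span case raised in the comment, one possible sketch is to combine the groups with a lazy quantifier and iterate over all matches with `re.finditer`, while `re.split` returns the text outside the spans. The variable names here are just for illustration.

```python
import re

s = (' bla <span class=""latex""> ... This can be different1 ... </span>'
     ' blub <span class=""latex""> ... This can be different2 ... </span> bleb')

# Same three groups as above, but .+? is lazy so each span is matched separately.
span_re = re.compile(r'(<span class=""latex"">)(.+?)(</span>)')

# Group 2 is the content of each span.
inside = [m.group(2) for m in span_re.finditer(s)]

# Splitting on a pattern WITHOUT capturing groups yields only the text between spans.
outside = re.split(r'<span class=""latex"">.+?</span>', s)

print(inside)   # [' ... This can be different1 ... ', ' ... This can be different2 ... ']
print(outside)  # [' bla ', ' blub ', ' bleb']
```

Joining `outside` gives the same string as the `re.sub` call in the accepted answer.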

score 1 · Answer 4 · answered Oct 18 '19 at 18:27

If you want to keep the data in between:

>>> import re
>>> x
'<span class=""latex""> ... This can be different ... </span>'
>>> d = re.sub('<(/)?span( class="".*"")?(>)', '', x)
>>> d
' ... This can be different ... '
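If the string contains several spans, the greedy `.*` in the pattern above can match across them. A possible alternative, sketched here with a made-up sample string, is to capture the inner text lazily and put it back with the `\1` backreference:

```python
import re

# Hypothetical sample with two spans, using the doubled quotes from the question.
s = 'a <span class=""latex"">x</span> b <span class=""latex"">y</span> c'

# (.*?) captures the inner text of each span; r'\1' re-inserts it without the tags.
d = re.sub(r'<span class=""latex"">(.*?)</span>', r'\1', s)

print(repr(d))  # 'a x b y c'
```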

If you want to keep the tags:

>>> x
'<span class=""latex""> ... This can be different ... </span>'
>>> new_data = 'abc 123 456'
>>> d = re.sub('">.*</', '">{}</'.format(new_data), x)
>>> d
'<span class=""latex"">abc 123 456</span>'
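One caveat with building the replacement via `format`: `re.sub` interprets backslashes in a replacement string, which matters if the new content is LaTeX. A possible workaround, sketched below with illustrative names, is to pass a function as the replacement, because its return value is inserted literally.

```python
import re

x = '<span class=""latex""> ... This can be different ... </span>'
new_data = r'\alpha + 1'  # made-up content containing a backslash

# Groups 1 and 2 hold the opening and closing tag; only the middle is replaced.
tag_re = re.compile(r'(<span class=""latex"">).*?(</span>)')

# A callable replacement is not scanned for escapes such as \a or \1.
d = tag_re.sub(lambda m: m.group(1) + new_data + m.group(2), x)

print(d)  # <span class=""latex"">\alpha + 1</span>
```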
