Removing variable length characters from a string in python

Question

I have strings that are of the form below:

<p>The is a string.</p>
<em>This is another string.</em>

They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().

Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.

I'd like to do this in one line. What I mean is that I want to pass a parameter of the form <*>, like a wildcard on the command line. I was thinking of using the replace() function for this, but I am not sure what the replace() parameter would look like.

For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:

x = x.replace("<..>", "")

Why don't you just use a parser like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) since these are just tags? — Cory Kramer, Jul 19 '14 at 21:05
[HTML should not be parsed with regex](http://stackoverflow.com/a/1732454/198633). Try a parser like Beautiful Soup or etree instead — inspectorG4dget, Jul 19 '14 at 21:06
For the record, because I find this annoying, you picked a TWO-STEP solution (replace and split) when SINGLE-STEP solutions were available, and the solution you picked, which started with a lazy quantifier, migrated in the direction of my regex. — zx81, Jul 19 '14 at 21:32
@zx81. Look. The question isn't asking for a one-step solution. It's asking how to remove `<..>`. When I say I want to do it in one line, I am saying I want to remove `<..>` of all lengths in one line. — Mars, Jul 19 '14 at 21:40
Nah, that's not an acceptable reply to my comment. You said you want to extract the words... That's the purpose... And if in an answer we can show you a way to do it in ONE step, discarding what you have tried before, that's always what we try to do. This kind of thing happens, and I normally don't rant about it, but it surprises me coming from someone with 500+ rep. — zx81, Jul 19 '14 at 21:44
@zx81. I asked a specific question and you answered something else. Suppose my string was `<p>This isn't a string</p>.`. Wouldn't your example split `isn't` into `isn'` and `t`? And maybe you'll come up with a correction. Then what? I give you another counterexample and you'll correct that too? I wanted a solution to something specific. Your answer assumes more than what was asked. – Mars, Jul 19 '14 at 22:16

score 3 · Accepted Answer · answered Jul 19 '14 at 21:06

Unfortunately, str.replace does not support regex patterns; it only replaces literal substrings. You need to use re.sub for this:

>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'

[^>]* matches zero or more characters that are not >.
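
Since the end goal is a list of words, the substitution can be followed by split(), as the comments below point out. Here is a minimal sketch, assuming each line contains only simple, well-formed tags; the names `TAG` and `words_from_line` are made up for illustration:

```python
import re

# Compile once and reuse for every line read from the file.
# [^>]* consumes everything up to the next '>'.
TAG = re.compile(r"<[^>]*>")

def words_from_line(line):
    """Remove every <...> tag from one line, then split on whitespace."""
    return TAG.sub("", line).split()

print(words_from_line("<p>The is a string.</p>"))
# ['The', 'is', 'a', 'string.']
```

Note that split() only breaks on whitespace, so trailing punctuation stays attached to the last word.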

It's probably better to use `(<[^>]*>)?` for the regex. – Ed L Jul 19 '14 at 21:08
He wants to retrieve the words, so this will be a two-step solution, right? Some splitting will need to take place. – zx81 Jul 19 '14 at 21:31
@zx81 - Well, in that case, all he needs to do is `sub("<[^>]*>", "", "<p>The is a string.</p>").split()`. We don't need anything fancy because he said that he is getting the lines one at a time and that they are all of the same format. – Jul 19 '14 at 22:01

score 2 · Answer 2 · edited May 23 '17 at 10:25

No Need for a 2-Step Solution

You don't need to (1) split and then (2) replace. Each of the two solutions below does it in a single step.

Option 1: Match All Instead of Splitting

Matching all and splitting are two sides of the same coin, and in this case it is safer to match all:

<[^>]+>|(\w+)

The words will be in Group 1.

Use it like this:

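import re

subject = '<p>The is a string.</p><em>This is another string.</em>'
# Left branch consumes tags (Group 1 stays empty); right branch captures words.
regex = re.compile(r'<[^>]+>|(\w+)')
# findall returns Group 1 for every match, so filter out the empty strings left by tags.
matches = [group for group in regex.findall(subject) if group]
print(matches)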

Output

['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']

Discussion

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
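
One caveat raised in the comments on the question: `\w+` stops at an apostrophe, so a word like isn't comes out in two pieces. A possible adjustment, offered only as a sketch and not part of the original pattern, lets a word continue across an apostrophe. The names `WORD_OR_TAG` and `extract_words` are illustrative:

```python
import re

# Left branch: a complete tag, matched and discarded.
# Right branch: a word, optionally continued by apostrophe + more word characters.
WORD_OR_TAG = re.compile(r"<[^>]+>|(\w+(?:'\w+)*)")

def extract_words(text):
    """Return the words captured in Group 1, skipping the empty hits left by tags."""
    return [word for word in WORD_OR_TAG.findall(text) if word]

print(extract_words("<p>This isn't a string.</p>"))
# ['This', "isn't", 'a', 'string']
```

Other punctuation inside words, such as hyphens, would need its own handling, which is the kind of open-ended correction the comment thread argues about.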

Option 2: One Single Split

<[^>]+>|[ .]

On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.

Output

This
is
a
string
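
The answer does not show the code for this option, so the following is a sketch of how the split might be run. The input string is an assumption chosen to match the output above. Because delimiters can sit next to each other (a period followed directly by a closing tag), re.split returns some empty strings, which are filtered out:

```python
import re

# Assumed input; the original answer does not show which string was used.
subject = '<p>This is a string.</p>'

# Split on complete tags, spaces, or periods.
parts = re.split(r'<[^>]+>|[ .]', subject)

# Adjacent delimiters and delimiters at the ends produce empty strings; drop them.
words = [part for part in parts if part]

for word in words:
    print(word)
```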

answered Jul 19 '14 at 21:07

zx81

FYI: Added simple code (really two lines) for the Group 1 option, which IMO is more solid than the split option. – zx81 Jul 19 '14 at 21:18
Hey, I gave two solutions that require a SINGLE step: you don't need to 1 split, then 2 replace. Why did you choose a 2-step solution? That makes no sense to me. My Option 1 is ONE step (just match all). My Option 2 is ONE step (just split) – zx81 Jul 19 '14 at 21:25
On top of that, the solution you picked was edited to essentially use my regex `<[^>]+>`. – zx81 Jul 19 '14 at 21:30
My question was how to remove all `<..>` from a string. You put removing and splitting together but I was only asking about removing it, which the accepted answer provides. I do want to split also but my question didn't ask how to perform the two in one step. – Mars Jul 19 '14 at 21:38
But you **ARE** splitting, and answers on SO always try to give you a better way to do things. If you can do it in one step, that's what we show you. I have nothing against the replacement by @iCodez, but in my view you made an extraordinarily poor choice, and I am sure that even he would agree, as I would if the situation were reversed. I normally don't rant about this kind of thing, it happens a lot, but usually not with someone with over 500 rep. – zx81 Jul 19 '14 at 21:42
