Splitting on a lookahead

Question

I'm trying to split on a lookahead, but it doesn't work for the last occurrence. How do I do this?

my_str = 'HRC&#226;&#128;&#153;s'
import re
print(re.split(r'.(?=&)', my_str))

My output:

['HR', '&#226', '&#128', '&#153;s']

My desired output:

['HRC', '&#226', '&#128', '&#153', 's']

The input (which contains no colons) does not match the stated output (which has a colon in the last item). — TigerhawkT3, Feb 09 '17 at 21:05
@TigerhawkT3 I changed that manually because if I leave as `;` string converts to ascii character — mtkilic, Feb 09 '17 at 21:06
@AJNeufeld Because I will convert each `` to ascii character — mtkilic, Feb 09 '17 at 21:07
Irrelevant...why would you expect that split to change `s` into `[, s]`? Either it's your expected output after the call to `re.split` or it isn't. — , Feb 09 '17 at 21:08
And for that matter, there are no (visible) characters between `HRC` and `â` in your string above. I'm not sure what you think you're trying to accomplish with any kind of split, here... — , Feb 09 '17 at 21:09
@JackManey After i split each characters to `â` etc. I will get rid of `` then use `chr(226)` to get ascii chr. — mtkilic, Feb 09 '17 at 21:14
If all you want to do is decode HTML entities, you should do that instead of messing with regex. See the linked duplicate. — TigerhawkT3, Feb 09 '17 at 21:15
@MahmutKilic I'm not sure if you know what it means to `split` a string into a list of strings. When you perform a `split`, the delimiter (or resulting match from a delimiter pattern, in the case of `re.split`) **does not show up in any of the elements in the resulting list of strings**. Why would you expect `'HRCâ'` to split into `['HRC', 'â']`??? — , Feb 09 '17 at 21:15
@JackManey What will be your suggestion for me then? I am kinda struggling how to solve this issue. — mtkilic, Feb 09 '17 at 21:19
@MahmutKilic Don't `split`. If you want to decode HTML entities, then just decode them. — , Feb 09 '17 at 21:20
@JackManey http://paste.ubuntu.com/23962832/ Take a look at this. This is what I am trying to do. — mtkilic, Feb 09 '17 at 21:21
i do not know how to specifically pick each `â` from text file — mtkilic, Feb 09 '17 at 21:21
Well, are the things you're trying to decode always of the form `` followed by digits? — , Feb 09 '17 at 21:22
Actually, scratch that, **just decode the damn HTML entities**. Looked at the linked duplicate to this question: http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string?noredirect=1&lq=1 — , Feb 09 '17 at 21:24
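
The advice in the comments above, to decode the entities directly instead of splitting, could be sketched with the standard library's `html.unescape`. This is a minimal sketch, not the commenters' own code. The cp1252 round trip at the end rests on an assumption: that the three references in the example are really the UTF-8 bytes of a single character.

```python
import html

my_str = 'HRC&#226;&#128;&#153;s'

# html.unescape follows the HTML5 rules, under which numeric references
# in the range 128-159 are read as Windows-1252 characters.
decoded = html.unescape(my_str)
print(repr(decoded))

# Assumption: the references are UTF-8 bytes. Undo the Windows-1252
# reading and decode the resulting bytes as UTF-8.
repaired = decoded.encode('cp1252').decode('utf-8')
print(repaired)
```

If the assumption does not hold for some input, the `encode`/`decode` step can raise `UnicodeError`, so it should be applied only where the data is known to be double-encoded.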

score 3 · Accepted Answer · answered Feb 09 '17 at 21:12

A solution using the re.findall() function, which collects the pieces directly instead of splitting on the gaps between them:

import re

my_str = 'HRC&#226;&#128;&#153;s'
# Match either a run of word characters or '&#' plus digits that is followed by ';'
result = re.findall(r'\w+|&#\d+(?=;)', my_str)
print(result)

The output:

['HRC', '&#226', '&#128', '&#153', 's']
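
For comparison, here is a small sketch of why the split in the question loses characters and how the tokens from `findall` could then be turned into characters, as the asker describes in the comments. The helper `tokens_to_text` is hypothetical and not part of the original answer.

```python
import re

my_str = 'HRC&#226;&#128;&#153;s'

# The "." is consumed as part of each delimiter, so the "C" and the first
# two ";" vanish; the final ";" survives because no "&" follows it.
split_result = re.split(r'.(?=&)', my_str)
print(split_result)

# findall describes the pieces themselves rather than the delimiters.
tokens = re.findall(r'\w+|&#\d+(?=;)', my_str)
print(tokens)

def tokens_to_text(tokens):
    """Hypothetical helper: replace each '&#NNN' token with chr(NNN)."""
    return ''.join(chr(int(t[2:])) if t.startswith('&#') else t for t in tokens)

text = tokens_to_text(tokens)
print(repr(text))
```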

RomanPerekhrest

Thank you so much this works great!! – mtkilic Feb 09 '17 at 21:15
@MahmutKilic, you're welcome – RomanPerekhrest Feb 09 '17 at 21:15
An alternative: `result = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), my_str)` – AJNeufeld Feb 09 '17 at 21:20
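
The one-step substitution suggested in this comment could be sketched as follows, assuming the pattern is meant to include the `&#` prefix so that the whole reference is replaced. The function name `decode_numeric_entities` is hypothetical.

```python
import re

def decode_numeric_entities(text):
    """Replace each decimal reference such as '&#226;' with chr(226)."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

result = decode_numeric_entities('HRC&#226;&#128;&#153;s')
print(repr(result))
```

This handles only decimal references; named references such as `&amp;` and hexadecimal ones such as `&#xe2;` are left untouched.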
