I'm trying to split on a lookahead, but it doesn't work for the last occurrence. How do I do this?

import re

my_str = 'HRC&#226;&#128;&#153;s'
print(re.split(r'.(?=&)', my_str))

My output:

['HR', '&#226', '&#128', '&#153;s']

My desired output:

['HRC', '&#226', '&#128', '&#153', 's']
TigerhawkT3
mtkilic
  • Hint: It has something to do with `.`... –  Feb 09 '17 at 21:04
  • The input (which contains no colons) does not match the stated output (which has a colon in the last item). – TigerhawkT3 Feb 09 '17 at 21:05
  • @JackManey I did remove `.` but still same – mtkilic Feb 09 '17 at 21:06
  • Why would you expect `'...&#153;s'` to split into `[... '&#153;', 's']`? – AJNeufeld Feb 09 '17 at 21:06
  • @TigerhawkT3 I changed that manually because if I leave as `;` string converts to ascii character – mtkilic Feb 09 '17 at 21:06
  • @AJNeufeld Because I will convert each `&#153;` to an ascii character – mtkilic Feb 09 '17 at 21:07
  • Irrelevant...why would you expect that split to change `&#153;s` into `[&#153;, s]`? Either it's your expected output after the call to `re.split` or it isn't. –  Feb 09 '17 at 21:08
  • And for that matter, there are no (visible) characters between `HRC` and `&#226;` in your string above. I'm not sure what you think you're trying to accomplish with any kind of split, here... –  Feb 09 '17 at 21:09
  • @JackManey After I split the string into pieces like `&#226;` etc., I will get rid of the `&#` and then use `chr(226)` to get the ascii chr. – mtkilic Feb 09 '17 at 21:14
  • If all you want to do is decode HTML entities, you should do that instead of messing with regex. See the linked duplicate. – TigerhawkT3 Feb 09 '17 at 21:15
  • @MahmutKilic I'm not sure if you know what it means to `split` a string into a list of strings. When you perform a `split`, the delimiter (or resulting match from a delimiter pattern, in the case of `re.split`) **does not show up in any of the elements in the resulting list of strings**. Why would you expect `'HRC&#226;'` to split into `['HRC', '&#226;']`??? –  Feb 09 '17 at 21:15
  • @JackManey What will be your suggestion for me then? I am kinda struggling how to solve this issue. – mtkilic Feb 09 '17 at 21:19
  • @MahmutKilic Don't `split`. If you want to decode HTML entities, then just decode them. –  Feb 09 '17 at 21:20
  • @JackManey http://paste.ubuntu.com/23962832/ Take a look at this. This is what I am trying to do. – mtkilic Feb 09 '17 at 21:21
  • I do not know how to specifically pick each `&#226;` from the text file – mtkilic Feb 09 '17 at 21:21
  • Well, are the things you're trying to decode always of the form `&#` followed by digits? –  Feb 09 '17 at 21:22
  • Actually, scratch that, **just decode the damn HTML entities**. Look at the linked duplicate to this question: http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string?noredirect=1&lq=1 –  Feb 09 '17 at 21:24
  • @JackManey Thank you i will take look at the link now – mtkilic Feb 09 '17 at 21:25
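
For reference, the "just decode the entities" advice from the comments might look roughly like the sketch below. It assumes Python 3.4+ (for `html.unescape`) and assumes that the three entities are really the UTF-8 bytes of a single character that were misread as Windows-1252 text, which is a guess based on the values 226, 128, 153 and not something stated in the question. The variable names are only for illustration.

```python
import html

raw = 'HRC&#226;&#128;&#153;s'

# Step 1: decode the numeric character references.
# html.unescape follows the HTML5 rules, so 128 and 153 come out as
# the Windows-1252 characters '€' and '™' rather than control characters.
decoded = html.unescape(raw)
print(decoded)   # HRCâ€™s

# Step 2 (assumption: the text is UTF-8 that was misread as cp1252):
# turn the characters back into bytes and decode them as UTF-8.
repaired = decoded.encode('cp1252').decode('utf-8')
print(repaired)  # HRC’s
```

If the file contains entities that are not part of such a mis-decoded sequence, step 2 can raise `UnicodeEncodeError` or `UnicodeDecodeError`, so it is worth checking the data before applying it blindly.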

1 Answer

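A solution using `re.findall()`, which collects the matching pieces instead of splitting on a delimiter:

import re

my_str = 'HRC&#226;&#128;&#153;s'
# Match either a run of word characters or an entity such as &#226 that is followed by ';'
result = re.findall(r'\w+|&#\d+(?=;)', my_str)
print(result)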

The output:

['HRC', '&#226', '&#128', '&#153', 's']
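
If the end goal is the decoded text rather than the list of pieces, the same idea can be pushed one step further. The following is only a sketch: `decode_byte_entities` is a hypothetical helper name, and it assumes every `&#NNN;` in a run stands for one byte (0 to 255) of a UTF-8 encoded character, which fits 226, 128, 153 but is not confirmed by the question.

```python
import re

def decode_byte_entities(text):
    """Replace each run of &#NNN; entities with the text those bytes encode.

    Assumption: every NNN is a single byte value and each run is valid UTF-8.
    """
    def _decode_run(match):
        # Pull the numbers out of one run, e.g. '&#226;&#128;&#153;' -> [226, 128, 153]
        byte_values = [int(n) for n in re.findall(r'&#(\d+);', match.group(0))]
        return bytes(byte_values).decode('utf-8')

    # A run is one or more consecutive entities with nothing in between.
    return re.sub(r'(?:&#\d+;)+', _decode_run, text)

my_str = 'HRC&#226;&#128;&#153;s'
print(decode_byte_entities(my_str))  # HRC’s
```

Note that `bytes()` raises `ValueError` for values above 255 and `decode` raises `UnicodeDecodeError` for byte runs that are not valid UTF-8, so input that breaks the assumption fails loudly instead of producing wrong text.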
RomanPerekhrest