32

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as these: áéíóúñü¿.

If one of these letters is in str1, the hashtag is only saved up to the letter before it. So, for example, #yogenfrüz would be saved as #yogenfr.
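A minimal sketch that reproduces the behaviour described above. The sample string is made up for illustration and is not from the original post; the 'ü' is written as an escape so that it is a single precomposed code point.

```python
import re

# Hypothetical sample input (U+00FC is a precomposed 'ü').
str1 = "Dessert at #yogenfr\u00fcz with #ice_cream"

# The ASCII-only class stops at the first character outside A-Z, a-z, 0-9 and _.
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'ice_cream']
```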

I need to account for all the accented letters used in German, Dutch, French and Spanish, so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

Ibrahim Najjar
deadlock

5 Answers

34

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo

EDIT: Check the useful comment below from Martijn Pieters.
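To make that caveat concrete, here is a small sketch assuming Python 3, where str patterns are Unicode-aware by default. The sample hashtags are hypothetical. The same word is written once with a precomposed 'é' and once as 'e' followed by U+0301:

```python
import re
import unicodedata

# Hypothetical inputs: the same hashtag in precomposed and decomposed form.
composed = "#caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "#cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

pattern = re.compile(r'#(\w+)')

print(pattern.findall(composed))    # ['café']
print(pattern.findall(decomposed))  # ['cafe'], the combining accent is not matched by \w

# Normalising to NFC first merges 'e' + U+0301 into the single code point 'é'.
normalized = unicodedata.normalize('NFC', decomposed)
print(pattern.findall(normalized))  # ['café']
```

Normalising only helps when a precomposed form exists for the base letter and mark in question.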

Ibrahim Najjar
  • 11
    Small caveat: `\w` won't match combined codepoints, so `a` and [U+0301 COMBINING ACUTE ACCENT](https://codepoints.net/U+0301) won't be matched, even though that *prints* as `á`. You may want to normalise to NFC, first. – Martijn Pieters Oct 02 '16 at 13:09
  • 1
    @MartijnPieters Thanks for sharing, always something extra to learn. – Ibrahim Najjar Oct 04 '16 at 14:29
  • @IbrahimNajjar can you implement the fix mentioned by Martijn Pieters to your solution? Thanks. – Robert Valencia Apr 17 '17 at 18:00
  • 2
    @RobertValencia Unless you really come across the situation he describes then my solution still works with accented characters. I am honestly not a Unicode expert and don't know the details exactly but if you want to normalize like he suggests then check the other answer to this question. Hope that helps – Ibrahim Najjar Apr 22 '17 at 15:26
  • Interestingly enough, I think I'm experiencing the exact problem @MartijnPieters described, except with é and e. I used the solution suggested by Berk below, and then decoded bytes object back to a string. Thanks all! – bddicken Sep 13 '17 at 18:31
18

I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

which will return ['yogenfrüz']
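A quick sketch of what this range does and does not cover, using made-up sample hashtags. The range is written with `\u00c0-\u00ff` escapes, which is the same as À-ÿ. Two things worth knowing: the range also contains the symbols × (215) and ÷ (247), and letters outside it, such as the French œ (U+0153), still cut the match short.

```python
import re

# Hypothetical sample input: 'ü' (U+00FC), '×' (U+00D7) and 'œ' (U+0153).
str1 = "#yogenfr\u00fcz #3\u00d74 #c\u0153ur"

# \u00c0-\u00ff is the same range as À-ÿ in the answer above.
hashtags = re.findall(r'#([A-Za-z0-9_\u00c0-\u00ff]+)', str1)
print(hashtags)  # ['yogenfrüz', '3×4', 'c']
```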

Hope this helps someone else.

zanga
  • This answer is so elegant. It never occurred to me such range could be used. Thank you. – ecv Oct 04 '22 at 09:48
  • Is that a typo that underscore before the À-ÿ tho? – ecv Oct 04 '22 at 09:49
  • 1
    Thank you @ecv, the underscores comes from the question, he wanted to include an underscore based on the original post, I just added the range of accented characters – zanga Oct 05 '22 at 20:10
4

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a? Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is as simple as the two lines above.

Explicit example:

>>> # Python 2 interactive session
>>> import unicodedata
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

Check this answer, which helped me a lot: How to convert unicode accented characters to pure ascii without accents?
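For Python 3, where encode() returns a bytes object, a rough equivalent could look like the sketch below. The helper name strip_accents is made up for illustration. Note that this removes the accents rather than keeping them, so the hashtag text changes.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents dropped (hypothetical helper, not a library function)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks;
    # decode() turns the resulting bytes back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('\u00e0\u00e0'))     # aa
print(strip_accents('#yogenfr\u00fcz'))  # #yogenfruz
```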

Community
Berk
  • I downvoted because this is the opposite of what OP wants. They want to account for accents; so removing them is not a solution. – bfontaine Feb 10 '22 at 20:28
0

Here's an update to Ibrahim Najjar's original answer, based on the comment Martijn Pieters made on it and on another answer Martijn Pieters gave at https://stackoverflow.com/a/16467505/5302861:

import re
import unicodedata

s = "#ábá123"
# Normalise to NFC so that base letter + combining mark pairs become single code points
n = unicodedata.normalize('NFC', s)

print(n)
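# \w matches the precomposed accented letters (re.UNICODE is already the default for str patterns in Python 3)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
# Print the original string and the extracted hashtag with their lengths for comparison
print(s, len(s), c, len(c))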
Shabbir Khan
  • @shabbir_khan not all base character and combining diacritic combinations have a precomposed form, e.g. the word kɔ̈ɔ̈r is identical in NFC, NFKC, NFKC_CF, NFD and NFKD. – Andj Mar 20 '23 at 06:23
0

Building on all the other answers:

The key problem is that the re module differs in significant ways from other regular expression engines. In theory, Unicode's definition of the \w metacharacter would do what the question requires, but the re module does not implement Unicode's definition of \w.

The easy solution is to swap the regular expression engine for one that is more compatible. The easiest way is to install the regex module and use it. The code that some of the other answers give will then work as the question requires.

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#(\w+)', ud.normalize("NFC",str1))

Or, if you only want to focus on the Latin script, including non-spacing marks (i.e. combining diacritics):

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#([\p{Latin}\p{Mn}]+)', ud.normalize("NFC",str1))

P.S. I have used unicodedataplus, which is a drop-in replacement for unicodedata. It has additional methods and is kept up to date with Unicode versions. With the unicodedata module, updating the Unicode version requires updating Python.
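If installing a third-party module is not an option, a standard-library-only sketch is shown below. This is not the regex-module approach above but a swapped-in alternative: it scans the characters after each '#' and accepts letters, digits, underscores and combining marks. The function names and the sample string are made up for illustration.

```python
import re
import unicodedata

def is_tag_char(ch):
    # Letters, digits, underscore, or any combining mark (categories Mn, Mc, Me).
    return ch == '_' or ch.isalnum() or unicodedata.category(ch).startswith('M')

def find_hashtags(text):
    """Collect the text after each '#' up to the first non-tag character."""
    tags = []
    i = 0
    while i < len(text):
        if text[i] == '#':
            j = i + 1
            while j < len(text) and is_tag_char(text[j]):
                j += 1
            if j > i + 1:  # skip a bare '#'
                tags.append(text[i + 1:j])
            i = j
        else:
            i += 1
    return tags

# 'ɔ' (U+0254) + U+0308 has no precomposed form, so NFC cannot help here.
sample = "#k\u0254\u0308\u0254\u0308r and #yogenfr\u00fcz"

print(re.findall(r'#(\w+)', sample))  # ['kɔ', 'yogenfrüz'], re's \w stops at the combining mark
print(find_hashtags(sample))          # both hashtags in full, combining marks included
```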

Andj