Why is re.findall behaving like this? (python regex)

Question

I made a small program in Python that searches through a music website and collects music data. Each entry has the format [artist] - [music name] [music file format]. At first I used re.search to find a certain artist (I used regex because the music info contains some other characters and irregularities, and the only indicator for finding the artist was the - following the artist).

Somehow it didn't work, so I changed it to re.findall just in case, but it still didn't work. Since I'm a beginner at Python I thought I did something wrong, so I wrote some test code to study what was wrong. And this is what I got.

When I changed the x string (which would be the music info) and ran re.findall again, it gave me a different result (none). I 100% thought the result would be the same. Why is it behaving like this? And could this be the reason why my original code's re.search and re.findall weren't working?

I've included the code just in case (it uses Selenium).

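import re

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# 'driver' (a Selenium WebDriver) and 'ARTIST' are defined elsewhere.
idx = 1
while True:
    try:
        # XPath of the idx-th title link on the page
        hxp1 = "(//h3[@class='entry-title td-module-title']/a)[" + str(idx) + "]"

        text = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, hxp1)))

        # info = eg) 'Michael Jackson - Beat it [FLAC, MP3, WAV]'
        info = text.get_attribute('title')  # get 'info' as string

        # ARTIST = eg) 'Michael Jackson'
        regex = ARTIST + ' - '
        match = re.findall(regex, info)  # or use re.search

        # do something with 'match'...

        idx += 1

    except Exception:
        # do something...
        break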

Two things: 1) are you sure the `-` is a hyphen and not some Unicode (e[mn]-)?dash? 2) Are you sure the spaces are regular spaces and not some hard spaces? Try `Minami\s[-—–]\s`. — Wiktor Stribiżew, Aug 26 '21 at 17:51
@WiktorStribiżew thank you `Minami\s[-—–]\s` works. May I ask the difference between hard and regular space? And also what does `[---]` mean? — Maxjo, Aug 26 '21 at 17:56
There are 20 different space characters in Unicode. https://jkorpela.fi/chars/spaces.html — Tim Roberts, Aug 26 '21 at 17:58
@Maxjo there are multiple different text characters that may look like dashes to you, or like blank spaces, of varying widths depending on not only the font you use, but the context in which the text is written. Your regex can only replace the text if it looks for the version that the text actually contains. `[-—–]` that you copied and pasted is **not the same text as** the `[---]` that you typed by hand. — Karl Knechtel, Aug 26 '21 at 18:16
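
To check which characters a title really contains, one option is to print the code point and Unicode name of each non-alphanumeric character. This is only a minimal sketch: the helper name `describe_chars` and the sample title are hypothetical, not taken from the question.

```python
import unicodedata

def describe_chars(text):
    """Return (character, code point, Unicode name) for each non-alphanumeric character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum()
    ]

# Hypothetical title: a NO-BREAK SPACE, then an EN DASH, then a regular space.
info = "Minami\u00a0\u2013 Some Song [FLAC]"
for row in describe_chars(info):
    print(row)
```

If the output lists something other than SPACE and HYPHEN-MINUS around the artist name, a pattern typed with a plain space and `-` cannot match it.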

score 1 · Accepted Answer · answered Aug 26 '21 at 18:09

It seems you need to make sure you match:

- any Unicode whitespace (i.e. \s in Python 3.x, or (?u)\s in Python 2.x, see re documentation: "Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).")
- any Unicode hyphens (see Searching for all Unicode variation of hyphens in Python).

Combining all that into your regex:

Minami\s[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]\s

In your case, if you just need to support en-dash/em-dash/hyphen chars and any Unicode whitespace chars, you can use

Minami\s[-—–]\s
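
As a rough illustration of the difference, the sketch below runs both a literal pattern and the tolerant pattern against the same string. The sample title is hypothetical (it assumes the site uses a no-break space and an en dash), and `re.escape` is added here as a precaution in case the artist name contains regex metacharacters.

```python
import re

# Hypothetical title: the separator is NO-BREAK SPACE, EN DASH, SPACE.
info = "Minami\u00a0\u2013 Some Song [FLAC]"
ARTIST = "Minami"

# Literal ASCII space and hyphen-minus, as in the question's code.
literal = re.findall(re.escape(ARTIST) + " - ", info)

# \s for any Unicode whitespace, plus a small class of dash characters.
DASHES = "[-\u2014\u2013]"  # hyphen-minus, em dash, en dash
tolerant = re.findall(re.escape(ARTIST) + r"\s" + DASHES + r"\s", info)

print(literal)   # empty list: the literal pattern finds nothing
print(tolerant)  # one match, containing the original separator characters
```

The two patterns look almost identical on screen, which is why the literal one appears to fail for no reason.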

Another possibility is to [normalize](https://stackoverflow.com/questions/16467479/normalizing-unicode) the string first. — Karl Knechtel, Aug 26 '21 at 18:17
@KarlKnechtel When the task is to check if there is a match or not, yes. If the string needs to be extracted, then it depends on the requirements. — Wiktor Stribiżew, Aug 26 '21 at 18:21
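
The normalization idea from the comments could look roughly like the sketch below. One caveat: NFKC normalization turns a no-break space into a regular space, but it leaves en and em dashes unchanged, so this sketch adds its own translation table for dashes. The names `normalize_title` and `DASH_TABLE`, the choice of dash characters, and the sample title are all assumptions for illustration.

```python
import re
import unicodedata

# Assumption: these dash-like characters should all be treated as "-".
DASH_TABLE = str.maketrans({ch: "-" for ch in "\u2010\u2011\u2012\u2013\u2014\u2015"})

def normalize_title(text):
    """Fold compatibility characters, dash variants, and whitespace runs."""
    text = unicodedata.normalize("NFKC", text)  # e.g. NO-BREAK SPACE becomes SPACE
    text = text.translate(DASH_TABLE)           # NFKC leaves en/em dashes alone
    return re.sub(r"\s+", " ", text).strip()    # collapse remaining whitespace

# Hypothetical title with a no-break space and an en dash.
info = "Minami\u00a0\u2013 Some Song [FLAC]"
clean = normalize_title(info)
print("Minami - " in clean)
```

After this step a plain `ARTIST + ' - '` check works, but the string is no longer the original text, which matters if the matched text has to be extracted exactly as it appeared on the page.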
