Can you please help me understand the following line of code?

import re 
a= re.findall('[А-Яа-я-\s]+', string)

I am a bit confused by the pattern that is being searched for in the string. As I read it, a match should start with А and end with any character between А and я, separated by - and a space. But what does the second term, Яа, stand for?

Alberto Alvarez
    `Яа` doesn't stand for anything. It's two ranges, `А-Я` and `а-я`. The first is the uppercase Cyrillic letters, the second is lowercase letters. – Barmar Jan 06 '23 at 20:34
    Why do you think it should start with `А`? That's inside a character set, so it's the start of a character range, not the start of the pattern. – Barmar Jan 06 '23 at 20:35

1 Answer

[         ]      any of the characters in here
 А-Я             any character between А and Я, inclusive (uppercase Cyrillic)
    а-я          any character between а and я, inclusive (lowercase Cyrillic)
       -         a literal -   (Python's re accepts it here, but it is clearer at the very start or end of the class, or escaped as \-)
        \s       any whitespace character
           +     one or more of the preceding class of characters

[А-Яа-я-\s]+     a run of one or more characters, each of which is a letter in А-Я or а-я, a hyphen, or whitespace
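
If the position of the hyphen is a concern, a quick check like the one below can help. It is only a sketch: the sample text is made up for illustration, and it simply compares the original class with a variant where the hyphen is moved to the end.

```python
import re

# Hypothetical sample text, not taken from the question.
sample = "Санкт-Петербург 2023"

original = re.compile(r'[А-Яа-я-\s]+')  # hyphen sits between a range and \s
explicit = re.compile(r'[А-Яа-я\s-]+')  # hyphen moved to the end of the class

# Both classes should accept exactly the same characters.
same_result = original.findall(sample) == explicit.findall(sample)

print(original.findall(sample))  # ['Санкт-Петербург ']
print(same_result)               # True
```

Note the trailing space in the match: whitespace is part of the class, so it is included in the run.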

The [] is called a "character class" in regex, and it basically says "any one of the characters listed inside here is valid". The + then means "one or more occurrences of the preceding character or class". Python has a Regular Expression HOWTO that you might find useful to read through.
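
To see what this means in practice, here is a minimal sketch of the call from the question. The input string is hypothetical, chosen to mix Cyrillic and Latin text; the raw-string prefix r'' is an addition that keeps \s from being read as a string escape.

```python
import re

# Hypothetical input; the question does not show the real string.
string = "Привет, мир! Hello world - это тест-драйв."

# Each match is a maximal run of Cyrillic letters, hyphens and whitespace.
a = re.findall(r'[А-Яа-я-\s]+', string)
print(a)
# ['Привет', ' мир', ' ', ' ', ' - это тест-драйв']

# Whitespace-only runs also match, so you may want to filter them out.
words = [m.strip() for m in a if m.strip()]
print(words)
# ['Привет', 'мир', '- это тест-драйв']
```

Nothing forces a match to start with А: any character from the class can begin or end a run, which is why the lone spaces around "Hello" show up as matches of their own.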

Green Cloak Guy