I'm trying to extract salaries from a list of strings. I'm using re.findall(), but it returns many empty strings along with the salaries, which causes problems later in my code.


sal= '41 000€ à 63 000€ / an' #this is a sample string for which i have errors

regex = ' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'#this is my regex

re.findall(regex,sal)[0]
#returns '41 000' as expected but:
re.findall(regex,sal)[1]
#returns: '' 
#Desired result : '63 000'

#the whole list of matches is like this:
['41 000',
 '',
 '',
 '',
 '',
 '',
 '',
 '63 000',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
# I would prefer ['41 000','63 000']

Can anyone help? Thanks

Wiktor Stribiżew
Ceal Clem
  • Your pattern can match an empty string, so actually you *asked* for it. What is the pattern you want to match? Numbers with a space as digit grouping symbol? Try `r'(?<!\d)\d{1,3}(?: \d{3})*(?!\d)'` – Wiktor Stribiżew Apr 04 '19 at 10:11
  • You could try this pattern `(\d+(?: \d{1,3})?)€` with findall to return only the salaries. [Demo](https://regex101.com/r/EUCGOw/1) – The fourth bird Apr 04 '19 at 10:11
  • np.concatenate(re.findall(regex,sal)[0],re.findall(regex,sal)[1]) – mohan111 Apr 04 '19 at 10:11
  • Do you want to extract only the numbers that are followed with `€`? Try `r'(?<!\d)(\d{1,3}(?:[ \xA0]\d{3})*)\s*€'` then, or `r'(?<!\d)(\d+|\d{1,3}(?:[ \xA0]\d{3})*)\s*€'`. See https://regex101.com/r/rwbpTx/1 – Wiktor Stribiżew Apr 04 '19 at 10:18
  • thank you everyone! – Ceal Clem May 20 '19 at 15:06

1 Answer

When the pattern contains a capturing group, re.findall returns the text captured by that group rather than the whole match. Your pattern has a single group in which almost everything is optional, so the group can match an empty string, and that is where the empty strings in the result come from.

In your pattern, [0-9]* matches a digit zero or more times. If there is no limit on the number of leading digits, you could use [0-9]+ instead so that at least one digit is required.
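
To make the difference concrete, here is a small sketch that runs the pattern from the question next to a variant where only the leading `[0-9]*` is changed to `[0-9]+`. The variable names `optional` and `required` are made up for this illustration, and the variant is only meant to show the effect of the quantifier, not to be the final pattern.

```python
import re

sal = '41 000€ à 63 000€ / an'

# Pattern from the question: every part of the group is optional,
# so the group can succeed while capturing nothing at all.
optional = r' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'

# Same pattern, but the leading digit run is now mandatory.
required = r' ?([0-9]+ ?[0-9]?[0-9]?[0-9]?)'

with_empties = re.findall(optional, sal)
without_empties = re.findall(required, sal)

print('' in with_empties)   # True: empty matches are returned
print(without_empties)      # ['41 000', '63 000']
```

The variant still matches any digits anywhere in the string, which is why the pattern below also anchors on the € sign.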

You might use this pattern with a capturing group:

(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)

Explanation

  • (?<!\S) Assert that what is on the left is not a non-whitespace character
  • ( Capture group
    • [0-9]+(?: [0-9]{1,3})? Match 1+ digits followed by an optional part that matches a space and 1-3 digits
  • ) Close capture group
  • € Match literally
  • (?!\S) Assert that what is on the right is not a non-whitespace character
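
As a rough illustration of what the two lookarounds do, the sketch below runs the pattern on three made-up strings (they are not from the question): one where the amount stands alone, one where it is glued to text on the left, and one where the € is glued to text on the right.

```python
import re

pattern = re.compile(r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)')

# Amount surrounded by whitespace or string boundaries: matched.
standalone = pattern.findall('500€ / mois')

# A non-whitespace character directly before the digits: (?<!\S) rejects it.
glued_left = pattern.findall('abc500€ / mois')

# A non-whitespace character directly after the €: (?!\S) rejects it.
glued_right = pattern.findall('500€/mois')

print(standalone)   # ['500']
print(glued_left)   # []
print(glued_right)  # []
```

If your data contains amounts written like the last two, the lookarounds would need to be relaxed accordingly.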

Your code might look like:

import re
sal = '41 000€ à 63 000€ / an'  # sample string from the question
# Raw string so that \S is passed to the regex engine unchanged
regex = r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)'
print(re.findall(regex, sal))  # ['41 000', '63 000']
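
Since the question mentions a list of strings, here is a minimal sketch of how the same pattern might be applied to each item and the results turned into integers. The helper name `extract_salaries` and every entry in `offers` except the first are assumptions made up for this example.

```python
import re

# Compile once and reuse for every string in the list.
SALARY_RE = re.compile(r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)')

def extract_salaries(text):
    """Return the salaries found in text as integers (hypothetical helper)."""
    # Drop the space used as a digit grouping symbol before converting.
    return [int(match.replace(' ', '')) for match in SALARY_RE.findall(text)]

# Only the first entry comes from the question; the others are invented.
offers = [
    '41 000€ à 63 000€ / an',
    '35 000€ / an',
    'Salaire non précisé',
]

results = [extract_salaries(offer) for offer in offers]
print(results)  # [[41000, 63000], [35000], []]
```

Strings without a salary give an empty list, so later code can check for that instead of dealing with empty strings.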
The fourth bird