
I have a paragraph/sentence from which I want to identify

  1. any series of numbers with 6 or more digits
  2. any series of numbers separated by a "-" (dash)

but I don't want to identify

  1. any numbers preceded by a $ (dollar sign)
  2. any series of numbers containing a , (comma)

How can I achieve this?

The regex I tried is: r'(?:\s|^)(\d-?(\s)?){6,}(?=[?\s]|$)' but it's not accurate.

I'm looking for these patterns inside a paragraph:

  • 123-456-789
  • 123-456
  • 123 456
  • 123 456 789

A match may also have a full stop (.) at the end. It should ignore the following patterns:

  • $123654
  • $ 123654
  • 12,4569
  • 123*123*7732
  • 123h434k5454
The fourth bird
Pokemon
  • Do you mean like this perhaps using a capturing group? `(?<!\S)(?:\$(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|6+))(?!\S)` https://regex101.com/r/VEVU8L/1 – The fourth bird Apr 20 '20 at 10:12
  • Maybe `(?<![\d$])(?<!\d,)(?:\d+(?:-\d+)+|\d{6,})(?![\d,])` will do? – Wiktor Stribiżew Apr 20 '20 at 10:18
  • With 6 digits instead of the number 6 `(?<!\S)(?:\$(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|\d{6,}))(?!\S)` https://regex101.com/r/YW6Md5/1 – The fourth bird Apr 20 '20 at 10:18
  • Please see [ask] a question with a [mcve] and include sample data and expected output. That would help identify which regex is best to use. I suspect you are using `Python`? – JvdV Apr 20 '20 at 10:20
  • Yes, it's almost accurate, but I also want to ignore any number preceded by a dollar sign plus a space, e.g. $123654, $ 123654 – Pokemon Apr 20 '20 at 10:22
  • You could match 0 or more whitespace chars after matching the dollar sign `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|\d{6,}))(?!\S)` https://regex101.com/r/R5C0bl/1 It matches all values, but the values you are looking for to keep are in group 1 (Highlighted in green on regex101) – The fourth bird Apr 20 '20 at 10:25
  • I've updated the question with samples. @Thefourthbird – Pokemon Apr 20 '20 at 10:26
  • It's for an image-to-text conversion program, and I want to ignore secure information like SSNs, a/c numbers, etc. The images can be handwritten documents, so I added that criterion anticipating a space between numbers in a handwritten doc. It can be skipped too. – Pokemon Apr 20 '20 at 10:30
  • Hi @Thefourthbird 123 456 123 456 789 can be avoided. If it is required in future, how can I add that condition? – Pokemon Apr 20 '20 at 10:32
  • You could change the quantifier from `{6,}` to `{3,}` https://regex101.com/r/pIHm5Q/1 – The fourth bird Apr 20 '20 at 10:39
  • `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+|\d{3,}))(?!\S)` looks cool, except that it's not ignoring amounts, the $ and $ patterns. I don't want them at all. – Pokemon Apr 20 '20 at 10:44
  • python @Thefourthbird – Pokemon Apr 20 '20 at 10:47
  • @PraphulNangeelil See an example in Python https://ideone.com/FzFLrF – The fourth bird Apr 20 '20 at 10:51
  • I've updated the question with one more condition that I missed, i.e. to allow a full stop at the end. – Pokemon Apr 20 '20 at 10:51
  • If it is for both the hyphenated digits and the digits with spaces, you can make the dot optional `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)` https://regex101.com/r/i4uxZz/1 – The fourth bird Apr 20 '20 at 10:57
  • Yes, got it, but how do I get only the green group of the result set? The $ and $ ones are in blue and I don't want to consider them. – Pokemon Apr 20 '20 at 11:01
  • It's not loading @Thefourthbird ... broken page – Pokemon Apr 20 '20 at 11:07
  • @PraphulNangeelil I have added an answer with 2 demo links. I will cleanup the comments a bit as it is a long list – The fourth bird Apr 20 '20 at 11:10
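
The lookaround-only pattern suggested by Wiktor Stribiżew in the comments avoids the capture group, so every match is already a wanted value. Here is a quick sketch for trying it against the samples from the question; the variable names are illustrative only. It was posted before the dollar+space and space-separated requirements came up, so it does not cover those two cases.

```python
import re

# Lookaround-only pattern from the comments: no capture group, so findall
# returns the full matches directly.
LOOKAROUND = re.compile(r"(?<![\d$])(?<!\d,)(?:\d+(?:-\d+)+|\d{6,})(?![\d,])")

samples = ["123-456-789", "123-456", "$123654", "12,4569",
           "123*123*7732", "123h434k5454", "$ 123654", "123 456"]

for sample in samples:
    print(repr(sample), "->", LOOKAROUND.findall(sample))
```

With this pattern `$ 123654` still yields `123654` (the digits are preceded by a space, not by `$`), and `123 456` yields nothing, which is why the answer below uses a capture group instead.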

1 Answer

You could match what you don't want and capture in a group what you want to keep.

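With re.findall, only the group 1 values are returned. A match of the unwanted dollar alternative leaves group 1 empty and shows up as an empty string in the result.

Afterwards you can filter out those empty strings.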

(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)

In parts

  • (?<!\S) Assert a whitespace boundary on the left
  • (?: Non capture group
    • \$\s* Match a dollar sign, 0+ whitespace chars
    • \d+(?:\,\d+)? Match 1+ digits with an optional comma digits part
    • | Or
    • ( Capture group 1
      • \d+ Match 1+ digits
      • (?:[ -]\d+)+\.? Repeat 1+ times a space or - followed by 1+ digits, then match an optional .
      • | Or
      • \d{3,} Match 3 or more digits (or use {6,} for 6 or more)
    • ) Close group 1
  • ) Close non capture group
  • (?!\S) Assert a whitespace boundary on the right
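
The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`. This is only a sketch of an alternative layout, not part of the original demos; the names `PATTERN` and `text` are made up for the example. In verbose mode whitespace in the pattern is ignored except inside a character class, so `[ -]` still matches a literal space.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries its comment.
PATTERN = re.compile(r"""
    (?<!\S)                     # whitespace boundary on the left
    (?:
        \$\s*\d+(?:,\d+)?       # unwanted: dollar amount, matched but not captured
      |
        (                       # group 1: the values to keep
            \d+(?:[ -]\d+)+\.?  # digits joined by space or dash, optional final dot
          |
            \d{3,}              # 3 or more digits (use {6,} for 6 or more)
        )
    )
    (?!\S)                      # whitespace boundary on the right
""", re.VERBOSE)

text = "123-456 $ 123654 12,4569 123456"
# findall returns group 1 per match; the dollar amount contributes '' and is dropped
print([value for value in PATTERN.findall(text) if value])  # ['123-456', '123456']
```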

Regex demo | Python demo | Another Python demo

For example

import re

regex = r"(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

test_str = ("123456\n"
    "1234567890\n"
    "12345\n\n"
    "12,123\n"
    # etc... (the remaining sample lines are not shown here)
    )

print(list(filter(None, re.findall(regex, test_str))))

Output (for the full test string, which is abbreviated above with etc...)

['123456', '1234567890', '12345', '1-2-3', '123-456-789', '123-456-789.', '123-456', '123 456', '123 456 789', '123 456 789.', '123 456 123 456 789', '123', '456', '123', '456', '789']
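
As a further check, here is a small sketch that runs the `{6,}` variant against the exact samples from the question. The helper name `extract_numbers` and the two sample lists are made up for this example; each sample is tested on its own so that neighbouring numbers are not joined by the space alternative.

```python
import re

# The answer's pattern with {6,} instead of {3,}, as the question asks for 6+ digits.
REGEX = re.compile(r"(?<!\S)(?:\$\s*\d+(?:,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{6,}))(?!\S)")

def extract_numbers(text):
    """Return only the group 1 values, dropping the '' left by dollar amounts."""
    return [value for value in REGEX.findall(text) if value]

wanted = ["123-456-789", "123-456", "123 456", "123 456 789."]
ignored = ["$123654", "$ 123654", "12,4569", "123*123*7732", "123h434k5454"]

for sample in wanted + ignored:
    print(repr(sample), "->", extract_numbers(sample))
```

Each wanted sample comes back as a single value and each ignored sample gives an empty list. Note that the optional `\.?` sits only in the space/dash alternative, so a bare run followed by a full stop, such as `123456.`, is not matched by this pattern.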
The fourth bird
  • My current requirement is to use the result in `if(re.match(regex, field.value.text.lower())):`. This returns all matches and groups. I cannot use re.findall() here; I want only the group 1 result from re.match(). – Pokemon Apr 21 '20 at 07:38
  • [re.match](https://docs.python.org/3/library/re.html#re.match) returns a [match object](https://docs.python.org/3/library/re.html#match-objects) from which you can get the [group](https://docs.python.org/3/library/re.html#re.Match.group) – The fourth bird Apr 21 '20 at 07:44
  • You mean, `re.match(regex, field.value.text.lower()).group` like this? – Pokemon Apr 21 '20 at 07:48
  • Like `.group(1)` See this page for an example https://stackoverflow.com/questions/2703029/why-isnt-the-regular-expressions-non-capturing-group-working – The fourth bird Apr 21 '20 at 07:53
  • gotcha! in our case, our matching results are in group(1), right? – Pokemon Apr 21 '20 at 07:56
  • That is correct. Note that re.match only matches at the beginning of the string (`If zero or more characters at the beginning of string` match the pattern). Otherwise you could look at [re.search](https://docs.python.org/3/library/re.html#re.search) – The fourth bird Apr 21 '20 at 07:59
  • How can I add a condition to include these types of patterns: `+123456565 + 12345675` – Pokemon Apr 21 '20 at 12:45
  • @PraphulNangeelil Hi there, sorry for the late response. You can do it like this https://regex101.com/r/yDzRU3/1 You can prepend an optional group with a `+` followed by an optional space `(?:\+ ?)?\d{3,}` – The fourth bird Apr 22 '20 at 08:13
  • If I want to add other characters in the future, say @, is `(?:\+ ?)?(?:\@ ?)?\d{3,}` correct? `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|(?:\+ ?)?(?:\@ ?)?\d{6,}))(?!\S)` – Pokemon Apr 22 '20 at 08:28
  • This part `(?:@ ?)?` will accept an `@` followed by an optional space. If you want to allow more characters, you could use a character class allowing any of the listed characters, like `(?:[@+] ?)?` – The fourth bird Apr 22 '20 at 08:34
  • But it fails when the pattern is +123-456-565 – Pokemon Apr 22 '20 at 08:46
  • If you want it for both alternatives, you could prepend it before the alternation https://regex101.com/r/73VCnN/1 Otherwise you have to add it to each alternative where you want to allow it. https://regex101.com/r/tpKoXT/1 Note that you are extending the original question, and accounting for all the side effects will make the pattern larger. – The fourth bird Apr 22 '20 at 08:53
  • Yes, the basic requirement has been satisfied by you, but these are the issues that I'm anticipating. That's why I clarified all my queries. Thanks a lot. You helped a lot. – Pokemon Apr 22 '20 at 08:57
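
The two follow-ups from these comments (reading group 1 without re.findall, and allowing an optional `+` or `@` prefix) can be sketched together as below. This is a sketch under assumptions: the prefix `(?:[@+] ?)?` is placed inside group 1 in front of both wanted alternatives, which may differ from the linked regex101 demos, and the helper name `first_wanted_number` is made up. It loops with `finditer` because the first match found by `re.search` may be a dollar amount, in which case `group(1)` is `None`.

```python
import re

# Assumption: the optional prefix (?:[@+] ?)? sits inside group 1, before both
# wanted alternatives, so "+123-456-565" and "+ 12345675" are both kept.
REGEX = re.compile(
    r"(?<!\S)(?:\$\s*\d+(?:,\d+)?|((?:[@+] ?)?(?:\d+(?:[ -]\d+)+\.?|\d{6,})))(?!\S)"
)

def first_wanted_number(text):
    """Return the first group 1 value found anywhere in text, or None."""
    for match in REGEX.finditer(text):
        # group(1) is None when the unwanted dollar alternative matched
        if match.group(1) is not None:
            return match.group(1)
    return None

print(REGEX.match("order 123-456"))            # None: match only tries position 0
print(REGEX.search("order 123-456").group(1))  # 123-456
print(first_wanted_number("paid $ 123654 for +123-456-565"))  # +123-456-565
```

In an `if` condition, `first_wanted_number(text) is not None` would then be true only when a wanted number is present, not when the text contains just a dollar amount.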