How to identify a series of numbers inside a paragraph

Question

I have a paragraph/sentence from which I want to identify:

any series of 6 or more digits
any series of numbers containing a "-" (dash)

but I don't want to identify:

any numbers preceded by a $ (dollar)
any series of numbers containing a , (comma)

How can I achieve this?

The regex I tried is: r'(?:\s|^)(\d-?(\s)?){6,}(?=[?\s]|$)' but it's not accurate.

I'm looking for these patterns inside a paragraph

123-456-789
123-456
123 456
123 456 789

A match may also end with a full stop (.), but the following patterns should be ignored:

$123654
$ 123654
12,4569
123*123*7732
123h434k5454

Do you mean like this perhaps using a capturing group? `(?<!\S)(?:\$(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|6+))(?!\S)` https://regex101.com/r/VEVU8L/1 — The fourth bird, Apr 20 '20 at 10:12
Maybe `(?<![\d$])(?<!\d,)(?:\d+(?:-\d+)+|\d{6,})(?![\d,])` will do? — Wiktor Stribiżew, Apr 20 '20 at 10:18
With 6 digits instead of the number 6 `(?<!\S)(?:\$(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|\d{6,}))(?!\S)` https://regex101.com/r/YW6Md5/1 — The fourth bird, Apr 20 '20 at 10:18
Please see [ask] a question with a [mcve] and include sample data and expected output. That would help identify which regex is best to use. I suspect you are using `Python`? — JvdV, Apr 20 '20 at 10:20
Yes, it's almost accurate, but I also want to ignore any number preceded by a dollar sign plus a space, e.g. $123654, $ 123654 — Pokemon, Apr 20 '20 at 10:22
You could match 0 or more whitespace chars after matching the dollar sign `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:-\d+)+|\d{6,}))(?!\S)` https://regex101.com/r/R5C0bl/1 It matches all values, but the values you are looking for to keep are in group 1 (Highlighted in green on regex101) — The fourth bird, Apr 20 '20 at 10:25
It's for an image-to-text conversion program, so I want to ignore secure information like SSNs, a/c numbers... The images can be handwritten documents, so I added that criterion anticipating a space being added between numbers in a handwritten doc. It can be dropped too. — Pokemon, Apr 20 '20 at 10:30
Hi @Thefourthbird 123 456 123 456 789 can be avoided. If it is required in future, how can I add that condition? — Pokemon, Apr 20 '20 at 10:32
You could change the quantifier from `{6,}` to `{3,}` https://regex101.com/r/pIHm5Q/1 — The fourth bird, Apr 20 '20 at 10:39
`(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+|\d{3,}))(?!\S)` looks cool, except that it's not ignoring amounts.. $ and $ patterns.. I don't want them at all.. — Pokemon, Apr 20 '20 at 10:44
@PraphulNangeelil See an example in Python https://ideone.com/FzFLrF — The fourth bird, Apr 20 '20 at 10:51
I've updated with one more condition that I missed. ie to incorporate a full stop at the end, — Pokemon, Apr 20 '20 at 10:51
If it is for both the hyphenated digits and the digits with spaces, you can make the dot optional `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)` https://regex101.com/r/i4uxZz/1 — The fourth bird, Apr 20 '20 at 10:57
ya..got it..but how to get the green color group of the result set only...cos $ and $ are in blue.. i dont want to consider them... — Pokemon, Apr 20 '20 at 11:01
@PraphulNangeelil I have added an answer with 2 demo links. I will cleanup the comments a bit as it is a long list — The fourth bird, Apr 20 '20 at 11:10

The fourth bird · Accepted Answer · 2020-04-20T11:18:42.093

You could match what you don't want and capture in a group what you want to keep.

Using re.findall the group 1 values will be returned.

Afterwards you might filter out the empty strings.

(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)

In parts

(?<!\S) Assert a whitespace boundary on the left
(?: Non capture group
- \$\s* Match a dollar sign, 0+ whitespace chars
- \d+(?:\,\d+)? Match 1+ digits with an optional comma digits part
- | Or
- ( Capture group 1
  - \d+ Match 1+ digits
  - (?:[ -]\d+)+\.? Repeat 1+ times a space or - followed by 1+ digits, then match an optional .
  - | Or
  - \d{3,} Match 3 or more digits (or use {6,} for 6 or more)
- ) Close group 1
) Close non capture group
(?!\S) Assert a whitespace boundary on the right
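
The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch: the regex is the one from the answer, while the layout, the inline comments, and the names PATTERN and find_numbers are illustrative assumptions. Note that in verbose mode the literal space inside the character class [ -] is still significant.

```python
import re

# Verbose spelling of the answer's pattern; the layout and names are illustrative.
PATTERN = re.compile(r"""
    (?<!\S)                    # whitespace boundary on the left
    (?:
        \$\s*\d+(?:,\d+)?      # unwanted: dollar amount, matched but not captured
      |
        (                      # group 1: the values to keep
            \d+(?:[ -]\d+)+\.? # digit runs joined by a space or hyphen, optional dot
          |
            \d{3,}             # a plain run of 3 or more digits
        )
    )
    (?!\S)                     # whitespace boundary on the right
""", re.VERBOSE)


def find_numbers(text):
    """Return the non-empty group 1 values found in text (hypothetical helper)."""
    return [value for value in PATTERN.findall(text) if value]


print(find_numbers("Ref 123-456-789 paid $ 123654 and $123654 for 12,4569 then 123 456."))
```

Keeping the unwanted alternative first means a dollar amount is consumed as a whole before the digit alternatives get a chance to match part of it.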

Regex demo | Python demo | Another Python demo

For example

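import re

# Unwanted "$" amounts are matched but not captured; wanted values land in group 1.
regex = r"(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

# Sample input; the remaining lines of the sample are elided here ("etc...").
test_str = ("123456\n"
    "1234567890\n"
    "12345\n\n"
    "12,123\n"
    "etc...")

# findall returns group 1 for every match; filter(None, ...) drops the empty
# strings produced whenever the "$" alternative matched.
print(list(filter(None, re.findall(regex, test_str))))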

Output (for the full sample input, including the elided lines)

['123456', '1234567890', '12345', '1-2-3', '123-456-789', '123-456-789.', '123-456', '123 456', '123 456 789', '123 456 789.', '123 456 123 456 789', '123', '456', '123', '456', '789']
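
To see where the empty strings come from, here is a small sketch that runs the same regex over the sample patterns listed in the question. The names REGEX, SAMPLES, raw and kept are illustrative, and the samples are joined with newlines purely for convenience.

```python
import re

REGEX = r"(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

# Sample patterns from the question: the first four are wanted, the rest are not.
SAMPLES = "\n".join([
    "123-456-789",
    "123-456",
    "123 456",
    "123 456 789.",
    "$123654",
    "$ 123654",
    "12,4569",
    "123*123*7732",
    "123h434k5454",
])

# With a single capture group, findall returns group 1 for each match,
# which is an empty string whenever the "$" alternative matched.
raw = re.findall(REGEX, SAMPLES)
kept = [value for value in raw if value]

print(raw)
print(kept)
```

The two dollar amounts show up in raw as empty strings and disappear from kept; the comma, asterisk and letter variants never match at all because of the whitespace boundaries.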

edited Apr 20 '20 at 11:18

answered Apr 20 '20 at 11:08

The fourth bird

my current requirement is to use the result in `if(re.match(regex, field.value.text.lower())):` this returns all matches and groups.. I can not use the re.findall() here..I want only the group1 result in the re.match() – Pokemon Apr 21 '20 at 07:38
[re.match](https://docs.python.org/3/library/re.html#re.match) returns a [match object](https://docs.python.org/3/library/re.html#match-objects) from which you can get the [group](https://docs.python.org/3/library/re.html#re.Match.group) – The fourth bird Apr 21 '20 at 07:44
You mean, `re.match(regex, field.value.text.lower()).group` like this? – Pokemon Apr 21 '20 at 07:48
Like `.group(1)` See this page for an example https://stackoverflow.com/questions/2703029/why-isnt-the-regular-expressions-non-capturing-group-working – The fourth bird Apr 21 '20 at 07:53
gotcha! in our case, our matching results are in group(1), right? – Pokemon Apr 21 '20 at 07:56
That is correct. Note that re.match only matches at the beginning of the string (`If zero or more characters at the beginning of string`...). To match anywhere in the string, look at [re.search](https://docs.python.org/3/library/re.html#re.search) – The fourth bird Apr 21 '20 at 07:59
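
A minimal sketch of the difference, assuming the regex from the answer. The helper first_wanted_number is hypothetical and not part of the original code; it is needed because a match of the "$" alternative leaves group 1 unset, so .group(1) returns None for it.

```python
import re

REGEX = r"(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"


def first_wanted_number(text):
    """Return the first group 1 value found anywhere in text, or None."""
    for match in re.finditer(REGEX, text):
        if match.group(1):  # skip matches of the uncaptured "$" alternative
            return match.group(1)
    return None


# re.match only tries at position 0; re.search scans the whole string.
print(re.match(REGEX, "id 123-456"))            # no match at the start
print(re.search(REGEX, "id 123-456").group(1))  # found later in the string
print(first_wanted_number("$ 123654 then 123-456"))
```

Both re.match and re.search return None when nothing matches, so check the result before calling .group(1) on it.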
How can I add the condition to add these type of patterns `+123456565 + 12345675` – Pokemon Apr 21 '20 at 12:45
@PraphulNangeelil Hi there, sorry for the late response. You can do it like this https://regex101.com/r/yDzRU3/1 You can prepend an optional group with a `+` followed by an optional space: `(?:\+ ?)?\d{3,}` – The fourth bird Apr 22 '20 at 08:13
if I want to add any other characters in the future, suppose I want to add @ -------- `(?:\+ ?)?(?:\@ ?)?\d{3,}` is this correct? `(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|(?:\+ ?)?(?:\@ ?)?\d{6,}))(?!\S)` – Pokemon Apr 22 '20 at 08:28
This part `(?:@ ?)?` will accept an `@` followed by an optional space. If you want to allow more characters, you could use a character class allowing any of the listed characters, like `(?:[@+] ?)?` – The fourth bird Apr 22 '20 at 08:34
but it fails when the pattern is +123-456-565 – Pokemon Apr 22 '20 at 08:46
If you want it for both alternatives, you could prepend it before the alternation https://regex101.com/r/73VCnN/1 Otherwise you have to add it to each alternative where you would allow it https://regex101.com/r/tpKoXT/1 Note that you are extending the original question, and accounting for all the side effects will make the pattern larger. – The fourth bird Apr 22 '20 at 08:53
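
One way to read "prepend it before the alternation" is sketched below. The pattern is an assumption, not a copy of the linked regex101 demos: it places the optional prefix `(?:[@+] ?)?` from the earlier comment at the start of group 1 and wraps the two digit alternatives in their own non-capturing group, using `{6,}` as in the asker's pattern. The names PREFIXED and text are illustrative.

```python
import re

# Assumed pattern: optional "@" or "+" prefix (with an optional space after it)
# placed once, in front of both digit alternatives inside group 1.
PREFIXED = (
    r"(?<!\S)(?:\$\s*\d+(?:,\d+)?"
    r"|((?:[@+] ?)?(?:\d+(?:[ -]\d+)+\.?|\d{6,})))(?!\S)"
)

text = "+123-456-565 + 12345675 @123456 $ 123654 12345"

matches = [value for value in re.findall(PREFIXED, text) if value]
print(matches)
```

Because the prefix sits inside group 1, it is part of the captured value; move it outside the capturing parentheses if only the digits should be kept.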
Yes, the basic requirement has been satisfied by you, but these are the issues that I'm anticipating. That's why I clarified all my queries. Thanks a lot. You helped a lot. – Pokemon Apr 22 '20 at 08:57
