6

I have a text similar to the text below. It contains four-digit numbers, each preceded by either digit- or whitespace and followed by ., ?, -digit, or whitespace.


I need to match all of the four-digit numbers in the first paragraph but none in the second, since those do not meet my conditions.

Lorem ipsum 3400-digit, sit amet 5000 consectetur adipisicing elit. Natus, explicabo 6700? Itaque iure ipsum laboriosam, ex nemo delectus iste quia cupiditate digit-9134? Iste nam digit-2456 at voluptate est 8456-digit? At excepturi quis voluptatibus 7500.

Lorem ipsum $5000 dolor sit amet consectetur adipisicing elit. Obcaecati tempora dolorum repellat reiciendis cum soluta deserunt ex voluptatibus, nam illum veniam £5550 quidem aperiam sequi, nostrum sed? Quidem eveniet maiores #5550 autem. https://codepen.io/pen/5000/3454


There are a few similar questions already on StackOverflow. I have gone through some of them (links below), but I still cannot do this. Before marking this question as a duplicate, please check whether your solution finds all occurrences of the four-digit numbers in the first paragraph but none in the second paragraph.

mahan

2 Answers

9

You may use the following pattern:

/(?:\bdigit-|\s|^)(\d{4})(?=[.?\s]|-digit\b|$)/gi

See the regex demo. You need to get all Group 1 values.

Details

  • (?:\bdigit-|\s|^) - either digit- (as a whole word), whitespace or start of string
  • (\d{4}) - Group 1: four digits
  • (?=[.?\s]|-digit\b|$) - immediately to the right, there must be a ., ?, whitespace, -digit (as a whole word) or end of string. NOTE: Without a lookahead, consecutive whitespace-separated matches will be left out.
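
The note about the lookahead can be checked with a minimal sketch. The `consuming` pattern below is a hypothetical variant (it is not part of the answer) that swaps the lookahead for a non-capturing group, so the trailing whitespace is consumed and can no longer act as the leading delimiter of the next number. The helper name `collectGroup1` is also just an illustration, and `matchAll` assumes an ES2020-capable engine.

```javascript
// Illustrative helper: collect Group 1 of every match.
function collectGroup1(rx, s) {
  return Array.from(s.matchAll(rx), m => m[1]);
}

var sample = "1234 5678 9012";

// Pattern from the answer: the right-hand context is asserted, not consumed.
var withLookahead = /(?:\bdigit-|\s|^)(\d{4})(?=[.?\s]|-digit\b|$)/gi;

// Hypothetical variant: the right-hand context is consumed by the match.
var consuming = /(?:\bdigit-|\s|^)(\d{4})(?:[.?\s]|-digit\b|$)/gi;

console.log(collectGroup1(withLookahead, sample)); // ["1234", "5678", "9012"]
console.log(collectGroup1(consuming, sample));     // ["1234", "9012"]
```

In the second call, the space after 1234 is already part of the first match, so 5678 has no leading whitespace left to match against.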

JS demo:

var strs = ["Lorem ipsum 3400-digit, sit amet 5000 consectetur adipisicing elit. Natus, explicabo 6700? Itaque iure  ipsum laboriosam, ex  nemo delectus iste quia cupiditate digit-9134? Iste nam digit-2456 at voluptate est 8456-digit? At excepturi quis voluptatibus 7500.", "Lorem ipsum $5000  dolor sit amet consectetur adipisicing elit. Obcaecati tempora dolorum repellat reiciendis cum soluta deserunt ex voluptatibus, nam illum veniam £5550 quidem aperiam sequi, nostrum sed? Quidem eveniet maiores #5550 autem. https://codepen.io/pen/5000/3454" ];
var rx = /(?:\bdigit-|\s|^)(\d{4})(?=[.?\s]|-digit\b|$)/gi;
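for (var s of strs) {
  var m, res = [];
  // exec() advances rx.lastIndex on each call and resets it to 0 once it returns null
  while ((m = rx.exec(s))) {
    res.push(m[1]); // keep only Group 1, the four digits
  }
  console.log(res);
}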
Wiktor Stribiżew
  • I have two more cases: the text starts and ends with a four-digit number. How can I include these two cases in the regex? Sorry, I found this after testing (: – mahan Jul 10 '18 at 13:02
  • @mahan What are these test cases? My current regex [handles these cases](https://regex101.com/r/5i5fHg/3). – Wiktor Stribiżew Jul 10 '18 at 13:04
  • A four-digit number follows `>` and is followed by `<`. How can I include these two cases in the regex? (: https://regex101.com/r/DoRRXd/1 – mahan Jul 10 '18 at 13:17
  • @mahan Like [`(?:\bdigit-|[\s>]|^)(\d{4})(?=[.?\s<]|-digit\b|$)`](https://regex101.com/r/DoRRXd/2) – Wiktor Stribiżew Jul 10 '18 at 13:19
  • I still have one more case. The number follows and is followed by ` `. Thanks a lot for your time (: – mahan Jul 10 '18 at 13:40
  • This works well. But isn't it possible to omit those prefixes and suffixes with regex? I want to match all four-digit numbers that have those prefixes and suffixes, but return only the number. – mahan Jul 10 '18 at 13:54
  • @mahan The lookbehind `(?<=...)` construct is not yet supported by all JS environments, but all those supporting ECMAScript 2018 will work with [`(?<=\bdigit-| |[\s>]|^)(\d{4})(?=[.?\s<]|-digit\b| |$)`](https://regex101.com/r/DoRRXd/4). If you only work with Chrome/Node.JS, it will work. – Wiktor Stribiżew Jul 10 '18 at 13:57
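
A minimal sketch of the lookbehind approach from the comment above, assuming an engine with ES2018 lookbehind support. The pattern is a simplified version of the one in the comment (only the digit-, whitespace, `>` and `<` delimiters are kept), and the sample text is made up for illustration. Because both sides are assertions, each match is the number itself and no capturing group is needed.

```javascript
// Lookbehind and lookahead only assert the context, so the match is just the digits.
var rxLookaround = /(?<=\bdigit-|[\s>]|^)\d{4}(?=[.?\s<]|-digit\b|$)/gi;

// Made-up sample: four numbers meet the conditions, $5000 and #5550 do not.
var text = "<b>1234</b> digit-2456 at 8456-digit, $5000 and #5550. 7500";

// match() with the g flag returns the full matches (or null when there are none).
var numbers = text.match(rxLookaround) || [];
console.log(numbers); // ["1234", "2456", "8456", "7500"]
```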
1
(\s|digit-)([0-9]{4})(?=-digit|\.|\?|\s)

You need an OR statement at the beginning and end of your query, with four digits in the middle.

To explain further:

  • (\s|digit-) - Group 1: either whitespace or digit- (this is a capturing group, so the prefix is consumed as part of the match)
  • ([0-9]{4}) - Group 2: a digit from 0 to 9, exactly four times
  • (?=-digit|\.|\?|\s) - positive lookahead: either -digit, a . (escaped because . is a special character in Regex), a question mark (also escaped for the same reason), or whitespace.
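
A short sketch of how this pattern could be used, assuming the numbers are read from Group 2 (the helper name `collectGroup2` and the sample strings are illustrative). The last call shows a limitation: the pattern has no start-of-string or end-of-string alternative, so a number at either edge of the text is not matched.

```javascript
var rx2 = /(\s|digit-)([0-9]{4})(?=-digit|\.|\?|\s)/g;

// Illustrative helper: collect Group 2 (the four digits) of every match.
function collectGroup2(s) {
  return Array.from(s.matchAll(rx2), m => m[2]);
}

console.log(collectGroup2("sit amet 5000 consectetur digit-9134? est 8456-digit"));
// ["5000", "9134", "8456"]

console.log(collectGroup2("veniam £5550 quidem #5550 autem"));
// []

// Numbers at the very start or end of the text have no prefix/suffix to match.
console.log(collectGroup2("7500 at the start and 3400"));
// []
```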

Play around on Regex101

James Whiteley
  • Without a lookahead, consecutive whitespace separated matches will be left out. You have an obligatory `\s` pattern in the first and third groups. – Wiktor Stribiżew Jul 10 '18 at 10:39