I am trying to put together a regex that extracts the surface areas from the strings below, excluding the values that are preceded by Japanese characters.

"110.94m2・129.24m2"; --> 110.94m2 and 129.24m2
"81.95m2(24.78坪)、うち2階車庫8.9m2" --> 81.95m2
"80.93m2(登記)" --> 80.93m2
"93.42m2・93.85m2(登記)" --> 93.42m2 and 93.85m2
"81.82m2(実測)" --> 81.82m2
"81.82m2(実測)、うち1階車庫7.82m2" --> 81.82m2
"90.11m2(実測)、うち1階車庫8.07m2" --> 90.11m2

So far I have put together the following regex, but it does not work in every case.

(?<![\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF])([0-9\.]*m2)

For example, the following string yields 81.95m2 and .9m2, but I need only 81.95m2.

"81.95m2(24.78坪)、うち2階車庫8.9m2"

Would you know how to extend the negative lookbehind so that it also excludes this case?

Thank you

Goul
    You have to add `\d.` into your exclusion class: `(?<![\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF\d.])([0-9\.]*m2)` (or `0-9\.`, if your dialect is forcing you to do that :P ) – Amadan Sep 06 '19 at 08:04

1 Answer

You need to cancel any match if preceded with a digit or digit + period.

Add (?<!\d)(?<!\d\.) after or before the first lookbehind:

(?<![\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF])(?<!\d)(?<!\d\.)(\d+(?:\.\d+)?m2)

The (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location and (?<!\d\.) fails when there is a digit and a dot right before.

The \d+(?:\.\d+)? is a more precise pattern to match numbers like 30 or 30.5678: 1 or more digits followed with an optional sequence of . and 1+ digits.
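
To see why both extra lookbehinds are needed, here is a minimal sketch that runs three variants against the problematic string from the question. The `JP` helper and the variable names are only illustrative (they are not part of the original patterns), and the snippet assumes an ES2018+ engine with lookbehind support.

```javascript
// Hypothetical helper: the Japanese character class from the question, kept as raw text.
const JP = String.raw`[\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF]`;
const sample = "81.95m2(24.78坪)、うち2階車庫8.9m2";

// 1. The pattern from the question: the number part may start with a dot.
const original = new RegExp(String.raw`(?<!${JP})([0-9\.]*m2)`, "g");
// 2. Stricter number pattern plus (?<!\d) only: a match can still start after "8.".
const digitOnly = new RegExp(String.raw`(?<!${JP})(?<!\d)(\d+(?:\.\d+)?m2)`, "g");
// 3. Both lookbehinds: a match cannot start inside a number at all.
const full = new RegExp(String.raw`(?<!${JP})(?<!\d)(?<!\d\.)(\d+(?:\.\d+)?m2)`, "g");

const results = {
  original: sample.match(original),   // [ '81.95m2', '.9m2' ]
  digitOnly: sample.match(digitOnly), // [ '81.95m2', '9m2' ]
  full: sample.match(full),           // [ '81.95m2' ]
};
console.log(results);
```

The second variant shows that `(?<!\d)` alone is not enough: the engine can still start a match at the `9` in `8.9m2`, because the character right before it is a dot, not a digit.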

NOTE that this regex relies on lookbehinds, which only work in ES2018+ JS environments (e.g. Chrome, Node). In older environments, you may capture an optional Japanese char into Group 1 and the number into Group 2, then check whether Group 1 matched: if it did, discard the match, otherwise grab Group 2.

The regex is

/([\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF])?(\d+(?:\.\d+)?m2)/g

See usage example below.

JS ES2018+ demo:

const lst = ["110.94m2・129.24m2", "81.95m2(24.78坪)、うち2階車庫8.9m2", "80.93m2(登記)", "93.42m2・93.85m2(登記)", "81.82m2(実測)" , "81.82m2(実測)、うち1階車庫7.82m2", "90.11m2(実測)、うち1階車庫8.07m2"];
const regex = /(?<![\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF])(?<!\d)(?<!\d\.)(\d+(?:\.\d+)?m2)/g;
lst.forEach( s => 
  console.log( s, '=>', s.match(regex) )
);
console.log("Another approach:");
lst.forEach( s => 
  console.log(s, '=>', s.match(/(?<![\p{L}\d]|\d\.)\d+(?:\.\d+)?m2/gu))
)

JS legacy ES versions:

var lst = ["110.94m2・129.24m2", "81.95m2(24.78坪)、うち2階車庫8.9m2", "80.93m2(登記)", "93.42m2・93.85m2(登記)", "81.82m2(実測)" , "81.82m2(実測)、うち1階車庫7.82m2", "90.11m2(実測)、うち1階車庫8.07m2"];
var regex = /([\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF])?(\d+(?:\.\d+)?m2)/g;
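for (var i = 0; i < lst.length; i++) {
  var m, res = [];
  // Walk through every match; Group 1 holds a preceding Japanese char, if any
  while ((m = regex.exec(lst[i])) !== null) {
    // Keep the number only when no Japanese char was captured right before it
    if (m[1] === undefined) {
      res.push(m[2]);
    }
  }
  console.log(lst[i], '=>', res);
}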

Variations

If you plan to match a float/int number followed by m2 only when it is preceded by whitespace or sits at the start of the string, use

(?<!\S)\d+(?:\.\d+)?m2
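
As a rough illustration of how strict this variation is, the sketch below applies it to one string from the question and to one made-up string (the second input and the variable names are assumptions for demonstration only). Note that the value after `・` is dropped, since `・` is not whitespace.

```javascript
// Only numbers at the start of the string or right after whitespace are accepted.
const wsOnly = /(?<!\S)\d+(?:\.\d+)?m2/g;

const wsResults = [
  "110.94m2・129.24m2",       // second value follows "・", a non-whitespace char
  "area: 81.95m2 (24.78坪)",  // made-up input: the value follows a space
].map(s => s.match(wsOnly));

console.log(wsResults); // [ [ '110.94m2' ], [ '81.95m2' ] ]
```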

If you plan to match it when not preceded with any letter use

  • (?<![\p{L}\d]|\d\.)\d+(?:\.\d+)?m2 (also works in JS ES2018+ environments: /(?<![\p{L}\d]|\d\.)\d+(?:\.\d+)?m2/gu)
  • (?<!\d\.)(?<![^\W_])\d+(?:\.\d+)?m2
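
One caveat when trying both patterns in JavaScript specifically: `\W` is ASCII-only there, so `[^\W_]` only covers ASCII letters and digits, while `\p{L}` with the `u` flag covers Japanese letters too. The sketch below, with a made-up sample string and illustrative variable names, shows the difference; in engines where `\W` is Unicode-aware the second pattern may behave like the first.

```javascript
const text = "うち2階車庫8.9m2 abc7m2 81.95m2";

// \p{L} (with the u flag) treats 庫 as a letter, so 8.9m2 is rejected.
const unicodeLetters = text.match(/(?<![\p{L}\d]|\d\.)\d+(?:\.\d+)?m2/gu);

// In JS, [^\W_] is just [A-Za-z0-9], so 庫 does not block the match.
const asciiLetters = text.match(/(?<!\d\.)(?<![^\W_])\d+(?:\.\d+)?m2/g);

console.log(unicodeLetters); // [ '81.95m2' ]
console.log(asciiLetters);   // [ '8.9m2', '81.95m2' ]
```

Both patterns reject `7m2` in `abc7m2`, because `c` is an ASCII letter.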

Note you may add \b word boundary after 2 to make sure there is a non-word char after it or end of string.
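
A small sketch of what the trailing \b changes, using the whitespace variation and a made-up input (the input string and variable names are assumptions for illustration):

```javascript
const input = "12m2 12m25 12m2abc 7.5m2";

// Without \b, "12m2" is also picked out of "12m25" and "12m2abc".
const withoutBoundary = input.match(/(?<!\S)\d+(?:\.\d+)?m2/g);

// With \b, m2 must be followed by a non-word char or the end of the string.
const withBoundary = input.match(/(?<!\S)\d+(?:\.\d+)?m2\b/g);

console.log(withoutBoundary); // [ '12m2', '12m2', '12m2', '7.5m2' ]
console.log(withBoundary);    // [ '12m2', '7.5m2' ]
```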

Wiktor Stribiżew