
I have a regex in PHP that replaces every character I don't want with a space:

/[^a-z0-9\p{L}]/siu

But there is one exception: I want to keep the dots in abbreviations.

Example of what I currently get:

F.B.I.Federal.Bureau.of.Investigation => 'F B I Federal Bureau of Investigation'

S.W.A.T.Team => 'S W A T Team'

Should be:

F.B.I.Federal.Bureau.of.Investigation => 'F.B.I. Federal Bureau of Investigation'

S.W.A.T.Team => 'S.W.A.T. Team'

PHP code:

$s = "F.B.I.Federal.Bureau.of.Investigation";
return preg_replace('/[^a-z0-9\p{L}]/siu', " ", $s);

So the logic is that it should check the second character of the first match and, if it is a '.', not replace it. I'm not sure whether this is possible with regex; if it isn't, I would appreciate an alternative in PHP.

Rumplin

1 Answer


Actually, there are many types of abbreviations, and as Jon Stirling says, no solution here works 100% of the time, since you would need a complete list of possible abbreviations to filter on. You may have a look at the fancy regex solution by @ndn and grab the part of the pattern that deals with abbreviations.

If you need to only handle patterns like in the question, you may consider using

'~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u'

or, if a single letter plus dot as in D.Word should also be treated as an abbreviation:

'~(\b(?:\p{Lu}\.)+)|[^0-9\p{L}]~u'

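and replace with '$1 ' (Group 1 followed by a space; when Group 1 did not take part in the match, $1 is empty and the unwanted char simply becomes a space). See the regex demo.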

Pattern details:

  • (\b(?:\p{Lu}\.)+) - Group 1 (later referenced with the $1 backreference): a word boundary, then 1 or more consecutive occurrences of any Unicode uppercase letter followed by a dot
  • | - or
  • [^0-9\p{L}] - any char that is neither an ASCII digit nor a Unicode letter.
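To make the difference between the two patterns concrete, here is a minimal PHP sketch. The function name keepAbbreviations and its $allowSingle switch are hypothetical, made up for illustration only; the patterns and the '$1 ' replacement are the ones shown above.

```php
<?php
// Hypothetical helper: the name and the $allowSingle flag are illustrative only.
function keepAbbreviations(string $s, bool $allowSingle = false): string
{
    // {2,} needs at least two "uppercase letter + dot" pairs; + also accepts one, such as "D."
    $quantifier = $allowSingle ? '+' : '{2,}';
    $pattern = '~(\b(?:\p{Lu}\.)' . $quantifier . ')|[^0-9\p{L}]~u';

    // If Group 1 did not take part in the match, $1 is empty and the match becomes one space.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to the input.
    return preg_replace($pattern, '$1 ', $s) ?? $s;
}

echo keepAbbreviations('F.B.I.Federal.Bureau.of.Investigation'), PHP_EOL; // F.B.I. Federal Bureau of Investigation
echo keepAbbreviations('S.W.A.T.Team'), PHP_EOL;                          // S.W.A.T. Team
echo keepAbbreviations('D.Word'), PHP_EOL;                                // D Word
echo keepAbbreviations('D.Word', true), PHP_EOL;                          // D. Word
```

Note that an abbreviation at the very end of the input leaves a trailing space, so you may want to wrap the call in trim().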

And here is a variant of a regex with @ndn's abbreviations:

'~\b((?:[Ee]tc|St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd|pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|\p{Lu}(?:\.\p{Lu})+)\.)|[^0-9\p{L}]~u'

See the regex demo.

If you do not want to remove -, ( and ), add them to the negated character class, i.e. replace [^0-9\p{L}] with [^0-9\p{L}()-]. The hyphen goes last so that it is treated as a literal char rather than a range operator.
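A short sketch of that adjustment, applied to the first pattern from above. The sample input is made up for illustration.

```php
<?php
// Same as the {2,} pattern, but "(", ")" and "-" are excluded from the chars to replace.
$pattern = '~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}()-]~u';

// Made-up sample input containing parentheses and a hyphen.
$input = 'F.B.I.(Federal-Bureau).of.Investigation';

$result = preg_replace($pattern, '$1 ', $input);
echo $result, PHP_EOL; // F.B.I. (Federal-Bureau) of Investigation
```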

Feel free to update by adding more abbreviations or enhance by shrinking the alternatives.
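If the list keeps growing, one option is to build the alternation from an array instead of editing the pattern by hand. This is only a sketch: the $abbreviations list below is a small hypothetical sample, not @ndn's full set, and it is matched case-sensitively.

```php
<?php
// Hypothetical sample list; extend it with whatever abbreviations you need.
$abbreviations = ['Dr', 'Mr', 'Mrs', 'Prof', 'vs', 'etc'];

// Escape each entry for use inside a pattern delimited by "~" and join them with "|".
$alternatives = implode('|', array_map(
    function (string $a): string {
        return preg_quote($a, '~');
    },
    $abbreviations
));

// Listed abbreviations or dotted uppercase sequences such as U.S.A, each followed by a dot.
$pattern = '~\b((?:' . $alternatives . '|\p{Lu}(?:\.\p{Lu})+)\.)|[^0-9\p{L}]~u';

$result = preg_replace($pattern, '$1 ', 'Dr.Smith.vs.U.S.A.Today');
echo $result, PHP_EOL; // Dr. Smith vs. U.S.A. Today
```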

Wiktor Stribiżew