None of the answers here worked exactly how I needed, so here's my solution along with some useful code.
Here's the link to Regex101:
https://regex101.com/r/17WThE
Note: I recommend generating the regex from the code further down, but if you just want the regex itself, here it is:
const poBoxRegex = /\b(((?:(P(?:ost(?:al)?)?)[ ./\-_]*(O(?:ff(?:ice)?)?)?[ ./\-_]*(b(?:o?x|in)?))|(?:(P(?:ost(?:al)?)?)[ ./\-_]*(O(?:ff(?:ice)?)?)[ ./\-_]*(b(?:o?x|in)?)?)|(?:(?<!(r(?:ural)?[ ./\-_]*r(?:oute)?)[ ./\-_]*((n\.?(?:o|um(?:ber)?)?\.?)?[ ./\-_]*#*[ ./\-_]*(\d+))?[ ./\-_]*)(box|bin)))[ ./\-_]*((n\.?(?:o|um(?:ber)?)?\.?)?[ ./\-_]*#*[ ./\-_]*(\d+)))\b/i;
Notes and features:
- Case insensitive (of course), and respects word boundaries.
- All words can be separated by nothing, spaces, periods, slashes, dashes, or underscores. (So "PO Box 123", "P.O. Box 123", "PO-Box 123", "PO_Box123", etc. all match.)
- Valid abbreviations are P/Post/Postal, O/Off/Office, and B/Box/Bx/Bin (as well as N/No/Num/Number, optionally with "#"). The only exception is that when "box" or "bin" is used by itself, it must be the full word (so "box" or "bin", not "b" or "bx").
- Valid combinations (not exact words) are Post Office Box, Post Box, Post Office, or just Box (all must have numbers).
So it doesn't match things like "Post 123", "Postal 123", "Office 123", or "Mailbox 123". "Mailbox" is not currently supported, and "post", "postal", or "office" alone is not enough.
- Anything with a Box + number will match, with the exception of rural route addresses (like "RR 1 Box 2", "R.R. 23 Box 45", or "Rural Route Box 123"). This means it also matches military APO/FPO addresses, like "PSC 123 Box 1234" (because of the Box + number).
- The PO Box must have a number after it in order to match. This helps prevent false positives. The number regex is very fuzzy, so it will match a lot of things, like "1", "#1", "No. # 1", "number 1", "Num 1", etc.
- One area for improvement could be adding matches for the words in other languages, such as "Postfach" (German).
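To make the fuzzy number matching from the list above concrete, here is a minimal sketch that tests the number part on its own. The pattern is copied from the number portion of the regex above (without the capture groups) and anchored with ^ and $ purely for the demo. The names NUMBER_PART and looksLikeBoxNumber are made up for this illustration.

```javascript
// Number part of the PO box regex, isolated and anchored for demonstration only.
// NUMBER_PART and looksLikeBoxNumber are hypothetical names, not part of the generator code.
const NUMBER_PART = /^(?:n\.?(?:o|um(?:ber)?)?\.?)?[ ./\-_]*#*[ ./\-_]*\d+$/i;

function looksLikeBoxNumber(text) {
  return NUMBER_PART.test(text);
}

// The first five samples contain digits and should pass; the last two have none.
const samples = ["1", "#1", "No. # 1", "number 1", "Num 1", "No.", "abc"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), looksLikeBoxNumber(sample));
}
```

Because the digits are the only required piece, everything before them (the "No."/"Num" text and the "#") is optional, which is why a bare "1" still counts.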
Generator code and util functions: (you should probably put this into its own file/module)
// Get the string version of each regex that we need to match PO boxes
const separator = /[ ./\-_]/.source; // Any of these characters can separate the PO box words (space, period, slash, dash, underscore)
const postalGroup = /(P(?:ost(?:al)?)?)/i.source; // "P", "Post", "Postal"
const officeGroup = /(O(?:ff(?:ice)?)?)/i.source; // "O", "Off", "Office"
const boxGroup = /(b(?:o?x|in)?)/i.source; // "b", "bx", "box", "bin"
const fullBoxGroup = /(box|bin)/i.source; // "box" or "bin" (full word only)
const ruralRouteGroup = /(r(?:ural)? r(?:oute)?)/i.source; // "rr", "r r", "ruralr", "rroute", "rural route", etc. (space will be replaced with full separators later)
const numTextGroup = /(n\.?(?:o|um(?:ber)?)?\.?)/i.source; // "n", "no", "num", "number", "n.", "n.o.", "no.", "num.", etc.
const digitsGroup = /(\d+)/.source; // 1 or more digits
// Spaces in below strings will be replaced later with the separator regex
// Construct number part of PO box regex string
const poBoxNumberGroup = `(${numTextGroup}? #* ${digitsGroup})`; // Match "number # 123", "no 123", "# 123", "123", etc.
// Construct PO box words regex string
const poBoxWordsOfficeOptional = `(?:${postalGroup} ${officeGroup}? ${boxGroup})`; // Match stuff like "post box", "post office box", or "P B" (where the office part is optional)
const poBoxWordsBoxOptional = `(?:${postalGroup} ${officeGroup} ${boxGroup}?)`; // Match stuff like "post office", "post office box", or "PO" (where the box part is optional)
const ruralRouteNegativeGroup = /(?!r(?:ural)? r(?:oute)? (\d+)?)/i.source; // Negative lookahead for "rr 12", "r r 1", "ruralr 1", "rroute 1", "rural route 1", etc. (space will be replaced with full separators later). Currently unused: poBoxWordsBoxOnly uses a negative lookbehind instead.
const poBoxWordsBoxOnly = `(?:(?<!${ruralRouteGroup} ${poBoxNumberGroup}? )${fullBoxGroup})`; // Match just "box" or "bin" (unless preceded by a rural route)
const poBoxWordsGroup = `(${poBoxWordsOfficeOptional}|${poBoxWordsBoxOptional}|${poBoxWordsBoxOnly})`; // Match either of the above
// Construct the whole PO box regex string (with word boundaries, but still excluding the separators)
const wholePOBoxGroup = `\\b(${poBoxWordsGroup} ${poBoxNumberGroup})\\b`;
// Construct the final PO box regex
const PO_BOX_REGEX = new RegExp(
  wholePOBoxGroup.replaceAll(" ", `${separator}*`), // Replace all spaces with regex matching any number of the separators
  "i" // No "g" flag: a global regex keeps state in lastIndex between test() calls, which gives wrong results for a shared constant
);
// Check if the address is a PO box
export function hasPOBox(addressString) {
  return PO_BOX_REGEX.test(addressString);
}
// Get the PO box part from the address (first match)
export function getPOBox(addressString) {
  const match = addressString.match(PO_BOX_REGEX);
  return match ? match[1] : null;
}
// Get the PO box part from the address (all matches)
export function getAllPOBoxes(addressString) {
  const poBoxRegex = new RegExp(PO_BOX_REGEX, "gi"); // Make global so we can get all matches
  const matches = addressString.matchAll(poBoxRegex);
  return Array.from(matches, (match) => match[1]);
}
// Given a string containing PO boxes, standardize all PO boxes found in the string like "PO Box 123"
export function standardizePOBoxes(text) {
  if (!text) return text; // Nothing to standardize in an empty or missing string
  const poBoxRegex = new RegExp(PO_BOX_REGEX, "gi"); // Make global so every match is replaced
  // Replace each match in place, so a later match (e.g. "Box 1") can never hit text inside an earlier replacement (e.g. "PO Box 12")
  return text.replace(poBoxRegex, (poBox) => "PO Box " + poBox.match(/\d+/)[0]);
}
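The comment about the global flag on PO_BOX_REGEX deserves a quick illustration. A regex with the "g" flag remembers where its last match ended (in lastIndex), so calling test() repeatedly on a shared constant can flip between true and false for the same input. Here is a small sketch, using a simplified stand-in pattern rather than the real regex:

```javascript
// Simplified stand-in pattern (NOT the real PO box regex), used only to show the "g" flag behavior.
const globalRegex = /box \d+/gi;
const plainRegex = /box \d+/i;

const address = "Box 12";

// With "g", test() resumes from lastIndex, so the second call on the same string fails.
const globalFirst = globalRegex.test(address);  // true, lastIndex is now 6
const globalSecond = globalRegex.test(address); // false, search started at index 6; lastIndex resets to 0

// Without "g", every call starts from the beginning of the string.
const plainFirst = plainRegex.test(address);
const plainSecond = plainRegex.test(address);

console.log(globalFirst, globalSecond); // true false
console.log(plainFirst, plainSecond);   // true true
```

This is why the constant is built without "g", and a separate global copy is created only inside the function that needs all matches.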
The resulting regex is the PO_BOX_REGEX constant. Generating the regex this way makes it easier to see how it works and to make appropriate modifications.
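As a scaled-down illustration of the same generation technique (write the pattern as a readable template with spaces, then swap every space for the separator pattern), here is a sketch that only handles "PO Box" followed by digits. The names miniTemplate and MINI_REGEX are made up for this example and are not part of the generator code.

```javascript
// Scaled-down version of the generator idea; miniTemplate and MINI_REGEX are hypothetical names.
const separator = /[ ./\-_]/.source; // Same separator characters as the full generator
const miniTemplate = "\\b(p o box #* (\\d+))\\b"; // Spaces mark where separators may appear

// Replace each space in the template with "zero or more separators"
const MINI_REGEX = new RegExp(miniTemplate.replaceAll(" ", `${separator}*`), "i");

console.log(MINI_REGEX.test("P.O. Box 123")); // true
console.log(MINI_REGEX.test("po-box #12"));   // true
console.log(MINI_REGEX.test("PO Boxes"));     // false (no number)
```

The template stays short enough to read at a glance, while the expanded regex handles every separator variant. That is the main reason to edit the template pieces rather than the final regex.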
Finally, below is a list of matches and non-matches I'm using in my tests:
const matches = [
"BIN 12",
"BOX 12",
"Box 12",
"Box-12",
"Box12",
"P O Box123",
"P. O. Box 13",
"P.O 123",
"P.O Box 12",
"P.O. Box 123",
"P.O. Box 123",
"P.O.B 123",
"P.O.B. 123",
"P.o box 12",
"PO 123",
"PO Box 1",
"PO Box 123",
"PO Box N 12",
"PO Box No 12",
"PO Box No. 12",
"PO Box Number 12",
"PO Box #12",
"PO Box # 12",
"PO-Box 12",
"PO. Box 12",
"POB 12",
"POB 123",
"POB1",
"POBOX123",
"Po Box 12",
"Post Box #123",
"Post Box 123",
"Post Office Box 123",
"Postal Box 12",
"box #1",
"box #123",
"box # 123",
"box 123",
"p box 12",
"p o box 12",
"p o box num 12",
"p off b 12",
"p off box 12",
"p office b 12",
"p office box 12",
"p-o box 12",
"p-o-b-1",
"p-o-box-12",
"p.o.bin 12",
"p.o box 12",
"p.o. box 12",
"p.o. box. 12",
"p.o.-box12",
"p.o.b.#123",
"p.o.b.12",
"p/o box 123",
"p/o-box 12",
"pb 12",
"po #123",
"po 12",
"po bin 123",
"po box 123",
"po box no #23",
"po box no 123",
"po box n.o. 12",
"po box num 12",
"po box num #123",
"po box number #12",
"po box number 12",
"po bx 12",
"po n 12",
"po num123",
"po-box-12",
"pob #12",
"pob num12",
"pob number12",
"pobox123",
"post o. b. 12",
"post o box 123",
"post o bx 12",
"post off b 12",
"post off. box 12",
"post office b 12",
"post office box 123",
"postal box 123",
"postal office box 12",
"postal-off-box 12",
];
const nonMatches = [
// Don't match unit numbers
"B1",
"#1",
"# 1",
"N1",
"Number 1",
"Num 1",
"No 1",
// Rural route addresses
"RR 12 Box 1020",
"RR Box 12",
"RR #12 Box 12",
"r.r. 12 box 12",
"Rural Route 12 Box 12",
"Rural Route # 12 Box 12",
"Rural Route Box 12",
"rural-route 12 box 12",
// Other street addresses
"1223 P Street #1",
"123 ABox #1", // Respect word boundary
"123 Box, #1", // This would match, but the comma prevents it
"123 Bx #1", // Must use the full word "box" when by itself
"123 Expo #1",
"123 Harpo Box Street #1",
"123 Poblano Lane #1",
"123 Poor Box Road #1",
"123 Some Street",
"123 box canyon rd",
"2 Expo Blvd #1",
"34 PO Road #1",
"777 Post Oak Blvd",
"Army Post 1",
"Box Hill",
"Controller's Office",
"Office 123",
"Post 123", // Perhaps this should match?
"Postal 123", // Perhaps this should match?
"Post Office Road 123", // "Road" before the number prevents this from matching
"Post Rd. #1",
"Postal Road #1",
"The Postal Road",
"pollo St.",
];
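One way to run lists like these is a small loop that reports every case where the check disagrees with the expectation. This is only a sketch: findFailures is a hypothetical helper, and the stand-in predicate below is deliberately simplistic so the snippet runs on its own. In practice you would pass hasPOBox along with the full matches and nonMatches arrays.

```javascript
// Hypothetical helper: returns the cases where predicate(input) differs from the expected result.
function findFailures(predicate, shouldMatch, shouldNotMatch) {
  const failures = [];
  for (const input of shouldMatch) {
    if (!predicate(input)) failures.push({ input, expected: true });
  }
  for (const input of shouldNotMatch) {
    if (predicate(input)) failures.push({ input, expected: false });
  }
  return failures;
}

// Stand-in predicate for demonstration only (NOT the real PO box check).
const standInHasPOBox = (address) => /\bpo box \d+\b/i.test(address);

const failures = findFailures(
  standInHasPOBox,
  ["PO Box 123", "po box 1"],
  ["123 Some Street", "Box Hill"]
);
console.log(failures.length === 0 ? "All cases passed" : failures);
```

Collecting the failures instead of stopping at the first one makes it easier to see the full effect of a change to one of the regex pieces.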
This isn't necessarily a complete list, and it may be redundant in some ways. I copied a lot of these from the comments on this question's replies, so hopefully it covers a wide range of cases.
I hope someone finds this useful, and let me know if you have any suggestions or find any problems with it.