
I'm considering a regex to restrict punctuation in city names (worldwide). What would be a fairly inclusive whitelist of these?

I'm thinking:

 (space)
. period
- hyphen
' apostrophe

Also thinking maybe comma or slash but I don't have any examples. Are there others?

User
  • I think that's all of them, with the exception of city names that contain special characters, like Hōnaunau or San José. But most city databases and sites that I know of do not use those characters; they strip them out and use the US-alphabet equivalent, such as Honaunau or San Jose. – Bryan Elliott Feb 26 '14 at 03:48
  • In the US, all city names (according to USPS and, I believe, the USCB) are stored as ASCII in official databases. In the world, you'd have to account for accent folding. (Maybe consider stripping disallowed punctuation instead of restricting input... but in some languages, the accented characters do affect meaning and spelling.) – Matt Feb 26 '14 at 04:08
  • @Matt: you're right, I think stripping is the better option. I'm mostly concerned with punctuation rather than letters, as I plan to allow extended Latin characters. – User Feb 26 '14 at 07:54
  • Just don't forget about Westward Ho! http://en.wikipedia.org/wiki/Westward_Ho! – Al Mills Feb 26 '14 at 08:57
  • _"I'm considering a regex to restrict punctuation in city names (worldwide)"_ - **why?** – Peter Boughton Mar 01 '14 at 15:22
  • I'm with @PeterBoughton, why can't you just properly escape the input? – Dan Bechard Mar 10 '14 at 20:21
  • Please don't forget [Saint-Louis-du-Ha!-Ha!](http://en.wikipedia.org/wiki/Saint-Louis-du-Ha!_Ha!) :/ – Robin Mar 26 '14 at 12:16
  • @PeterBoughton: I'm considering a whitelist of characters for a city field because I'm not sure I want to allow Chinese/Japanese/Arabic characters in a city field (because I cannot read them). Since I must specify allowed characters for this, I also need to specify punctuation. – User Mar 26 '14 at 14:07
  • @User And if you can't read its name, the city doesn't exist or doesn't have inhabitants? – Chris Wesseling Mar 30 '14 at 23:26

2 Answers


Your list (space, period, hyphen, apostrophe) covers most of the punctuation found in city names. Be aware, though, that the ASCII apostrophe (U+0027) may not be the code point a user actually submits: keyboards and input methods often produce the typographic apostrophe (U+2019) instead.
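One way to cope with the apostrophe issue is to fold the common look-alikes to the ASCII apostrophe before validating. The following is a minimal sketch, not a definitive list: the helper name `normalize_apostrophes` and the choice of variants are assumptions made for illustration.

```python
# Hypothetical helper: fold common apostrophe look-alikes to ASCII U+0027.
# The set of variants below is an assumption, not an exhaustive list.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (typographic apostrophe)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
    "\u0060": "'",  # GRAVE ACCENT
    "\u00B4": "'",  # ACUTE ACCENT
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(name: str) -> str:
    """Return name with apostrophe look-alikes replaced by U+0027."""
    return name.translate(_TABLE)
```

For example, `normalize_apostrophes("O\u2019Fallon")` yields the same string as typing `O'Fallon` with the ASCII key, so a whitelist only needs to list one apostrophe.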

Once you know the encoding of the submitted text and have decoded it, you can test whether a character falls in the Unicode General Punctuation block:

/\p{InGeneral_Punctuation}/

If you are limiting letters to Latin Extended-A, match that block instead:

/\p{InLatin_Extended-A}/
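The `\p{In...}` block syntax above is not available in every regex engine; Python's built-in `re` module, for instance, does not support it. A rough equivalent spells out the block ranges by hand. In the sketch below, the names `CITY_RE`, `has_general_punctuation`, and `is_allowed_city` are hypothetical, and the punctuation whitelist (space, period, hyphen, apostrophe, plus `!` from the examples in the comments) is an assumption rather than a complete answer.

```python
import re

# Unicode block ranges written out explicitly, since `re` lacks \p{In...}.
GENERAL_PUNCTUATION = "\u2000-\u206F"
LATIN_EXTENDED_A = "\u0100-\u017F"
# Letters of Latin-1 Supplement (skipping the multiplication/division signs).
LATIN_1_LETTERS = "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

_PUNCT_RE = re.compile("[" + GENERAL_PUNCTUATION + "]")

# Assumed whitelist: Latin letters, space, period, apostrophe (ASCII and
# U+2019), exclamation mark, and hyphen (kept last so it stays literal).
CITY_RE = re.compile(
    "[A-Za-z" + LATIN_1_LETTERS + LATIN_EXTENDED_A + " .'\u2019!-]+"
)


def has_general_punctuation(text: str) -> bool:
    """True if any character lies in the General Punctuation block."""
    return _PUNCT_RE.search(text) is not None


def is_allowed_city(name: str) -> bool:
    """True if every character of name is in the assumed whitelist."""
    return CITY_RE.fullmatch(name) is not None
```

This assumes the input is in composed (NFC) form, so that a letter such as é arrives as one code point rather than a base letter plus a combining mark.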

Also, ask yourself: What are the consequences of someone putting a funny character into my city name? Is that worse than the consequences of someone not being able to enter their correct address, if I exclude too much?

heptadecagram

USPS standard address formatting calls for stripping all special characters except 'necessary' hyphens and dashes used in the primary and/or secondary street address lines and hyphens in the ZIP.

So if an address is:

John O'Toole
456 N 4-1/2 St
San José, CA 99999-4545

The post office prefers envelopes be labeled:

John O Toole
456 N 4 1/2 St
San Jose CA 99999-4545
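A rough approximation of that cleanup, for the name and city lines only, might look like the sketch below. It is not the USPS algorithm: the function name `usps_style` is hypothetical, and it keeps every hyphen and slash rather than deciding which ones are 'necessary', so it would leave `4-1/2` untouched.

```python
import re
import unicodedata


def usps_style(line: str) -> str:
    """Fold accents and replace most punctuation with spaces (a sketch)."""
    # Decompose accented letters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", line)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep ASCII letters, digits, spaces, slashes, and hyphens; everything
    # else (apostrophes, commas, letters with no decomposition such as ø)
    # becomes a space.
    cleaned = re.sub(r"[^A-Za-z0-9 /-]", " ", folded)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(cleaned.split())
```

Applied to `"John O'Toole"` and `"San José, CA 99999-4545"`, this produces the `John O Toole` and `San Jose CA 99999-4545` lines shown above.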
arghtype