Vim: Parsing address fields from all around the globe

Question

Intro

This post is long, but I consider it thorough. I hope it will be helpful to others who have to deal with addresses, while also teaching complex VIM regexes. Thank you for your time.

Worldwide addresses:

Users from America, Canada and a few other countries are offered 5 fields on a form, which are then displayed in a comma-delimited format that I need to dissect further. Ideally, the comma-separated content looks like:

Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip

where zip can be either a series of just numbers (US) or numbers and letters (Canada).

Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:

Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip

Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:

Foreign Name of Building, A street name, A City, ,zip, Country

The ",," field is usually empty, as most non-US countries are not segmented into states. And, yes, the same "additional commas" described above happen here too.

Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country

Parsing Strategy:

A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field, then you should be able to place the country, zip, State (if not empty ",,"), City and Street into their respective positions - which are the most important fields to get right. Anything beyond those sections could be lumped together in the first one or two lines as descriptions of the address (i.e. building, name, suite, cross streets, etc.). For example:

Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters

Last section has a digit (therefore a US or Canadian address)
There are a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:

Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters

While there might be a discrepancy about where "111 Street" or "Suite 101" goes (Address1 or Address2), this at least gets the zip, state, city and address(es) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
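The right-to-left picking described above maps naturally onto Vim lists, which accept negative indexes. What follows is only a minimal sketch of that idea, not a full parser: the variable names are made up for illustration, and the zip `92123` is an assumed stand-in for the `Digits&Letters` placeholder.

```vim
" Sample record; the zip 92123 stands in for the Digits&Letters placeholder.
let record = 'Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, 92123'

" Split on commas plus any following whitespace; the final 1 keeps
" empty sections instead of dropping them.
let parts = split(record, ',\s*', 1)

" Negative indexes count from the end of the list, so the fixed
" right-hand fields can be read without knowing the section count.
let zip = parts[-1]
let state = parts[-2]
let town = parts[-3]
let header = parts[0]

" Everything between the header and the town belongs to the address lines.
let address = parts[1:-4]

echo [header, address, town, state, zip]
```

For this record, `address` ends up holding the two sections `111 Street` and `suite 101`, which can then be assigned to Address1 and Address2.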

Under this approach, foreign addresses get parsed like:

Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country

Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 8 in the example
8th section = country, 7th = zip, 6th = state (mostly blank on foreign addresses), 5th = City, 4th = address1, 3rd and 2nd = address2, 1st = header
We knew to use two address fields because the example had 8 sections and foreign addresses have a base of 6 sections. Any sections above the base are added to the address2 field. If there are 3 sections above the base section count, then they are appended to each other inside the address2 field.
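The base-count arithmetic above can be sketched in Vim script as follows. This is an illustration under assumptions rather than a finished routine: the function name `CountSections` and the keys of the dictionary it returns are hypothetical, and the sample line is one of the practice addresses listed further down.

```vim
" Hypothetical helper: report how many sections a record has, which base
" layout applies (5 for US/Canada, 6 when a country is present), and how
" many user-added sections exceed that base.
function! CountSections(line)
    " The final 1 keeps empty sections such as a blank state field.
    let sections = split(a:line, ',\s*', 1)
    if sections[-1] !~ '\d'
        " No digit in the last section: treat it as a country.
        let base = 6
    else
        " Digits in the last section: treat it as a zip or postal code.
        let base = 5
    endif
    return {'total': len(sections), 'base': base, 'extra': len(sections) - base}
endfunction

echo CountSections('Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia')
```

For that line the sketch reports 8 sections against a base of 6, so 2 extra sections would have to be folded into the address fields.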

Coding

In this approach using VIM, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections when I am not sure how many sections exist?

Example Addresses

Here are some practice addresses (US and foreign) if you are inclined to help:

City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984

MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502

SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444

123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344

Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6

Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore

Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia

Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa

Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa

The code which generates ambiguous CSV is wrong. Any field containing the delimiter should be in double quotes. (Literal double quotes should be doubled. There are a gazillion incompatible variants, though.) — tripleee, Jan 07 '12 at 19:16
The CSV code is what it is, I cannot change it. I have outlined an approach that deals with the ambiguity inherent with working from left to right, and how it can be handled appropriately working right to left - hence the long post. The "coding" questions are what I am seeking to get answered. — Ricalsin, Jan 07 '12 at 20:02
This doesn't seem like a good task for a regex and will probably be harder to do in VIM than in some simple program like Python. A regex is good when you know the specific layout of your data. In this case you need fault tolerance. I'd probably just make a program to scan each line and test various hypotheses, probably made easier if you compare countries to a known country list, and split the list into good / probably good, and eyeball the probably good ones for errors. @tripleee is right that the data is poorly formatted. — Andy Ray, Jan 07 '12 at 20:45
If the question is only how to break the comma-separated record into list of fields, then something like `split(@a, ',\s*', 1)` should do the job. — ib., Jan 08 '12 at 00:53
@Andy Ray In vim you can "scan" each line by capturing it in a register and making it available for parsing (mentioned above). I have the option of breaking a very large list down by country, but every method has its own effort and I have reason for trying to accomplish it in one list, if possible. Are you saying there's no way to count the number of commas in an address that has been captured in a vim register and perform the parsing strategy I outlined above? — Ricalsin, Jan 08 '12 at 00:55
@ib. Thank God you're here! :) Yes, that's essentially it, but I need to know the number created by that split, because that then would enable me to group the sections appropriately (solving the user-infused comma into the text field). — Ricalsin, Jan 08 '12 at 01:30
@ib. Part two would be an example of how to use an if/else statement on the command line, performing either a US/Canadian style storage or the other; which would include subtracting the total number of sections by the base (5 or 6) section per type. — Ricalsin, Jan 08 '12 at 01:35
@Ricalsin: Thank you! :-) I don't get what you mean by "the number created by that split". Could you please clarify this term? Is that the number of comma-separated fields in a record? — ib., Jan 08 '12 at 04:05
@ib. If the LAST submatch does not have a digit in it, then it must be a Country. That would mean there are six comma-separated content fields, where the 5th element would be the zip, 4th the State/Province...and so on (as the post states). Any content fields beyond six would be user-infused and therefore put them...and so on (as the post states). If the LAST content field HAS a digit in it, then there are only five content fields (plus whatever commas the user infused). So, the answer is "yes", but I'm trying to clarify "why" I need it. — Ricalsin, Jan 08 '12 at 04:23
Ah, yes, it (mostly) does. I have written a draft parsing and formatting code that tries to follow the rules described in the statement. (I'm not sure that the parsing function conforms to them through and through.) Feel free to take it as a starting point for your address transformation script. — ib., Jan 08 '12 at 05:39
Sorry, I don't think any systematic approach will work really :) For a list of counterexamples, see this: https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/ — Paweł Balawender, Apr 10 '23 at 19:57

score 1 · Answer 1 · edited May 23 '17 at 10:24

Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately you deal with other addresses.

There are probably other related questions: see the tag street-address for some of them.

edited May 23 '17 at 10:24

Community

answered Jan 08 '12 at 00:22

Jonathan Leffler

Thank you for these links. My concern was not about creating postal addresses for mailing letters; hence the formatting of them is not the critical issue. As I mentioned up top, it's about capturing the "most important sections" of (right to left) Country, Zip, State/Province, City and Address, and then the Header. I made mention that the Address1 and Address2 could be allowed to be "grouped" because it's not critical in our db usage and we are not actually mailing anything. – Ricalsin Jan 08 '12 at 01:01

ib. · Accepted Answer · 2012-01-08T05:44:26.057

1

The following code is a draft-quality Vim script (hopefully) implementing the address parsing routine described in the question.

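function! ParseAddress(line)
    " Split on commas (and any following whitespace), keeping empty fields.
    let r = split(a:line, ',\s*', 1)
    " A last field without any digit is taken to be a country name.
    let hadcountry = r[-1] !~ '\d'
    let a = {}
    let a.country = hadcountry ? r[-1] : ''
    " Drop the country field, if present, so that the zip is always last.
    let r = r[:-1-hadcountry]
    let a.zip = r[-1]
    let a.state = r[-2]
    let a.city = r[-3]
    let a.header = r[0]
    " Number of address fields between the header and the city.
    let nleft = len(r) - 4
    if hadcountry
        " Foreign: the field next to the city is address1, the rest address2.
        let a.address1 = r[-4]
        let a.address2 = join(r[1:nleft-1], ', ')
    else
        " US/Canada: the field after the header is address1, the rest address2.
        let a.address1 = r[1]
        let a.address2 = join(r[2:nleft], ', ')
    endif
    return a
endfunction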

function! FormatAddress(a)
    let t = map([
    \   ['Header', 'header'],
    \   ['Address 1', 'address1'],
    \   ['Address 2', 'address2'],
    \   ['Town', 'city'],
    \   ['State/Province', 'state'],
    \   ['Country', 'country'],
    \   ['Zip', 'zip']],
    \   'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
    \       '? v:val[0] . ": " . a:a[v:val[1]] : ""')
    return join(filter(t, '!empty(v:val)'), '; ')
endfunction

The command below can be used to test the above parsing routines.

:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))

(One can provide a range to the :global command to run it over fewer test address lines.)
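Because the `:global` command above overwrites each matching line, it can help to preview the result first. The following is a hedged sketch and not part of the original answer: `PreviewLines` is a hypothetical helper that applies any function to the non-blank lines of a range and returns the results without changing the buffer.

```vim
" Hypothetical helper (not part of the answer above): apply Fn to every
" line in [first, last] that contains a word character and return the
" results as a list, leaving the buffer untouched.
function! PreviewLines(first, last, Fn)
    let out = []
    for text in getline(a:first, a:last)
        if text =~ '\w'
            call add(out, call(a:Fn, [text]))
        endif
    endfor
    return out
endfunction

" Demo with a built-in function so that the sketch runs on its own.
echo PreviewLines(1, line('$'), function('toupper'))
```

With the answer's functions loaded, the same helper could be called as `:echo PreviewLines(1, 3, {l -> FormatAddress(ParseAddress(l))})` (lambdas need Vim 8 or later). Once the output looks right, prefixing the `:g` command with a range such as `1,3` restricts the real run to those lines.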

edited Jan 08 '12 at 05:44

answered Jan 08 '12 at 05:34

ib.

Oh my, Mr. @ib. I'm studying... But I know you well enough to know this is going to work and I'll be able to implement it. In the meantime; PLEASE send me the name of a restaurant, a bar, a movie house - something! Part of the joy of web design is working with people around the world, and I want to know I'm buying someone like you a drink, a dinner, a movie - something. I'll do it through gift certificates. Let me have the joy, please. – Ricalsin Jan 08 '12 at 05:44
Where would I read to discover more about the v:val[1] thing you are using? a is a list and a:a is the function's list attribute placed in the has_key(), but I am obviously missing the v:val[] understanding. I see most everything else. @ib. Thanks. You're very good. – Ricalsin Jan 08 '12 at 07:03
PERFECTION! Amazing. You're incredible. I'll take credit for a proper strategy. :) I really appreciate learning from your coding abilities. Thank you so much! Question: Both times I have asked a difficult parsing question I have received initial advice on NOT using VIM but rather a language such as AWK or Perl. I'm sure it can be done with either, but I assume it stems from people not knowing all that can be done with VIM(?). Can you comment on the difference between the three? – Ricalsin Jan 09 '12 at 02:50
@Ricalsin: `v:val` is a special variable that holds the value of a list item when `map()`-expression is executed. See `:help map()`. – ib. Jan 11 '12 at 13:20
Yes, I discovered. Can you check this question out: http://stackoverflow.com/questions/8821858/vim-delete-pattern-if-submatch1-is-empty :) – Ricalsin Jan 11 '12 at 15:47
@Ricalsin: Regarding your question asking for a comparison of the Vim script, AWK, and Perl languages: it is too broad for a comment, I'm afraid. In my opinion, for simple text parsing and transformations they are equally handy (for one who knows them all). However, as a general programming language, Perl has much more potential for writing large complicated programs, while both Vim script and AWK are more purpose-built languages (although both can be used for almost anything, being Turing-complete). – ib. Jan 12 '12 at 01:29
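
As a small illustration of the `v:val` mechanism discussed in the comments above: when `map()` or `filter()` is given a string expression, Vim evaluates it once per list item with `v:val` set to that item. The list contents and variable names below are invented for the example.

```vim
" Sample data, invented for this illustration.
let fields = ['Header', '', 'Zip']

" map() evaluates the string once per item, with v:val bound to the item.
" copy() is used because map() and filter() change the list in place.
let labelled = map(copy(fields), 'empty(v:val) ? "" : v:val . ":"')

" filter() keeps only the items for which the expression is true.
let kept = filter(copy(labelled), '!empty(v:val)')

echo labelled
echo kept
```

This is the same pattern `FormatAddress` relies on: it first maps each label/key pair to a formatted string or an empty string, then filters out the empty ones before joining.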
