3

In one table I have a full address which I have to split into the city and postcode to store in other columns.

Here is sample address:

address= 'Marco Polo street 8a, 44000 Vienna'

Some rows may contain multiple addresses but I have to keep only the last portion after the last comma (44000 Vienna).

So the final result should be:

 post_office = 44000 
 city = Vienna

Tried so far: https://stackoverflow.com/a/13430958/2119941

with nested substrings, like this:

city = SUBSTRING(SUBSTRING(address, CHARINDEX(' ', address, 0) + 1, LEN(address)), CHARINDEX(',', address, 0) + 1, LEN(address))

and

post_office = SUBSTRING(SUBSTRING(address, CHARINDEX(',', address, 0) + 1, LEN(address)), 0, CHARINDEX(' ', address, 0))

But this only partially works since cities can also have multiple words like:

city = '44300 Vienna Old Town'

The rule for my data is: find the last comma, take all numbers after the comma and define that as post_office and then, after the first space after post_office until rest of the string it should be defined as the city.

I'm running Microsoft SQL Server 2017 (v14.0.3391.2 - X64)

buddemat
  • 4,552
  • 14
  • 29
  • 49
Hrvoje
  • 13,566
  • 7
  • 90
  • 104
  • 3
    Don't store such data if you care about the parts. It's not only that this breaks the most fundamental rule - SQL, the language is terrible at string manipulation. Understanding an address isn't trivial either. *Nothing* says that all addresses will have that form. Some will have a comma after the post code, others will have a different order. Clean up the data *before* loading it – Panagiotis Kanavos Nov 25 '21 at 12:22
  • 3
    *"some rows may contain multiple addresses"* sounds even more flawed; you're storing the full address in a single column, instead of separate columns for each part, *and* you're sometimes storing multiple address in a single row too. It's "doubly" denormalised. – Thom A Nov 25 '21 at 12:24
  • `cities can also have multiple words` precisely. Even with a regular expression you wouldn't be able to just capture all the parts. Someone may enter `Vienna 44000` too. Or include both the suburb and city name, eg `District 2, Vienna` or `Leopoldstadt, Vienna` – Panagiotis Kanavos Nov 25 '21 at 12:25
  • What happens when someone has stored it as `..., 44000, Vienna`? Or someone has accidentally used O instead of 0? – Stu Nov 25 '21 at 12:27
  • 1
    What you want isn't trivial. It's a common problem in data science, and part of both geocoding and natural language processing. There are libraries like libportal an [PyPortal](https://github.com/openvenues/pypostal) that parse addresses, as well as cloud services. SQL Server 2017 can call Python scripts [as part of its machine learning features](https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-create-script?view=sql-server-ver15) to process data which means you can install PyPortal to parse addresses – Panagiotis Kanavos Nov 25 '21 at 12:37
  • Do you handle addresses outside of Austria? In Lithuania, for example, postcodes start with the `LT-` prefix. In Ireland "Eircode" postcodes are seven alphanumeric letters. – AlwaysLearning Nov 25 '21 at 12:45
  • Thank you all for your comments. I agree, I'll try to clean up data in Python before loading it. It's my preferred way od doing things anyways. But having db with 10 million rows of such data I was hoping for sql solution. Data is not appearing in any other format and I've listed all possible anomalies. – Hrvoje Nov 25 '21 at 13:18
  • 1
    If you are sure about the format, my answer should work for you. – buddemat Nov 25 '21 at 13:45

2 Answers2

5

Well, I guess the comments below your question have already gone in detail into why the way the data is stored in is a bad idea and in what ways what you attempt is likely to fail if your assumptions about the addresses are not 100% accurate.

That being said, you might be in the process of trying to remedy that and if you are sure that the rule you defined always fits, you can do what you want.

To achieve that, use the REVERSE() function twice to find the last occurrence of a substring. This way, you can separate everything starting at the last comma. Use the TRIM() function on that to remove surrounding whitespace that may occur, e.g. if there are spaces directly following the comma.

If you then are sure there are only numeric characters and spaces in the postcode and the postcode follows immediately, you can use PATINDEX together with LEFT() and RIGHT() to find the first non-numeric character, similar to this SO post.

Full solution: For the input data

address
Marco Polo street 8a, 44000 Vienna
Marco Polo street 8a, 44000 Vienna Old Town
Marco Polo street 8a,44000 Vienna
Marco Polo street 8a, Marco Polo street 8b, 44 000 Vienna Old Town
Marco Polo street 8a, 44 00 0 Vienna

the statement

with tmp as (
select TRIM(REVERSE(SUBSTRING(REVERSE(address),0,CHARINDEX(',',REVERSE(address))))) lastpart
  from addresstable
  ) 
  select LEFT(lastpart, PATINDEX('%[^0-9 ]%', lastpart) -1) as postcode,
         RIGHT(lastpart, len(lastpart) - PATINDEX('%[^0-9 ]%', lastpart) + 1) as city
  from tmp

produces the result

postcode city
44000 Vienna
44000 Vienna Old Town
44000 Vienna
44 000 Vienna Old Town
44 00 0 Vienna

Notes:

  • if you don't want to allow spaces, remove them from the expression in PATINDEX(), i.e. '%[^0-9]%'
  • if you want your postcode to be of type int, use TRY_CONVERT() together with REPLACE() and NULLIF(). See this SO answer and dbfiddle below. Be advised that this removes possible leading zeroes that may belong to the postcode.

Alternatively, if you are sure that whatever string follows after the last comma until the next whitespace positively is the postcode and you want to allow alphabetic characters in the postcode, your attempt from before already does it. This will however break if your postcode has spaces.

with tmp as (
select TRIM(REVERSE(SUBSTRING(REVERSE(address),0,CHARINDEX(',',REVERSE(address))))) lastpart
  from addresstable
  ) 
  select SUBSTRING(lastpart,0,CHARINDEX(' ',lastpart,0)+1) as postcode,
         SUBSTRING(lastpart,CHARINDEX(' ',lastpart,0)+1,LEN(lastpart)) as city
  from tmp

See this db<>fiddle for a comparison of the two, including a few examples of addresses that will not work with one of the two.

buddemat
  • 4,552
  • 14
  • 29
  • 49
0
address= 'Marco Polo street 8a, 44000 Vienna' ;   Var 
 newadd=Split.address(,);  Var pincode =split.newadd( );   pincode[0]// is pincode 44000      pincode[1]// vienna
KumarHarsh
  • 5,046
  • 1
  • 18
  • 22
Pritam
  • 61
  • 7
  • 2
    Thank you for contributing an answer. Would you kindly edit your answer to include an explanation of your code? That will help future readers better understand what is going on, and especially those members of the community who are new to the language and struggling to understand the concepts. That's especially important when there's already an accepted answer that's been validated by the community. Under what conditions might your approach be preferred? Are you taking advantage of new capabilities? – Jeremy Caney Nov 26 '21 at 00:58