0

I'm trying to find a way of splitting the address column into two in order to separate the street number, street name, city and state.

SELECT distinct
            OWNER,
            ADDRESS,

FROM vw_EMPLOYEE

This gives me a result like this

OWNER   ADDRESS           
JOHN    10 HILL MIAMI FL 33139
TONY    4545 BIRD AVE DORAL FL 33134

What would it be the best option to split this column in 5 different columns. I was reading about functions or sub-substring but I'm a little confused because I'm still learning about SQL

sergio089
  • 143
  • 1
  • 1
  • 8
  • Unless there are distinct delimiters, parsing addresses like that is very difficult. Why does one address have a state, but the other doesn't? – Barmar Mar 02 '17 at 22:12
  • You should revise your table schema to have separate columns for each part of the address. Then you'll probably need humans to re-enter the data properly. – Barmar Mar 02 '17 at 22:13
  • 1
    [*T-SQL (Transact Structured Query Language) is the extension of SQL functionality supported by Sybase ASE and Microsoft SQL Server. **Do not use this tag for MySQL, PostgreSql, Oracle(Pl/SQL) related queries.***](http://stackoverflow.com/tags/tsql/info) – shmosel Mar 02 '17 at 22:20
  • This is a slippery slope. Take a peek at http://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Mar 02 '17 at 23:21
  • Barman, sorry about that, the address in the second record does have a state, I'll edit the post. Shmosel, ok. I was thinking about using substring but because of the amount of characters in the street number is different, that might not work right because it needs the amount of characters? SUBSTRING( string, start_position, length ) – sergio089 Mar 03 '17 at 13:55

6 Answers6

1

You could use this udf and also detailed in this answer to your question to split the address field by space and then work with each element. However there are problems with this:

  • Your addresses are not consistent (some have state others dont) - it's messy.
  • The udf and sql do not allow you to easily loop through each word in the address regardless of the number of words in the address (the UDF requires you specify the "word number").

If I had this problem I would use a scripting language rather than SQL to do the splitting and looping through each word in the address field. Following a process like this:

  1. Source a list of zipcodes into a table.

  2. Source a list of states and state abbreviations into a table.

  3. Source a list of cities into a table.

  4. Source a list of road names and abbreviations into a table (eg. road, rd, street, st).

  5. Split the address by space, then looping from the last address word to the first:

    a. If the word exists in the zipcode table, then that is the zipcode. If a zipcode has already been identified for the current address then error.

    b. Else, if the word exists in the states table, then that is the state. If a state has already been identified for the current address then error.

    c. Else, if the word exists in the cities names table, then that is the city. If a city has already been identified for the current address then error.

    d. Else, if the word exists in the road names table, then the current word and all previous words would be assumed to be the address line.

    e. Outside of the if/else block outlined in a-d, if zipcode, state and city are all identified for the current address, then assume that the remaining words form address line 1.

The above process will give you a start to a problematic cleansing journey.

Some additional issues you may need to address:

  • If addresses happen to have "," seperating field elements you may need to strip them. eg: 10 HILL MIAMI, FL, 33139
Community
  • 1
  • 1
joshweir
  • 5,427
  • 3
  • 39
  • 59
1

you can create this function:

CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       CHAR_LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, "");

query:

select distinct
            OWNER,prod_id,SPLIT_STR(ADDRESS,' ',1) as pos1 ,SPLIT_STR(ADDRESS,' ',2) as pos2,
SPLIT_STR(ADDRESS,' ',3) as pos3 ,SPLIT_STR(ADDRESS,' ',4) as pos4,SPLIT_STR(ADDRESS,' ',5) as pos5
from vw_EMPLOYEE
Danilo Bustos
  • 1,083
  • 1
  • 7
  • 9
0

Splitting is lot easier when delimeters are there. With the help of SUBSTRING and PATINDEX.

Since your ADDRESS column doesn't have anything like that, you can go for this. This will first convert into xml and then from there you can achieve the value in 5 different columns.

;WITH Split_Address (Owner,Address, xmladd)
 AS
(
    SELECT Owner,
    Address,
    CONVERT(XML,'<Adrdress><Adrdress>'  
    + REPLACE(Adrdress,' ', '</Adrdress><Adrdress>') + '</Adrdress></Adrdress>') AS xmladd
     FROM (select Adrdress,Owner from vw_EMPLOYEE)as t
)

 SELECT Value,      
    xmlname.value('/Adrdress[1]/Adrdress[1]','varchar(100)') AS Adrdress1,    
    xmlname.value('/Adrdress[1]/Adrdress[2]','varchar(100)') AS Adrdress2,
    xmlname.value('/Adrdress[1]/Adrdress[3]','varchar(100)') AS Adrdress3,
    xmlname.value('/Adrdress[1]/Adrdress[4]','varchar(100)') AS Adrdress4,
    xmlname.value('/Adrdress[1]/Adrdress[5]','varchar(100)') AS Adrdress5
 FROM Split_Address 
Kinchit Dalwani
  • 398
  • 4
  • 19
0

I think there is no guarantee that you always have 5 section. Address like 123 Santa Rosa Blvd. CA 91035 has six part.

In general

  1. Everything before first space, can be the NUMBER
  2. Everything before last space can be ZIPCODE
  3. After removing these 2 parts, you will have the last 2 character as STATE
  4. The rest will be STREET.

Hope this lead you to an algorithm.

FLICKER
  • 6,439
  • 4
  • 45
  • 75
0

Thank you all for the feedback. All the responses have helped me and lead me to a correct direction. I think the best would be the function creation.

sergio089
  • 143
  • 1
  • 1
  • 8
0

Just adding some more feedback...

You should be prepared to parse the address in order of POSSIBLE values. Here is the possible breakdown

**

PreCoor Coor PostCoor StreetNumber Direction StreetName StreetSuffix PostDirection UnitType UnitNumber

**

PreCoor, PostCoor, Direction and PostDirection are DIRECTIONS (N,S,E or W)

Strip address from any commas and periods first

You'll have to regexp or loop your way through the possible Unit names (Unit, #, APT etc...) and the possibility that the following Unit_Number can be numeric [0-9] or letter based [a-z] or a combination of both. They also can be a length of 1-5 char commonly.

You'll also nee to check for possible PO BOX in address. If that's found all else is doesn't matter.

Normally, City, State and Zip are provided in their own fields.

Example on spliting the UNIT mySQL

REPLACE(REPLACE(REPLACE(ADDR,'.',''),'-',''),',','') REGEXP '( UNIT| APT| #| SUITE| STE| LOT) ?([0-9]{1,5}$|[a-z]{1,4}$)';

php

preg_match('/( UNIT| APT| #| SUITE| STE| LOT) ?([a-z]{1,3}|[0-9]{1,3})/i', $input_line, $output_array);

Chris Lambrou
  • 356
  • 2
  • 7