0

There are a lot of posts on how to split a single column value into multiple columns, however I am having trouble standardizing data in the column to be able to use those techniques. What is the best approach for standardizing the below example?

For example:

**emails**
first_email,second_email
third_email; fourth_email
fifth_email ;sixth_email
seventh_email, eight_email
ninth_email.tenth_email

My expected output from standardization is:

**emails**
first_email, second_email
third_email, fourth_email
fifth_email, sixth_email
seventh_email, eight_email
ninth_email, tenth_email
Jen
  • 635
  • 5
  • 9
  • 4
    If possible, I would highly suggest changing your database design to store only a single email in each row. Another issue is that periods are allowed characters in email addresses. So splitting on them wouldn't necessarily produce your expected results. – sgeddes Aug 10 '18 at 17:03
  • Sadly, this is data from a database that I do not have control over and so have to restructure to fit into a better database design. – Jen Aug 10 '18 at 17:06
  • 1
    A comma or semicolon are valid separators for multiple email, but a period is not (since it is commonly used as `first.last@domain.com`). Is `.` really a separator between two email addresses?, or is `.` part of a valid email address? – James L. Aug 10 '18 at 17:11
  • 1
    I would suggest using regular expressions to parse the list. I doubt it's possible without regexp. – Ildar Akhmetov Aug 10 '18 at 17:12
  • Thanks, and yeah there are some emails with . as a separator. – Jen Aug 10 '18 at 17:13
  • In your sample above, are there only two emails per row in the single column? Or are all 10 in the same tuple (“cell”)? – Stuart Ainsworth Aug 10 '18 at 17:14
  • 1
    Are there other email where the `.` is part of the email address, or are all `.` used to separate two email addresses? – James L. Aug 10 '18 at 17:16
  • There are up to 5 emails per row in the column, they are not in one cell. There are other emails where the `.` is part of the email address. – Jen Aug 10 '18 at 17:19
  • 1
    How many records are we talking about? Can you manually sort through a few stragglers if necessary? The `,` and `;` are no problem, the `.` on the other hand will cause you a lot of grief. – Dave C Aug 10 '18 at 17:21
  • If the `.` wasn't used to sometimes be part of an email and other times separate two email, it would be easy. I'll post an example. – James L. Aug 10 '18 at 17:25
  • That might have to be the approach there are around 26 thousand records. I am probably going to have to manually sort through the stragglers. – Jen Aug 10 '18 at 17:26
  • If at least there was space after the separating `.`! But it should be easy to find those with `email LIKE '%@%@%'`. This finds records with 2 or more `@` once you have separated the begnin cases. – Olivier Jacot-Descombes Aug 10 '18 at 17:26

1 Answers1

1

This shows how you can normalize the formatting, but because the . character is sometimes used in an email address and other times used to separate email addresses, you should probably not include the replace for the ., and just work through those manually.

You can used a nested replace() to eliminate the spaces first, then replace the valid email address separators with ,[space].

declare @str varchar(max) = 'first_email,second_email
third_email; fourth_email
fifth_email ;sixth_email
seventh_email, eight_email
ninth_email.tenth_email'

select replace(replace(replace(replace(
         @str, ' ',   ''), -- a space is not valid in an email
               ',', ', '),
               ';', ', '),
               '.', ', '
       )

Which returns:

first_email, second_email
third_email, fourth_email
fifth_email, sixth_email
seventh_email, eight_email
ninth_email, tenth_email
James L.
  • 9,384
  • 5
  • 38
  • 77