0

I need to export a single column from a MySQL database which shows each entry only once. So in the following table:

id      author(s)               content
________________________________________
1       Bill, Sara, Mike        foo1             
1       Sara                    foo2
2       Bill, Sara, Mike        foo3   
2       Sara                    foo4
3       David                   foo5      
3       Mike                    foo5      

I would need to export a list of authors as "Bill, Sara, Mike, Susan" so that each name is shown only once.

Thanks!

UPDATE: I realize this may not be possible, so I am going to have to accept an exported list which simply eliminates any exact duplicates within the column, so the output would be as such: Bill, Sara, Mike, Sara, David, Mike Any help forming this query would be appreciated.

Thanks again!

Dean Smith
  • 120
  • 2
  • 7
  • 2
    See [Is storing a delimited list in a database column really that bad?](http://stackoverflow.com/a/3653574) – eggyal Jul 11 '13 at 21:36
  • Eek. If you still have control over your database structure, you should make authors a separate table, with a combined primary key/a foreign key. Then you'd use `select distinct authorName from authors`, for instance. Yeah, I'm not sure MySQL can separate your list. A programming language might be able to, but... But even if it can, I think you should enforce the rule: 1 field = 1 value. – Ariane Jul 11 '13 at 21:48
  • The data came from a third party, so unfortunately we're stuck with the authors combined in one column. – Dean Smith Jul 11 '13 at 21:58

1 Answers1

1

It's possible to get the resultset, but I'd really only do this to convert this to another table, with one row per author. I wouldn't want to run queries like this from application code.

The SUBSTRING_INDEX function can be used to extract the first, secpond, et al. author from the list, e.g.

SUBSTRING_INDEX(SUBSTRING_INDEX(authors,',', 1 ),',',-1) AS author1
SUBSTRING_INDEX(SUBSTRING_INDEX(authors,',', 2 ),',',-1) AS author2
SUBSTRING_INDEX(SUBSTRING_INDEX(authors,',', 3 ),',',-1) AS author3

But this gets messy at the end, because you get the last author when you retrieve beyond the length of the list.

So, you can either count the number of commas, with a rather ugly expression:

LENGTH(authors)-LENGTH(REPLACE(authors,',','')) AS count_commas

But it's just as easy to append a trailing comma, and then convert empty strings to NULL So, replace authors with:

CONCAT(authors,',')

And then wrap that in TRIM and NULLIF functions.

NULLIF(TRIM( foo ),'')

Then, you can write a query that gets the first author from each row, another query that gets the second author from each row (identical to the first query, just change the '1' to a '2', the third author, etc. up to the maximum number of authors in a column value. Combine all those queries together with UNION operations (this will eliminate the duplicates for you.)

So, this query:

SELECT NULLIF(TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(a.authors,','),',',1),',',-1)),'') AS author
  FROM unfortunately_designed_table a
 UNION
SELECT NULLIF(TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(a.authors,','),',',2),',',-1)),'')
  FROM unfortunately_designed_table a
 UNION
SELECT NULLIF(TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(a.authors,','),',',3),',',-1)),'')
  FROM unfortunately_designed_table a
 UNION
SELECT NULLIF(TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(a.authors,','),',',4),',',-1)),'')
  FROM unfortunately_designed_table a

this will return a resultset of unique author names (and undoubtedly a NULL). That's only getting the first four authors in the list, you'd need to extend that to get the fifth, sixth, etc.

You can get the maximum count of entries in that column by finding the maximum number of commas, and adding 1

SELECT MAX(LENGTH(a.authors)-LENGTH(REPLACE(a.authors,',','')))+1 AS max_count
  FROM unfortunately_designed_table a

That lets you know how far you need to extend the query above to get all of the author values (at the particular point in time you run the query... nothing prevents someone from adding another author to the list within a column at a later time.


After all the work to get distinct author values on separate rows, you'd probably want to leave them in a list like that. It's easier to work with.

But, of course, it's also possible to convert that resultset back into a comma delimited list, though the size of the string returned is limited by max_allowed_packet session variable (iirc).

To get it back as a single row, with a comma separated list, take that whole mess of a query from above, and wrap it in parens as an line view, give it an alias, and use the GROUP_CONCAT function.

SELECT GROUP_CONCAT(d.author ORDER BY d.author) AS distinct_authors
  FROM (
         ...
       ) d
 WHERE d.author IS NOT NULL

If you think all of these expressions are ugly, and there should be an easier way to do this, unfortunately (aside from writing procedural code), there really isn't. The relational database is designed to handle information in tuples (rows), with each row representing one entity. Stuffing multiple entities or values into a single column goes against relational design. As such, SQL does not provide a simple way to extract values from a string into separate tuples, which is why the code to do this is so messy.

spencer7593
  • 106,611
  • 15
  • 112
  • 140