
I have a legacy product that I have to maintain. One of its tables is somewhat similar to the following example:

DECLARE @t TABLE
(
 id INT,
 DATA NVARCHAR(30)
);

INSERT  INTO @t
        SELECT  1,
                'name: Jim Ey'
        UNION ALL
        SELECT  2,
                'age: 43'
        UNION ALL
        SELECT  3,
                '----------------'
        UNION ALL
        SELECT  4,
                'name: Johnson Dom'
        UNION ALL
        SELECT  5,
                'age: 34'
        UNION ALL
        SELECT  6,
                '----------------'
        UNION ALL
        SELECT  7,
                'name: Jason Thwe'
        UNION ALL
        SELECT  8,
                'age: 22';

SELECT  *
FROM    @t;
/*
You will get the following result
id          DATA
----------- ------------------------------
1           name: Jim Ey
2           age: 43
3           ----------------
4           name: Johnson Dom
5           age: 34
6           ----------------
7           name: Jason Thwe
8           age: 22
*/

Now I want to get the information in the following form:

name           age
-------------- --------
Jim Ey         43
Johnson Dom    34
Jason Thwe     22

What's the easiest way to do this? Thanks.

Just a learner
  • This appears to be a case of [dynamic database](http://stackoverflow.com/a/68138/458741). I understand you didn't design it, but can you change it? Normalization would make your life a lot easier. – Ben Apr 22 '12 at 08:58
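
For context, here is a sketch of the normalized design that comment is hinting at (the table and column names are assumptions, not part of the original system):

CREATE TABLE dbo.Person
(
    PersonId INT IDENTITY( 1, 1 ) PRIMARY KEY,
    Name     NVARCHAR( 100 ) NOT NULL,
    Age      INT NOT NULL
);

INSERT INTO dbo.Person( Name, Age )
VALUES ( N'Jim Ey', 43 ), ( N'Johnson Dom', 34 ), ( N'Jason Thwe', 22 );

-- the desired result then falls out of a plain SELECT:
SELECT Name, Age FROM dbo.Person;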

3 Answers


Out of (slightly morbid) curiosity I tried to come up with a means of transforming the exact input data you have provided.

Far better, of course, would be to properly structure the original data. With a legacy system, this may not be possible, but an ETL process could be created to bring this information into an intermediate location so that an ugly query like this would not need to be run in real time.
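
A minimal sketch of such an intermediate location (the name dbo.PersonStaging is an assumption): a scheduled ETL job would run one of the transform queries below and persist the output here, so real-time reads never touch the raw key/value rows.

-- assumed staging target for the ETL hand-off
CREATE TABLE dbo.PersonStaging
(
    Name NVARCHAR( 100 ) NOT NULL,
    Age  INT NULL
);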

Example #1

This example assumes that all IDs are consistent and sequential (otherwise, an additional ROW_NUMBER() column or a new identity column would need to be used to guarantee correct remainder operations on ID).

SELECT
    Name = REPLACE( Name, 'name: ', '' ),
    Age = REPLACE( Age, 'age: ', '' )
FROM
(
    SELECT
        Name = T2.Data,
        Age = T1.Data,
        RowNumber = ROW_NUMBER() OVER( ORDER BY T1.Id ASC )

    FROM @t T1 
        INNER JOIN @t T2 ON T1.id = T2.id +1 -- offset by one to combine two rows
    WHERE T1.id % 3 != 0 -- skip delimiter records
) Q1
 -- skip every other record (minus delimiters, which have already been stripped)
WHERE RowNumber % 2 != 0

Example #2: No Dependency on Sequential IDs

This is a more practical example because the actual ID values do not matter, only the row sequence.

DECLARE @NumberedData TABLE( RowNumber INT, Data VARCHAR( 100 ) );

INSERT @NumberedData( RowNumber, Data )
    SELECT 
        RowNumber = ROW_NUMBER() OVER( ORDER BY id ASC ),
        Data
    FROM @t;

SELECT 
    Name = REPLACE( N2.Data, 'name: ', '' ),
    Age = REPLACE( N1.Data, 'age: ', '' ) 
FROM @NumberedData N1 
    INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;

DELETE @NumberedData;

Example #3: Cursor

Again, it would be best to avoid running a query like this in real time and use a scheduled, transactional ETL process. In my experience, semi-structured data like this is prone to anomalies.

While examples #1 and #2 (and the solutions provided by others) demonstrate clever ways of working with the data, a more practical way to transform it would be a cursor. Why? It may actually perform better (no nested queries, recursion, pivoting, or row numbering), and even if it is slower it provides much better opportunities for error handling (see the validation sketch after the cursor).

-- this could be a table variable, temp table, or staging table
DECLARE @Results TABLE ( Name VARCHAR( 100 ), Age INT );

DECLARE @Index INT = 0, @Data VARCHAR( 100 ), @Name VARCHAR( 100 ), @Age INT;

DECLARE Person_Cursor CURSOR FOR SELECT Data FROM @t;
OPEN Person_Cursor;
FETCH NEXT FROM Person_Cursor INTO @Data;

WHILE( 1 = 1 ) BEGIN -- busy loop so we can handle the iteration following completion
    IF( @Index = 2 ) BEGIN
        INSERT @Results( Name, Age ) VALUES( @Name, @Age );
        SET @Index = 0;
    END
    ELSE BEGIN
            -- optional: examine @Data for integrity

        IF( @Index = 0 ) SET @Name = REPLACE( @Data, 'name: ', '' );
        IF( @Index = 1 ) SET @Age = CAST( REPLACE( @Data, 'age: ', '' ) AS INT );
        SET @Index = @Index + 1;
    END

    -- optional: examine @Index to see that there are no superfluous trailing 
    -- rows or rows omitted at the end.

    IF( @@FETCH_STATUS != 0 ) BREAK;
    FETCH NEXT FROM Person_Cursor INTO @Data;
END

CLOSE Person_Cursor;
DEALLOCATE Person_Cursor;
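
As a concrete example of that error-handling opportunity, a guard like this could sit just before the CAST inside the loop. This is only a sketch: TRY_CAST requires SQL Server 2012+, and on older servers you would need an ISNUMERIC-style check instead.

    -- sketch: validate the age row as it streams past, inside the loop
    IF( @Index = 1 AND TRY_CAST( REPLACE( @Data, 'age: ', '' ) AS INT ) IS NULL )
        RAISERROR( 'Malformed age row: %s', 16, 1, @Data );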

Performance

I created 100K rows of sample source data, and the three examples above performed roughly equivalently when transforming it.
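
For reference, sample data at that scale can be generated with a tally-style query. This is a sketch under assumptions: it targets a permanent table (dbo.LegacyData, an assumed name) rather than the table variable above, and the names/ages are synthetic.

-- generate ~100K rows (33,334 people x 3 rows each) of test data
CREATE TABLE dbo.LegacyData ( id INT IDENTITY( 1, 1 ), DATA NVARCHAR( 30 ) );

WITH Numbers AS
(
    SELECT TOP ( 33334 ) n = ROW_NUMBER() OVER( ORDER BY ( SELECT NULL ) )
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT dbo.LegacyData( DATA )
SELECT v.DATA
FROM Numbers
CROSS APPLY ( VALUES
    ( 1, 'name: Person ' + CAST( n AS VARCHAR( 10 ) ) ),
    ( 2, 'age: ' + CAST( 18 + n % 60 AS VARCHAR( 3 ) ) ),
    ( 3, '----------------' ) ) v( ord, DATA )
ORDER BY n, v.ord; -- ORDER BY here makes identity assignment follow row order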

I created a million rows of source data and a query similar to the following gives excellent performance for selecting a subset of rows (such as would be used in a grid on a web page or a report).

-- INT IDENTITY( 1, 1 ) numbers the rows for us
DECLARE @NumberedData TABLE( RowNumber INT IDENTITY( 1, 1 ), Data VARCHAR( 100 ) );

-- subset selection; ordering/filtering can be done here but it will need to preserve
-- the original 3 rows-per-result structure and it will impact performance
INSERT @NumberedData( Data )
    SELECT TOP 1000 Data FROM @t;

SELECT
    N1.RowNumber,
    Name = REPLACE( N2.Data, 'name: ', '' ),
    Age = REPLACE( N1.Data, 'age: ', '' ) 
FROM @NumberedData N1 
    INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;

DELETE @NumberedData;

I'm seeing execution times of 4-10 ms (i7-3960X) against a set of a million records.

Tim M.

Given that table, you can do this:

;WITH DATA
AS
(
    SELECT
        SUBSTRING(t.DATA,CHARINDEX(':',t.DATA)+2,LEN(t.DATA)) AS value,
        SUBSTRING(t.DATA,0,CHARINDEX(':',t.DATA)) AS ValueType,
        ID,
        ROW_NUMBER() OVER(ORDER BY ID) AS RowNbr
    FROM
        @t AS t
    WHERE
        NOT t.DATA='----------------'
)
, RecursiveCTE
AS
(
    SELECT
        Data.RowNbr,
        Data.value,
        Data.ValueType,
        NEWID() AS ID
    FROM
        Data
    WHERE
        Data.RowNbr=1
    UNION ALL
    SELECT
        Data.RowNbr,
        Data.value,
        Data.ValueType,
        CASE
            -- an age row inherits the ID of the preceding name row,
            -- so each name/age pair shares one NEWID() value
            WHEN Data.ValueType='age'
            THEN RecursiveCTE.ID
            ELSE NEWID()
        END AS ID
    FROM
        Data
        JOIN RecursiveCTE
            ON RecursiveCTE.RowNbr+1=Data.RowNbr
)
SELECT
    pvt.name,
    pvt.age
FROM
    (
        SELECT
            ID,
            value,
            ValueType
        FROM
            RecursiveCTE
    ) AS SourceTable
    PIVOT
    (
        MAX(Value)
        FOR ValueType IN ([name],[age])
    ) AS pvt

Output

Name          Age
------------- ----
Jim Ey        43
Jason Thwe    22
Johnson Dom   34
Arion

Here's another option if you upgrade to SQL Server 2012, which adds support for ORDER BY in the OVER clause of aggregate functions (enabling running aggregates, as used below). This approach allows you to pick only the tags you know you want, and it finds them regardless of how many rows there are between names.

This will also work if the names and ages are not always in the same order within a group of rows representing a single person.

with Ready2Pivot(tag,val,part) as (
  select
    CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,1,CHARINDEX(':',DATA)-1) END as tag,
    CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,CHARINDEX(':',DATA)+1,8000) END as val,
    -- running max: each row takes the id of the most recent
    -- 'name:' row, which becomes the grouping key (part)
    max(id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END)
    over (
      order by id
    )
  from @t
  where DATA like '_%:%'
)
  select [name], [age]
  from Ready2Pivot
  pivot (
    max(val)
    for tag in ([name], [age])
  ) as p

If your legacy data has an entry with extra items (say "altName: Jimmy"), this query will ignore it. If your legacy data has no row (and no id number) for someone's age, it will give you NULL in that spot. It will associate all information with the closest preceding row with "name: ..." as the DATA, so it is important that every group of rows has a "name: ..." row.
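
To sanity-check that behaviour, you could append a couple of hypothetical anomaly rows to @t (in the same batch) and re-run the query:

-- hypothetical anomaly rows for testing the claims above
INSERT INTO @t SELECT 9, 'altName: Jimmy';    -- unexpected extra tag
INSERT INTO @t SELECT 10, '----------------'; -- trailing delimiter

-- re-running the query still returns the same three people:
-- the delimiter fails the '_%:%' filter, and 'altName' pivots
-- to no column because only [name] and [age] are listed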

Steve Kass