1

I have a problem with exploding a DataFrame down into separate rows against a comma delimited list after splitting across to a set amount of cols. I'm trying to achieve this in Pandas but if this is possible using raw SQL (I tried and gave up) then that would be an ideal solution.

Sample Data

Reference   Surname   Forename   CurrentPostCode   PreviousPostCodes
1           Smith     John       WA1 2LA           WA2 HG5, LN4 6XS
2           Jones     Jack       NA1 2NE           None
3           Potter    Harry      LI8 0NX           None
4           Wane      Bruce      HE27 4PR          HE5 9PR
5           Finn      Grahame    B26 7UP           B15 6UR, B22 9JK, B13 3YT

The I want to split the PreviousPostCodes column across into two columns PPC1 and PPC2 and if the array/comma separated list has more than 2 items in it (in the case of Ref 5) it would need to split the first two and the add a row below and fill PPC1 with B13 3YT

Desired Output

Reference   Surname   Forename   CurrentPostCode   PPC1       PPC2
1           Smith     John       WA1 2LA           WA2 HG5    LN4 6XS
2           Jones     Jack       NA1 2NE           None       None
3           Potter    Harry      LI8 0NX           None       None
4           Wane      Bruce      HE27 4PR          HE5 9PR    None
5           Finn      Grahame    B26 7UP           B15 6UR    B22 9JK
5           Finn      Grahame    B26 7UP           B13 3YT    None

I hope this makes sense, I can split the list out but i get n cols and i want to limit that to a maximum size of 2, and overflow onto new rows if it exceeds 2. There isn't a limit to the amount of previous postcodes in the data so if there were 5 in the comma separated list, it would need to explode the row down into 3 new rows.

Thanks

CheyRav90
  • 65
  • 1
  • 6

2 Answers2

0
df[['PPC1','PPC2']] = df.pop('PreviousPostCodes').str.split(',\s*', n=1, expand=True)
df['PPC2'] = df['PPC2'].fillna('').str.split(',\s*', expand=False)

yields:

In [173]: df
Out[173]:
   Reference Surname Forename CurrentPostCode     PPC1                PPC2
0          1   Smith     John         WA1 2LA  WA2 HG5           [LN4 6XS]
1          2   Jones     Jack         NA1 2NE      NaN                  []
2          3  Potter    Harry         LI8 0NX      NaN                  []
3          4    Wane    Bruce        HE27 4PR  HE5 9PR                  []
4          5    Finn  Grahame         B26 7UP  B15 6UR  [B22 9JK, B13 3YT]

now we can use explode() function:

In [174]: explode(df, lst_cols='PPC2')
Out[174]:
   Reference Surname Forename CurrentPostCode     PPC1     PPC2
0          1   Smith     John         WA1 2LA  WA2 HG5  LN4 6XS
1          2   Jones     Jack         NA1 2NE      NaN
2          3  Potter    Harry         LI8 0NX      NaN
3          4    Wane    Bruce        HE27 4PR  HE5 9PR
4          5    Finn  Grahame         B26 7UP  B15 6UR  B22 9JK
5          5    Finn  Grahame         B26 7UP  B15 6UR  B13 3YT
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • That's almost it, PPC1 Is static in this case though, it's always Ordinal Position 0 of the list, the solution would be better if it was [1], [2] for the first row then [3], [4]. I think currently it's [1], [2], then [1], [3], if that makes sense? – CheyRav90 Jan 24 '18 at 11:30
  • Also thanks for the solution! the reason the above format is preferable is that this output data is being sent to be searched against and there is a max limit on the PPC per row of 2, having PPC1 statically being the first item in the list means we need to do more searches. – CheyRav90 Jan 24 '18 at 11:51
0

Try this Sql script it may hepls you .Below is sample data

IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp
;With cte(Reference ,  Surname,   Forename ,  CurrentPostCode,   PreviousPostCodes)
AS
(
SELECT 1,'Smith' ,'John'   , 'WA1 2LA'  ,'WA2 HG5, LN4 6XS,B13 3YT,AA18 3YT,YT783 3YT'              UNION ALL
SELECT 2,'Jones' ,'Jack'   , 'NA1 2NE'  ,'None'                         UNION ALL
SELECT 3,'Potter','Harry'  , 'LI8 0NX'  ,'None'                         UNION ALL
SELECT 4,'Wane'  ,'Bruce'  , 'HE27 4PR' ,'HE5 9PR,B13 3YT,RT4 YT5'                      UNION ALL
SELECT 5,'Finn'  ,'Grahame', 'B26 7UP'  ,'B15 6UR, B22 9JK, B13 3YT'
)
SELECT * INTO #temp  FROM cte
SELECT * FROM #temp 

By using dynamic sql we get n number of columns depending up on the prviousPostCode column where data is separated by comma ,n column will be created as old postcodes depend on n number of comma

  --To get the number of columns to be divided dynamically
    DECLARE @ColumnsDivideCnt INT
        ,@Dyncol nvarchar(max)
        ,@Sql nvarchar(max)
;WITH cte
AS
(
SELECT 0 As Rn, CHARINDEX(',',PreviousPostCodes+',') AS Pos ,PreviousPostCodes FROM #temp
UNION ALL
SELECT Pos+1,CHARINDEX(',',PreviousPostCodes+',',Pos+1) ,PreviousPostCodes
FROM cte

WHERE Pos >0
)
SELECT @ColumnsDivideCnt=MAX(ColumnToGet) FROm
(
SELECT PreviousPostCodes, Pos,ROW_NUMBER()OVER(Partition by PreviousPostCodes Order by PreviousPostCodes) AS ColumnToGet FROM cte
WHERE Pos >0
GROUP BY PreviousPostCodes,Pos
)dt

--Get the column names dynamically
;WIth cte2
AS
(
SELECT 1 AS Rn 
UNION ALL
SELECT Rn+1
From cte2
WHERE Rn<@ColumnsDivideCnt
)
SELECT @Dyncol=STUFF((SELECT ', ' + ReqCol FROM
(
SELECT 'ISNULL(Split.a.value('+'''/S['+CAST(Rn AS VARCHAR(2))+']'+''''+','+'''NVARCHAR(1000)'''+'),''None'') As [PPC'+CAST(Rn AS VARCHAR(2))+']'  AS ReqCol FROM cte2
)Dt
FOR XML PATH ('')),1,1,'')


SET @Sql='SELECT DISTINCT
                Reference
                ,Surname
                ,Forename
                ,CurrentPostCode
                ,'+@Dyncol+'
        FROM (
            SELECT Reference,Surname,Forename,CurrentPostCode,
                CAST(''<S>''+REPLACE(PreviousPostCodes,'','',''</S><S>'')+''</S>'' AS XML)AS  PreviousPostCodes
            FROM #temp
) AS A
CROSS APPLY PreviousPostCodes.nodes(''S'') AS Split(a)
'
PRINT @Sql
EXEC (@Sql)

Result before Dynamic sql script run

Reference   Surname Forename    CurrentPostCode     PreviousPostCodes
-----------------------------------------------------------------------------------------------
1           Smith   John        WA1 2LA             WA2 HG5, LN4 6XS,B13 3YT,AA18 3YT,YT783 3YT
2           Jones   Jack        NA1 2NE             None
3           Potter  Harry       LI8 0NX             None
4           Wane    Bruce       HE27 4PR            HE5 9PR,B13 3YT,RT4 YT5
5           Finn    Grahame     B26 7UP             B15 6UR, B22 9JK, B13 3YT

Result AfterDynamic sql script Run

Reference   Surname Forename    CurrentPostCode  PPC1        PPC2       PPC3        PPC4
--------------------------------------------------------------------------------------------------
1           Smith   John         WA1 2LA        WA2 HG5     LN4 6XS     B13 3YT     AA18 3YT
2           Jones   Jack         NA1 2NE        None        None        None        None
3           Potter  Harry        LI8 0NX        None        None        None        None
4           Wane    Bruce        HE27 4PR       HE5 9PR     B13 3YT     RT4 YT5     None
5           Finn    Grahame      B26 7UP        B15 6UR     B22 9JK     B13 3YT     None
Sreenu131
  • 2,476
  • 1
  • 7
  • 18
  • Thanks for the answer, this splits the PreviousPostCodes as required but i need a new row once the count of PreviousPostCodes exceeds 2,4,6 etc. and to duplicate the rest of the row information. – CheyRav90 Jan 24 '18 at 12:35