1

I want to split equal length string without the splitter and expand the dataframe.

Here is the test dataframe I am using:

sample1 = pd.DataFrame({
        'TST': {1: 1535840000000, 2: 1535840000000}, 
        'RCV': {1: 1535840000000, 2: 1535850000000}, 
        'TCU': {1: 358272000000000, 2: 358272000000000}, 
        'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'}
        })

As you can see, the SPD column contains various length string without any splitter.

I want to split the SPD column every 4 characters into new rows, then expand them to the dataframe.

             TST            RCV              TCU   SPD
0  1535840000000  1535840000000  358272000000000  0000
1  1535840000000  1535840000000  358272000000000  0000
2  1535840000000  1535840000000  358272000000000  0000
3  1535840000000  1535840000000  358272000000000  0071
4  1535840000000  1535840000000  358272000000000  0000
5  1535840000000  1535840000000  358272000000000  007D
6  1535840000000  1535840000000  358272000000000  007C
7  1535840000000  1535840000000  358272000000000  00E2

I tried to firstly generate a Series by using this:

pd.concat([pd.Series(re.findall('....', row['SPD'])) for _, row in sample1.iterrows()]).reset_index()

which gives

   index     0
0      0  0000
1      1  0000
2      2  0000
3      3  0071
4      4  0000
5      5  007D
6      6  007C
7      7  00E2

But I could not expand it back the sample1

cs95
  • 379,657
  • 97
  • 704
  • 746
J.D
  • 1,885
  • 4
  • 11
  • 19

3 Answers3

3

You can use str.findall, then repeat the rows based on the number of 4 character slices from SPD.

from itertools import chain

spd4 = df.pop('SPD').str.findall(r'.{4}') 

(pd.DataFrame(df.values.repeat(spd4.str.len(), axis=0), columns=df.columns)
   .assign(SPD=list(chain.from_iterable(spd4))))

             TST            RCV              TCU   SPD
0  1535840000000  1535850000000  358272000000000  0000
1  1535840000000  1535850000000  358272000000000  0000
2  1535840000000  1535850000000  358272000000000  0000
3  1535840000000  1535850000000  358272000000000  0071
4  1535840000000  1535850000000  358272000000000  0000
5  1535840000000  1535850000000  358272000000000  007D
6  1535840000000  1535850000000  358272000000000  007C
7  1535840000000  1535850000000  358272000000000  00E2
cs95
  • 379,657
  • 97
  • 704
  • 746
2

You could split the strings in SPD every 4 characters using str.findall, and then unnest the resulting dataframe with unnesting from the linked solution:

sample1['SPD'] = sample1.SPD.str.ljust(4, '0').str.findall(r'.{4}?')
unnesting(sample1, ['SPD'])

   SPD            TST            RCV              TCU
1  0000  1535840000000  1535840000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  0071  1535840000000  1535850000000  358272000000000
2  0000  1535840000000  1535850000000  358272000000000
2  007D  1535840000000  1535850000000  358272000000000
2  007C  1535840000000  1535850000000  358272000000000
2  00E2  1535840000000  1535850000000  358272000000000
yatu
  • 86,083
  • 12
  • 84
  • 139
1

use Series.str.extractall and then join with the original df.

sample1.filter(regex='^(?!SPD)').join(
    sample1.SPD.str.extractall('(?P<SPD>.{4})').reset_index(level=1, drop=True)
) 

#             TST            RCV              TCU   SPD
#1  1535840000000  1535840000000  358272000000000   NaN
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  0071
#2  1535840000000  1535850000000  358272000000000  0000
#2  1535840000000  1535850000000  358272000000000  007D
#2  1535840000000  1535850000000  358272000000000  007C
#2  1535840000000  1535850000000  358272000000000  00E2

use inner join(... how='inner') if you want to exclude rows having less that 4-char SPD.

jxc
  • 13,553
  • 4
  • 16
  • 34