0

I need to split a dataframe column into multiple columns to make sure only two value is contained within each cell. The current dataframe looks like:

          Name     |  Number |  Code |
         ..............................
         Tom      | 78797071|       0
         Nick     |         | 89797071
         Juli     |         | 57797074
         June     | 39797571|       0
         Junw     |         | 23000000|

if code contain 8 digit number then split every two digit number in each column and if 00 comes in any of the DIV it should be marked as 'incomplete'

The new dataframe should look like:

     Name     |  Number |  Code |  DIV|DIV2|DIV3|DIV4|Incomplete  |
     ........................................................................
     Tom      | 78797071|       0 | 0 |   0|  0 |   0 |incomplete |
     Nick     |         | 89797071| 89| 79 | 70 | 71  |complete   |
     Juli     |         | 57797074| 57| 79 | 70 | 74  |complete   |
     June     | 39797571|       0 |  0|   0|  0 |   0 |complete   |
     Junw     |         | 23000000| 23|  00| 00 | 00  |incomplete |

3 Answers3

0

Try this quick fix.

import pandas as pd
import re

#data-preprocessing
data = {'Name': ['Tom','Nick','Juli','June','Junw'],'Code': ['0', '89797071', '57797074', '0', '23000000']}

#I omitted Number key in data

df = pd.DataFrame(data)

print(df)

#find patterns

pattern = r'(\d{2})(\d{2})(\d{2})(\d{2})'
zero_pattern = r'0{1,}'

split_data = []

for _ in df['Code'].items():

  to_find = _[1]

  splitted = re.findall(pattern, to_find)
  if splitted:
    temp = list(splitted[0])
    if '00' in temp:
      temp.append('incomplete')
    else:
      temp.append('complete')
    split_data.append(temp)

  zeromatch = re.match(zero_pattern, to_find)
  if zeromatch:
    split_data.append(['0','0','0','0','incomplete'])

#make right dataframe

col_name = ['DIV1','DIV2','DIV3','DIV4','Incomplete']

df2 = pd.DataFrame(split_data, columns=col_name)  

df[col_name]= df2

print(df)

Output

   Name      Code
0   Tom         0
1  Nick  89797071
2  Juli  57797074
3  June         0
4  Junw  23000000
   Name      Code DIV1 DIV2 DIV3 DIV4  Incomplete
0   Tom         0    0    0    0    0  incomplete
1  Nick  89797071   89   79   70   71    complete
2  Juli  57797074   57   79   70   74    complete
3  June         0    0    0    0    0  incomplete
4  Junw  23000000   23   00   00   00  incomplete
QuantStats
  • 1,448
  • 1
  • 6
  • 14
0

you can do it using string functions zfill and findall like below


df.Code = df.Code.astype(np.str)

## zfill will pad string with 0 to make its lenght 8, findall will find each pair of digit
## explode will split list into rows (explode works with pandas 0.25 and above)
## reshape to make it 4 columns
arr = df.Code.str.zfill(8).str.findall(r"(\d\d)").explode().values.reshape(-1, 4)

## create new dataframe from arr with given column names
df2 = pd.DataFrame(arr, columns=[f"Div{i+1}" for i in range(arr.shape[1])])

## set "Incomplete" colum to incomplete if any column of row contains "00"
df2["Incomplete"] = np.where(np.any(arr == "00", axis=1), "incomplete", "complete")

pd.concat([df,df2], axis=1)


Result

        Name    Number  Code    Div1    Div2    Div3    Div4    Incomplete
0   Tom 78797071    0   00  00  00  00  incomplete
1   Nick        89797071    89  79  70  71  complete
2   Juli        57797074    57  79  70  74  complete
3   June    39797571    0   00  00  00  00  incomplete
4   Junw        23000000    23  00  00  00  incomplete
Dev Khadka
  • 5,142
  • 4
  • 19
  • 33
0

You can use str.findall("..") to split the values, then join the list on the original df. Use apply to get the complete/incomplete status.

import pandas as pd

df = pd.DataFrame({"Name":["Tom","Nick","Juli","June","Junw"],
                   "Number":[78797071, 0, 0, 39797571, 0],
                   "Code":[0, 89797071, 57797074, 0, 23000000]})

df = df.join(pd.DataFrame(df["Code"].astype(str).str.findall("..").values.tolist()).add_prefix('DIV')).fillna("00")
df["Incomplete"] = df.iloc[:,3:7].apply(lambda row: "incomplete" if row.str.contains('00').any() else "complete", axis=1)

print (df)

#
   Name    Number      Code DIV0 DIV1 DIV2 DIV3  Incomplete
0   Tom  78797071         0   00   00   00   00  incomplete
1  Nick         0  89797071   89   79   70   71    complete
2  Juli         0  57797074   57   79   70   74    complete
3  June  39797571         0   00   00   00   00  incomplete
4  Junw         0  23000000   23   00   00   00  incomplete
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
  • thanks @Henry Yik it working what if i want to replace 0 value present in CODE with the value present in NUMBER i have use this df = df[df.Code == '0']['Number'] this replaceswith only two values –  Oct 05 '19 at 07:38
  • Do you mean you want to fill the 0 in column Code with value from adjacent number value? – Henry Yik Oct 05 '19 at 07:40
  • Use `np.where`. `df['Code'] = np.where(df['Code'] == 0, df['Number'], df['Code'])`. – Henry Yik Oct 05 '19 at 07:45
  • why we are using .values.tolist() to create list of list –  Oct 05 '19 at 08:15
  • `pd.DataFrame(df["Code"].astype(str).str.findall("..").values.tolist())` creates a new `DataFrame` object using the list of lists from your original `df`. – Henry Yik Oct 05 '19 at 08:17
  • i had one more question what if column are not fixed then df.iloc[:,3:7] condition will failed –  Oct 05 '19 at 08:23
  • Read [Selecting multiple columns in a pandas dataframe](https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe). – Henry Yik Oct 05 '19 at 08:26