
While handling a dataframe in pandas, I came across some unexpected cells containing values like:

E_no E_name
6654-0984 Elvin-Johnson
430 Fred
663/547/900 Banty/Shon/Crio
87 Arif
546 Zerin
322,76 Chris,Deory

In some rows, more than one E_name and E_no have been assigned, but each cell is supposed to hold a single employee. My data consists of both the E_no and E_name columns, and these combined values need to be separated into different rows.

What I want is

E_no E_name
6654 Elvin
0984 Johnson
430 Fred
663 Banty
547 Shon
900 Crio
87 Arif
546 Zerin
322 Chris
76 Deory

Separate those values and put them in different rows. Please help me do this so that I can proceed further, and it would be really helpful if someone could explain the logic, i.e. how to think about this problem. Thanks in advance.

Let me know if you are facing any difficulty understanding the problem.

Red Vibes
  • Markdown tables shouldn't be used for pandas data; they aren't useful to work with. Ideally the data should be copyable, posted in code blocks – noah Jan 15 '21 at 19:53
  • 2
    Does this answer your question? [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – noah Jan 15 '21 at 19:54
  • use split and explode to get them into multiple rows – Joe Ferndz Jan 15 '21 at 20:06
  • @noah it helped but I didn't get my answer, anyway thanks for your reply – Red Vibes Jan 15 '21 at 20:14
  • @Joe Ferndz thanks for your reply, I will study split and explode for sure. – Red Vibes Jan 15 '21 at 20:16
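The split/explode approach suggested in the comments can be sketched as follows (a minimal sketch assuming pandas 0.25+, where `Series.explode` is available; the character class `[-,/]` covers the three delimiters in the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "E_no": ["6654-0984", "430", "663/547/900", "87", "546", "322,76"],
    "E_name": ["Elvin-Johnson", "Fred", "Banty/Shon/Crio", "Arif", "Zerin", "Chris,Deory"],
})

# Split each column on any delimiter, explode the lists into rows,
# and realign the two exploded columns by position.
out = pd.DataFrame({
    col: df[col].str.split(r"[-,/]").explode().reset_index(drop=True)
    for col in df.columns
})
print(out)
```

This relies on every cell having the same number of parts in both columns, so the positional realignment stays in step.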

5 Answers

2

Similar to Divyaansh's solution. Just use split, explode and merge.

import pandas as pd

df = pd.DataFrame({'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
                   'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']})

# Split on any of the delimiters (-, comma, /) and explode each column into rows
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Merge both columns back together on their positional index
df2 = pd.merge(x, y, left_index=True, right_index=True)

# Print the modified dataframe
print(df2)

Output of this will be:

Original Dataframe:

          E_no           E_name
0    6654-0984    Elvin-Johnson
1          430             Fred
2  663/547/900  Banty/Shon/Crio
3           87             Arif
4          546            Zerin
5       322,76      Chris,Deory

Modified Dataframe:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory

Alternatively, you can create a new dataframe from the values in x and y.

x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Create a new dataframe from the exploded values in x and y
df3 = pd.DataFrame({'E_no': x, 'E_name': y})

print(df3)

Same result as before.

Or this:

# Explode each column, keeping the default integer index
x = df['E_no'].str.split(r'[-,/]').explode().reset_index()
y = df['E_name'].str.split(r'[-,/]').explode().reset_index()

# Create a new dataframe from the exploded columns
df3 = pd.DataFrame({'E_no': x['E_no'], 'E_name': y['E_name']})

print(df3)

Or you can do:

# Explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Stack the two Series as rows, then transpose
df4 = pd.DataFrame([x, y]).T

print(df4)
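As a side note, assuming pandas 1.3 or newer (where `DataFrame.explode` accepts a list of columns), the two splits can also be exploded together in one step, without realigning indexes by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "E_no": ["6654-0984", "430", "663/547/900"],
    "E_name": ["Elvin-Johnson", "Fred", "Banty/Shon/Crio"],
})

# Split every column into lists of parts, then explode both columns at once.
lists = df.apply(lambda s: s.str.split(r"[-,/]"))
out = lists.explode(["E_no", "E_name"]).reset_index(drop=True)
print(out)
```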
Joe Ferndz
1

Split, flatten, recombine, rename:

# \W matches any non-word character (-, comma, /), so split on it and flatten
a = [item for sublist in df.E_no.str.split(r'\W').tolist() for item in sublist]
b = [item for sublist in df.E_name.str.split(r'\W').tolist() for item in sublist]
df2 = pd.DataFrame(list(zip(a, b)), columns=df.columns)

Output:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory
noah
0

You should provide multiple delimiters in the read_csv arguments.

pd.read_csv("filepath", sep='-|/|,| ', engine='python')  # a regex separator requires the Python parsing engine

This is the best I can help right now without the data tables.
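For what it's worth, a quick sketch of what that separator does on two of the sample rows (hypothetical inline data; the longest row must come first, or the parser complains about extra fields): the regex splits on every delimiter, but each match starts a new column, not a new row.

```python
import io

import pandas as pd

data = "663/547/900 Banty/Shon/Crio\n6654-0984 Elvin-Johnson\n"

# A regex separator forces the slower Python parsing engine;
# shorter rows are padded with NaN to match the widest row.
df = pd.read_csv(io.StringIO(data), sep=r"[-/, ]", engine="python", header=None)

print(df.shape)  # wider, not longer
```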

itsDV7
  • thanks for the reply bro, but would it bring those separated values into different rows? I think it would only remove symbols like dashes, commas, slashes etc. – Red Vibes Jan 15 '21 at 19:55
  • This won't work. It will make more columns, not more rows. This is not what OP wants – noah Jan 15 '21 at 19:55
0

I think this is actually rather tricky. Here's a solution in which we use the E_no column to build a column of regexes that we will then use to split the two original columns into parts. Finally we construct a new DataFrame from those parts. This method ensures that the second column's format matches the first's.

import re

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"E_no": "6654-0984", "E_name": "Elvin-Johnson"},
        {"E_no": "430", "E_name": "Fred"},
        {"E_no": "663/547/900", "E_name": "Banty/Shon/Crio"},
        {"E_no": "87", "E_name": "Arif"},
        {"E_no": "546", "E_name": "Zerin"},
        {"E_no": "322,76", "E_name": "Chris,Deory"},
        {"E_no": "888+88", "E_name": "FIRST+SEC|OND"},
        {"E_no": "999|99", "E_name": "TH,IRD|FOURTH"},
    ]
)


def get_pattern(e_no, delimiters=None):
    if delimiters is None:
        delimiters = "-/,|+"

    delimiters = "|".join(re.escape(d) for d in delimiters)
    non_match_delims = f"(?:(?!{delimiters}).)*"
    delim_parts = re.findall(f"{non_match_delims}({delimiters})", e_no)

    pattern_parts = []
    for delim_part in delim_parts:
        delim = re.escape(delim_part)
        pattern_parts.append(f"((?:(?!{delim}).)*)")
        pattern_parts.append(delim)
    pattern_parts.append("(.*)")

    return "".join(pattern_parts)


def extract_items(row, delimiters=None):
    pattern = get_pattern(row["E_no"], delimiters)
    nos = re.search(pattern, row["E_no"]).groups()
    names = re.search(pattern, row["E_name"]).groups()
    return (nos, names)


nos, names = map(
    lambda L: [e for tup in L for e in tup],
    zip(*df.apply(extract_items, axis=1))
)

print(pd.DataFrame({"E_no": nos, "E_names": names}))

    E_no  E_names
0   6654    Elvin
1   0984  Johnson
2    430     Fred
3    663    Banty
4    547     Shon
5    900     Crio
6     87     Arif
7    546    Zerin
8    322    Chris
9     76    Deory
10   888    FIRST
11    88  SEC|OND
12   999   TH,IRD
13    99   FOURTH
BallpointBen
  • hey mate thanks for your reply, I really loved your user name..lol :) well I'm not an expert in python so your answer went straight over my head. I will give this code a run in the morning and will get back if I face any challenges. But I'm hoping it will run smoothly. Thanks again. – Red Vibes Jan 15 '21 at 20:30
0

Here is my approach to this:

1) Replace - and / with a comma (,) in both columns.

2) Split each field on the comma (,), expand it, and then stack it.

3) Reset the index of each resulting data frame.

4) Finally, merge the two data frames into one finalDF.

import pandas as pd

a = {'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
     'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']}

df = pd.DataFrame(a)

# Normalize all delimiters to a comma, then split, expand and stack into rows
df1 = df['E_no'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df2 = df['E_name'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()

# Drop the helper index columns created by stack/reset_index and rename the data column
df1.drop(['level_0', 'level_1'], axis=1, inplace=True)
df1.rename(columns={0: 'E_no'}, inplace=True)

df2.drop(['level_0', 'level_1'], axis=1, inplace=True)
df2.rename(columns={0: 'E_name'}, inplace=True)

finalDF = pd.merge(df1, df2, left_index=True, right_index=True)

Output:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory