
While handling a dataframe in pandas, I came across some unexpected cells containing values like:

E_no E_name
6654-0984 Elvin-Johnson
430 Fred
663/547/900 Banty/Shon/Crio
87 Arif
546 Zerin
322,76 Chris,Deory

In some rows, more than one E_name and E_no have been assigned, but each cell is supposed to hold a single employee. My data consists of both the E_no and E_name columns, and these combined values need to be separated into different rows.

What I want is

E_no E_name
6654 Elvin
0984 Johnson
430 Fred
663 Banty
547 Shon
900 Crio
87 Arif
546 Zerin
322 Chris
76 Deory

Separate those values and put them in different rows. Please help me do this so that I can proceed further, and it would be really helpful if someone could explain the logic, i.e. how to think about this problem. Thanks in advance.

Let me know if you are facing any difficulty understanding the problem.

Red Vibes
  • Markdown tables shouldn't be used for pandas data; they aren't useful to work with. Ideally the data should be copyable, posted in code blocks – noah Jan 15 '21 at 19:53
  • 2
    Does this answer your question? [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – noah Jan 15 '21 at 19:54
  • use split and explode to get them into multiple rows – Joe Ferndz Jan 15 '21 at 20:06
  • @noah it helped but I didn't get my answer, anyway thanks for your reply – Red Vibes Jan 15 '21 at 20:14
  • @Joe Ferndz thanks for your reply, I will study split and explode for sure. – Red Vibes Jan 15 '21 at 20:16
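The split/explode approach suggested in the comments can be sketched as follows (a minimal sketch assuming pandas 0.25+, where `Series.explode` is available; the character class `[-,/]` covers the three delimiters in the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "E_no": ["6654-0984", "430", "663/547/900", "87", "546", "322,76"],
    "E_name": ["Elvin-Johnson", "Fred", "Banty/Shon/Crio", "Arif", "Zerin", "Chris,Deory"],
})

# Split each column on any delimiter, explode the lists into rows,
# and realign the two exploded columns by position.
out = pd.DataFrame({
    col: df[col].str.split(r"[-,/]").explode().reset_index(drop=True)
    for col in df.columns
})
print(out)
```

This relies on every cell having the same number of parts in both columns, so the positional realignment stays in step.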

5 Answers

2

Similar to Divyaansh's solution. Just use split, explode and merge.

import pandas as pd

df = pd.DataFrame({'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
                   'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']})

# Split on any of the delimiters (-, comma, /) and explode each column into rows
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Merge both columns back together on their positional index
df2 = pd.merge(x, y, left_index=True, right_index=True)

# Print the modified dataframe
print(df2)

Output of this will be:

Original Dataframe:

          E_no           E_name
0    6654-0984    Elvin-Johnson
1          430             Fred
2  663/547/900  Banty/Shon/Crio
3           87             Arif
4          546            Zerin
5       322,76      Chris,Deory

Modified Dataframe:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory

Alternatively, you can create a new dataframe from the values in x and y.

x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Create a new dataframe from the exploded values in x and y
df3 = pd.DataFrame({'E_no': x, 'E_name': y})

print(df3)

Same result as before.

Or this:

# Explode each column, keeping the default integer index
x = df['E_no'].str.split(r'[-,/]').explode().reset_index()
y = df['E_name'].str.split(r'[-,/]').explode().reset_index()

# Create a new dataframe from the exploded columns
df3 = pd.DataFrame({'E_no': x['E_no'], 'E_name': y['E_name']})

print(df3)

Or you can do:

# Explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)

# Stack the two Series as rows, then transpose
df4 = pd.DataFrame([x, y]).T

print(df4)
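As a side note, assuming pandas 1.3 or newer (where `DataFrame.explode` accepts a list of columns), the two splits can also be exploded together in one step, without realigning indexes by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "E_no": ["6654-0984", "430", "663/547/900"],
    "E_name": ["Elvin-Johnson", "Fred", "Banty/Shon/Crio"],
})

# Split every column into lists of parts, then explode both columns at once.
lists = df.apply(lambda s: s.str.split(r"[-,/]"))
out = lists.explode(["E_no", "E_name"]).reset_index(drop=True)
print(out)
```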
Joe Ferndz
1

Split, flatten, recombine, rename:

# \W matches any non-word character (-, comma, /), so split on it and flatten
a = [item for sublist in df.E_no.str.split(r'\W').tolist() for item in sublist]
b = [item for sublist in df.E_name.str.split(r'\W').tolist() for item in sublist]
df2 = pd.DataFrame(list(zip(a, b)), columns=df.columns)

Output:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory
noah
0

You should provide multiple delimiters in the read_csv arguments.

pd.read_csv("filepath", sep='-|/|,| ', engine='python')  # a regex separator requires the Python parsing engine

This is the best I can help right now without the data tables.
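For what it's worth, a quick sketch of what that separator does on two of the sample rows (hypothetical inline data; the longest row must come first, or the parser complains about extra fields): the regex splits on every delimiter, but each match starts a new column, not a new row.

```python
import io

import pandas as pd

data = "663/547/900 Banty/Shon/Crio\n6654-0984 Elvin-Johnson\n"

# A regex separator forces the slower Python parsing engine;
# shorter rows are padded with NaN to match the widest row.
df = pd.read_csv(io.StringIO(data), sep=r"[-/, ]", engine="python", header=None)

print(df.shape)  # wider, not longer
```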

itsDV7
  • thanks for the reply bro, but would it bring those separated values into different rows? I think it would only remove symbols like dashes, commas, slashes etc. – Red Vibes Jan 15 '21 at 19:55
  • This won't work. It will make more columns, not more rows. This is not what OP wants – noah Jan 15 '21 at 19:55
0

I think this is actually rather tricky. Here's a solution in which we use the E_no column to build a column of regexes that we will then use to split the two original columns into parts. Finally we construct a new DataFrame from those parts. This method ensures that the second column's format matches the first's.

import re

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"E_no": "6654-0984", "E_name": "Elvin-Johnson"},
        {"E_no": "430", "E_name": "Fred"},
        {"E_no": "663/547/900", "E_name": "Banty/Shon/Crio"},
        {"E_no": "87", "E_name": "Arif"},
        {"E_no": "546", "E_name": "Zerin"},
        {"E_no": "322,76", "E_name": "Chris,Deory"},
        {"E_no": "888+88", "E_name": "FIRST+SEC|OND"},
        {"E_no": "999|99", "E_name": "TH,IRD|FOURTH"},
    ]
)


def get_pattern(e_no, delimiters=None):
    if delimiters is None:
        delimiters = "-/,|+"

    delimiters = "|".join(re.escape(d) for d in delimiters)
    non_match_delims = f"(?:(?!{delimiters}).)*"
    delim_parts = re.findall(f"{non_match_delims}({delimiters})", e_no)

    pattern_parts = []
    for delim_part in delim_parts:
        delim = re.escape(delim_part)
        pattern_parts.append(f"((?:(?!{delim}).)*)")
        pattern_parts.append(delim)
    pattern_parts.append("(.*)")

    return "".join(pattern_parts)


def extract_items(row, delimiters=None):
    pattern = get_pattern(row["E_no"], delimiters)
    nos = re.search(pattern, row["E_no"]).groups()
    names = re.search(pattern, row["E_name"]).groups()
    return (nos, names)


nos, names = map(
    lambda L: [e for tup in L for e in tup],
    zip(*df.apply(extract_items, axis=1))
)

print(pd.DataFrame({"E_no": nos, "E_names": names}))

    E_no  E_names
0   6654    Elvin
1   0984  Johnson
2    430     Fred
3    663    Banty
4    547     Shon
5    900     Crio
6     87     Arif
7    546    Zerin
8    322    Chris
9     76    Deory
10   888    FIRST
11    88  SEC|OND
12   999   TH,IRD
13    99   FOURTH
BallpointBen
  • hey mate thanks for your reply, I really loved your user name..lol :) well I'm not an expert in python so your answer went straight over my head. I will give this code a run in the morning and will get back if I face any challenges. But I'm hoping it will run smoothly. Thanks again. – Red Vibes Jan 15 '21 at 20:30
0

Here is my approach to this:

1) Replace - and / with a comma (,) in both columns.

2) Split each field on the comma (,), expand it, and then stack it.

3) Reset the index of each resulting data frame.

4) Finally, merge the two data frames into one finalDF.

import pandas as pd

a = {'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
     'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']}

df = pd.DataFrame(a)

# Normalize all delimiters to a comma, then split, expand and stack into rows
df1 = df['E_no'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df2 = df['E_name'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()

# Drop the helper index columns created by stack/reset_index and rename the data column
df1.drop(['level_0', 'level_1'], axis=1, inplace=True)
df1.rename(columns={0: 'E_no'}, inplace=True)

df2.drop(['level_0', 'level_1'], axis=1, inplace=True)
df2.rename(columns={0: 'E_name'}, inplace=True)

finalDF = pd.merge(df1, df2, left_index=True, right_index=True)

Output:

   E_no   E_name
0  6654    Elvin
1  0984  Johnson
2   430     Fred
3   663    Banty
4   547     Shon
5   900     Crio
6    87     Arif
7   546    Zerin
8   322    Chris
9    76    Deory