
I have a pandas DataFrame that contains many columns like Name, Email, Mobile Number, etc., which looks like this:

Sr No.    Name    Email              Mobile Number
1.        John    joh***@gmail.com   1234567890,2345678901
2.        kylie   k.ki**@yahoo.com   6789012345
3.        jon     null               1234567890
4.        kia     kia***@gmail.com   6789012345
5.        sam     b.sam**@gmail.com  4567890123

I want to remove the rows that contain the same Mobile Number. One person can have more than one number. I tried to do this with the drop_duplicates function:

newdf = df.drop_duplicates(subset = ['Mobile Number'],keep=False)

Here is the output:

Sr No.    Name    Email              Mobile Number
1.        John    joh***@gmail.com   1234567890,2345678901
3.        jon     null               1234567890
5.        sam     b.sam**@gmail.com  4567890123

But the problem is that it only removes rows whose Mobile Number field is exactly the same, whereas I want to remove a row if it shares even one number with another row, e.g. Sr No. 1 and 3 have one number in common. How can I remove them so the final output looks like this:

 final output:
 Sr No.    Name    Email              Mobile Number
 5.        sam     b.sam**@gmail.com  4567890123
rahul
  • Do you want to capture the second mobile number as a separate row? or delete the entire row for John? – Joe Ferndz Mar 30 '21 at 08:43
  • I want to delete the entire row if any single number matches. – rahul Mar 30 '21 at 08:47
  • I was able to solve this. See my response. Let me know if any scenario fails. Also would like to know the performance on this if you are running on a large dataset. – Joe Ferndz Mar 30 '21 at 10:16

2 Answers


Alright. It is a complicated solution, but I was able to solve it.

Here's how I am doing it.

First, I take all the mobile numbers and split them on ,. Then I explode them (explode keeps the original row index for every value); a minimal sketch of this step follows below.

Then I find the indexes of all rows whose numbers appear more than once.

Then I exclude those indexes from the original dataframe.

This will give you only the rows that do not share any number with another row.
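
Here is a tiny sketch (toy values, not the data below) of what the split + explode step does; note that each exploded row keeps the index of the row it came from:

import pandas as pd

# toy frame, only to illustrate explode
toy = pd.DataFrame({'Name': ['a', 'b'],
                    'Mobile Number': ['111,222', '222']})
toy['Mobile Number'] = toy['Mobile Number'].str.split(',')
print(toy.explode('Mobile Number'))
#   Name Mobile Number
# 0    a           111
# 0    a           222   <- same index 0 as the original row
# 1    b           222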

I modified your dataframe to have a few options.

import pandas as pd

# sample data (a modified version of your dataframe, with a few extra cases)
c = ['Name','Email','Mobile Number']
d = [['John','joh***@gmail.com','1234567890,2345678901,6789012345'],
     ['kylie','k.ki**@yahoo.com','6789012345'],
     ['jon','null','1234567890'],
     ['kia','kia***@gmail.com','6789012345'],
     ['mac','mac***@gmail.com','2345678901,1098765432'],
     ['kfc','kfc***@gmail.com','6237778901,1098765432,3034045050'],
     ['pig','pig***@gmail.com','8007778001,8018765454,5054043030'],
     ['bil','bil***@gmail.com','1098765432'],
     ['jun','jun***@gmail.com','9098785434'],
     ['sam','b.sam**@gmail.com','4567890123']]
df = pd.DataFrame(d, columns=c)
print(df)

# split each number string into a list, then explode so every number gets its
# own row while keeping the index of the row it came from
temp = df.copy()
temp['Mobile Number'] = temp['Mobile Number'].apply(lambda x: x.split(','))
temp = temp.explode('Mobile Number')
#print(temp)

# keep only the rows whose index does not appear among the duplicated numbers
df2 = df[~df.index.isin(temp[temp['Mobile Number'].duplicated(keep=False)].index)]
print(df2)

The output of this is:

Original DataFrame:

    Name              Email                     Mobile Number
0   John   joh***@gmail.com  1234567890,2345678901,6789012345  # duplicated index: 1, 2, 3, 4
1  kylie   k.ki**@yahoo.com                        6789012345  # duplicated index: 0, 3
2    jon               null                        1234567890  # duplicated index: 0
3    kia   kia***@gmail.com                        6789012345  # duplicated index: 0, 1
4    mac   mac***@gmail.com             2345678901,1098765432  # duplicated index: 0, 5, 7
5    kfc   kfc***@gmail.com  6237778901,1098765432,3034045050  # duplicated index: 4, 7
6    pig   pig***@gmail.com  8007778001,8018765454,5054043030  # no duplicate; should output
7    bil   bil***@gmail.com                        1098765432  # duplicated index: 4, 5
8    jun   jun***@gmail.com                        9098785434  # no duplicate; should output
9    sam  b.sam**@gmail.com                        4567890123  # no duplicate; should output

The output of this will be the 3 rows (index: 6, 8, and 9):

  Name              Email                     Mobile Number
6  pig   pig***@gmail.com  8007778001,8018765454,5054043030
8  jun   jun***@gmail.com                        9098785434
9  sam  b.sam**@gmail.com                        4567890123

Since temp is not needed anymore, you can just delete it using del temp.
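
As a side note, the same logic can also be written in a chained form; this is just a sketch of the identical approach above, not a different method:

# split, explode, collect the indexes that hold duplicated numbers, then exclude them
dup_idx = (
    df.assign(**{'Mobile Number': df['Mobile Number'].str.split(',')})
      .explode('Mobile Number')
      .loc[lambda t: t['Mobile Number'].duplicated(keep=False)]
      .index
)
df2 = df[~df.index.isin(dup_idx)]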

Joe Ferndz
  • Thanks... it is working. Can you please explain the df2 syntax a little, why you use explode, and what the ~ symbol does? – rahul Mar 30 '21 at 11:34
  • with split, we are creating a list of values. With the use of explode, we are creating separate rows for each value. Ex: for `mac`, we have `2345678901,1098765432`. With explode, we create two rows for `mac` with same index number. – Joe Ferndz Mar 30 '21 at 12:14
  • The symbol `~` is to negate the condition. So we want to take all the indexes that are NOT found in the duplicated rows. We are looking for all indexes in `temp` and excluding them from `df` to get the rows that are unique. Hope my explanation was good enough. I will post a few links to help you understand these better – Joe Ferndz Mar 30 '21 at 12:16
  • I get it, but why do you use ~? As far as I know it is used to invert an array. Why do we need to invert the dataframe? – rahul Mar 30 '21 at 12:16
  • See this link to get more information about `~` [use of in and not in](https://stackoverflow.com/questions/19960077/how-to-filter-pandas-dataframe-using-in-and-not-in-like-in-sql) – Joe Ferndz Mar 30 '21 at 12:19
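
To make the `~` discussed in these comments concrete, here is a minimal sketch with made-up values:

import pandas as pd

s = pd.Series([10, 20, 30])
mask = s.isin([20, 30])   # boolean mask: [False, True, True]
print(s[mask])            # keeps rows where the mask is True  -> 20, 30
print(s[~mask])           # ~ inverts the mask, so this keeps  -> 10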

One possible solution is to do the following. Say your df is given by

Sr No.   Name              Email            Mobile Number
0     1.0   John   joh***@gmail.com  1234567890 , 2345678901
1     2.0  kylie   k.ki**@yahoo.com               6789012345
2     3.0    jon                NaN               1234567890
3     4.0    kia   kia***@gmail.com               6789012345
4     5.0    sam  b.sam**@gmail.com               4567890123

You can split your Mobile Number column into two (or more) columns mob1, mob2, ... and then drop duplicates:

df[['mob1', 'mob2']]= df["Mobile Number"].str.split(" , ", n = 1, expand = True) 
newdf = df.drop_duplicates(subset = ['mob1'],keep=False)

which returns

   Sr No. Name              Email Mobile Number        mob1  mob2
4     5.0  sam  b.sam**@gmail.com    4567890123  4567890123  None

EDIT

To handle the possible swapped order of numbers, one can extend the method by dropping duplicates from all created columns:

df[['mob1', 'mob2']]= df["Mobile Number"].str.split(" , ", n = 1, expand = True)
newdf = df.drop_duplicates(subset = ['mob1'],keep=False)
newdf = newdf.drop_duplicates(subset = ['mob2'],keep=False)

which returns:

Sr No.  Name             Email            Mobile Number        mob1  \
0     1.0  John  joh***@gmail.com  2345678901 , 1234567890  2345678901   

         mob2  
0  1234567890  

If there are individuals with more than two numbers, then as many columns as the maximum number of phone numbers need to be created.
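
If that maximum is not known up front, one possible sketch (using the same " , " separator and starting again from the original df, before mob1/mob2 were added by hand) is to let split create as many columns as the longest list needs:

# create as many mob columns as the row with the most numbers requires
mobs = df["Mobile Number"].str.split(" , ", expand=True)
mobs.columns = [f"mob{i + 1}" for i in range(mobs.shape[1])]
df = df.join(mobs)
# drop_duplicates can then be applied to each mob column in turn, as above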

  • what if the values were swapped for first row as `2345678901, 1234567890`, would this work? – Joe Ferndz Mar 30 '21 at 09:00
  • Also, what if there are 3 mobile phone numbers in the list `7774567890 , 2345678901, 6789012345` – Joe Ferndz Mar 30 '21 at 09:02
  • Do you know the maximum number of phone numbers per individual? I have encountered the same issue in a different context but haven't been able to find a better way to do this. – Serge de Gosson de Varennes Mar 30 '21 at 09:09
  • There can be a maximum of 4 numbers per individual. Also, in some cases one person can have 4 numbers and another can have 2, of which one matches the first person's numbers and the other may differ. Basically, if any single number matches another, I want to remove both of them. – rahul Mar 30 '21 at 09:18
  • @SergedeGossondeVarennes, I posted my answer. You should check it out. Maybe this will help solve your question as well. – Joe Ferndz Mar 30 '21 at 10:16