Using python pandas how to do some analysis for identify valid mobile numbers

Question

I have got daily MIS Fields are " Name,Number and Location ". Now, Total I have 100 Rows data daily basis and I have to first check that the numbers are in 10 digit or not, if Number fields are 1 to 9 digit , i have to remove that entry in my MIS,

only valid number like 10 digit and additional +91 before 10 digit number is valid. so, in excel i have to daily remove that invalid numbers and all its manually.

next i have to send it to valid number in 2 branches.50% valid number in 1st branch and 50% valid number in 2nd branch,

In 1st branch there are two persons, so again I have to send to both person equally valid number data entry. So, For example : if out of 100 data rows, total valid number is 60 , Then in 1st branch total 30 valid numbers occurs, and each two person get 15-15 numbers.

In 2nd branch there are three persons, valid 30 numbers occurs and each three get 10-10-10 numbers.

Any help it will grateful.

Here is my code.

import pandas as pd
import numpy as np
df = pd.read_csv('/home/desktop/Desktop/MIS.csv')
df
      Name        Number Location
0   Jayesh        980000     Pune
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
6   Naresh          9000     Pune


df['Number']=df['Number'].astype(str).apply(lambda x: np.where((len(x)<=10)))

Vivek Kalyanarangan · Answer 1 · 2018-12-29T08:05:58.330

3

Use -

df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})')

Output

0    False
1     True
2     True
3     True
4     True
5     True
6    False
Name: Number, dtype: bool

Update

Use this bool series to filter -

df_filtered = df[df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})', as_indexer=True)]


Name    Number  Location
1   Ajay    9890989090  Mumbai
2   Manish  9999999999  Pune
3   Vikram  919000000000    Pune
4   Prakash 919999999999    Mumbai
5   Rakesh  919999999998    Mumbai

edited Dec 29 '18 at 08:05

answered Dec 29 '18 at 07:48

Vivek Kalyanarangan

8,951
1
23
42

df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})') 0 [] 1 (None, None, 9890989090) 2 (None, None, 9999999999) 3 (None, 91, 9000000000) 4 (None, 91, 9999999999) 5 (None, 91, 9999999998) 6 [] Name: Number, dtype: object – Kiran Dec 29 '18 at 07:57
@Kiran what is your pandas version? – Vivek Kalyanarangan Dec 29 '18 at 07:58
Pandas version u'0.18.1' – Kiran Dec 29 '18 at 07:59
1

@Kiran Try `df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})', as_indexer=True)` if it doesn't consider upgrading your pandas version :-) – Vivek Kalyanarangan Dec 29 '18 at 08:01
Its working, now how to use True data rows for next analysis – Kiran Dec 29 '18 at 08:04
This is a better solution and faster computation time. +1 from me. – anky Dec 29 '18 at 08:06
This does not take into account typo's where there would be 11 digits, or a foreign number such as +331234567890. – gosuto Dec 29 '18 at 14:55
@jorijnsmit OP does not state those validations as a question. OP explicitly wants 10 digit numbers and Indian ISD code +91 – Vivek Kalyanarangan Dec 29 '18 at 17:37
Yes he does: "only valid number like 10 digit and additional +91 before 10 digit number is valid". I would argue that an 11 digit number (e.g. caused by a fat finger typo) is then invalid---not the first ten digits of it valid and the last digit not. – gosuto Dec 29 '18 at 17:46

score 3 · Answer 2 · answered Dec 29 '18 at 12:13

3

It's tempting to convert your numbers to strings and then perform your comparisons. However, this isn't necessary and will typically be inefficient. You can use regular Boolean comparisons with a direct algorithm:

m1 = (np.log10(df['Number']).astype(int) + 1) == 12
m2 = (df['Number'] // 10**10) == 91

df_filtered = df[m1 & m2]

print(df_filtered)

      Name        Number Location
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai

answered Dec 29 '18 at 12:13

jpp

159,742
34
281
339

I like your style of thinking but I am still leaning towards the regex solution. Why are you converting to int? Obviously we don't have the full dataset, but if there is both strings and ints (and maybe other types) in this column, does it really matter whether we convert all to string or the other way around? – gosuto Dec 29 '18 at 14:29
@jorijnsmit, Yes, it matters, string manipulations are less efficient generally. On top, regex is less efficient than Python `str` methods. So you're taking the most expensive of an already expensive option. For `int` conversion, see answers at [Length of an integer in Python](https://stackoverflow.com/questions/2189800/length-of-an-integer-in-python) for why it's necessary. – jpp Dec 29 '18 at 15:09

anky · Answer 3 · 2018-12-29T07:49:35.763

1

for assigning nan to str which doesnot start with 91 and is less than 10 digits:

df.Number[(~df.Number.str.startswith('91',na=False))&[len(df.Number[i])!= 10 for i in df.index]] = np.nan

edited Dec 29 '18 at 07:49

answered Dec 29 '18 at 07:26

anky

74,114
11
41
70

Traceback (most recent call last): File "", line 1, in df.Number[[len(df.Number[i]) == 10 for i in df.index]] TypeError: object of type 'numpy.int64' has no len() – Kiran Dec 29 '18 at 07:29
check edit, you have to assign the type as string to the column. – anky Dec 29 '18 at 07:30
code is work, but before 91 assign number give to NAN output, but this are valid numbers – Kiran Dec 29 '18 at 07:45
thanks it helps, now how to remove that NAN and etc. – Kiran Dec 29 '18 at 07:56
1

`df.dropna(subset=['Number']) ` – anky Dec 29 '18 at 07:58

Karn Kumar · Answer 4 · 2018-12-29T12:20:20.823

If the data corresponds likely as given in example then below should work for you as per your requirement.

DataFrame:

>>> df
      Name        Number Location
0   Jayesh        980000     Pune
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
6   Naresh          9000     Pune

Result:

using str.match:

>>> df[df.Number.astype(str).str.match(r'^(\d{10}|\d{12})$', as_indexer=True)]
      Name        Number Location
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai

OR

>>> df[df.Number.astype(str).str.match(r'^[0-9]{10,12}$', as_indexer=True)]
      Name        Number Location
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai

gosuto · Answer 5 · 2018-12-29T20:13:58.780

0

I suggest to use the following regex pattern:

^\+91\d{10}$|^91\d{10}$|^\d{10}$

This is assuming there are no spaces and/or brackets in your Number column. The pattern makes sure the digit part is always 10 long (no more no less) and lets it be preceded by either +91 or 91.

to build a filtered dataframe you would then:

dff = df[df['Number'].astype(str).str.match(r'^\+91\d{10}$|^91\d{10}$|^\d{10}$')]

edited Dec 29 '18 at 20:13

answered Dec 29 '18 at 14:52

gosuto

5,422
6
36
57

Using python pandas how to do some analysis for identify valid mobile numbers

5 Answers5