Remove digits in column of dataframe

Question

I scraped data to a dataframe that now looks like this:

    Name                Height(inches)
0   2 Snigdho Hasan     65
1   3 Michael Valentin  69
2   4 Andres Vargas     72
3   7 Jasper Diangco    70
4   9 Sayuj Zachariah   74
5   13 Omar Rezika      74
6   14 Gabriel Pjatak   75
7   16 Ryan Chabel      71

I removed the special characters, but need to remove that preceding index in front of each name.

df_final = pd.DataFrame()
df_final['Name'] = full_name
df_final['Name'].replace(r'\s+|\\n', ' ', regex = True, inplace = True)
df_final['Height(inches)'] = height[:min_length]

Any suggestions?

Do you mean the actual index of the dataframe, or are there digits in the name column? It's hard to tell what's in what column, the way you've displayed it. — CrazyChucky, Apr 10 '21 at 14:26
Just convert the name into an array separated by " " and then concatenate the values back into a string: `name_arr = name.split(" ")`, then `name = name_arr[1] + " " + name_arr[2]` — Markwin, Apr 10 '21 at 14:47

jmauricio · Answer 1 · 2021-04-10T20:58:45.003


You could try using regex:

import re
string_1 = '612156 jose mauricio'
re.sub(r"^[\d-]*\s*", '', string_1)

Your output would be:

'jose mauricio'

You could use this code above to define a function and apply in your dataframe like:

def remove_first_numbers(text):
    # Strip a leading run of digits/hyphens and the whitespace after it.
    # The .lstrip() removes any remaining leading whitespace, just in case.
    return re.sub(r"^[\d-]*\s*", '', text).lstrip()

df_final['Name'] = df_final['Name'].apply(remove_first_numbers)
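The same pattern can also be applied without `apply`, through the pandas string accessor. This is only a minimal sketch, assuming the column holds plain strings; the small dataframe below is a stand-in for the real `df_final`, with a few names copied from the question.

```python
import pandas as pd

# Stand-in for the scraped data (names taken from the question).
df_final = pd.DataFrame({
    "Name": ["2 Snigdho Hasan", "13 Omar Rezika", "16 Ryan Chabel"],
    "Height(inches)": [65, 74, 71],
})

# Remove only a leading run of digits/hyphens plus the whitespace after it.
# Digits elsewhere in the string are left alone because of the ^ anchor.
df_final["Name"] = df_final["Name"].str.replace(r"^[\d-]*\s*", "", regex=True)

print(df_final["Name"].tolist())
# ['Snigdho Hasan', 'Omar Rezika', 'Ryan Chabel']
```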

edited Apr 10 '21 at 20:58

answered Apr 10 '21 at 14:45


You don't need that lambda; `.apply(remove_first_numbers)` would do the same thing. Though in any event, it's best to use vectorized column-based operations when feasible, as they're [much more efficient](https://towardsdatascience.com/efficient-pandas-apply-vs-vectorized-operations-91ca17669e84) than `apply` (which is basically a for loop). (This might not be an issue for small dataframes, but as a bonus, they're often more readable too.) – CrazyChucky Apr 10 '21 at 15:41
Thank you, edited the code following your suggestions – jmauricio Apr 10 '21 at 20:59

CrazyChucky · Answer 2 · 2021-05-31T22:12:31.380

Well, you're already stripping out whitespace:

df_final['Name'].replace(r'\s+|\\n', ' ', regex = True, inplace = True)

To match a newline (\n), you don't need that double backslash as long as you're using a raw string literal (the r''). In a raw string, \\n matches a literal backslash followed by n, not a newline. (And \s already matches newlines anyway.)
Do you really want to replace \n with a space? I'd imagine you probably want it removed entirely. (Your example doesn't show the newlines, so it's hard to tell.)
PEP 8 recommends against spaces around the = of keyword arguments. Your code will still run just fine if you break this convention, but other programmers, at least, will have a harder time reading it.
inplace is also not exactly recommended, and may even be deprecated in the future. It seems like it would be more memory efficient, but in reality it often creates a copy under the hood anyway.
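The point about raw strings and backslashes can be checked with the standard `re` module alone. The following is a small sketch; the two sample strings are made up for illustration.

```python
import re

text_with_newline = "Omar\nRezika"   # contains a real newline character
text_with_literal = r"Omar\nRezika"  # contains a backslash followed by 'n'

# r"\n" reaches the regex engine as backslash + n, which it reads as "newline".
newline_hits = re.findall(r"\n", text_with_newline)

# r"\\n" is an escaped backslash followed by 'n': it matches only the
# literal two-character sequence, never an actual newline.
literal_hits_in_newline_text = re.findall(r"\\n", text_with_newline)
literal_hits_in_literal_text = re.findall(r"\\n", text_with_literal)

# \s already covers newlines, so r"\s+" on its own collapses them too.
collapsed = re.sub(r"\s+", " ", text_with_newline)

print(newline_hits, literal_hits_in_newline_text, literal_hits_in_literal_text)
print(collapsed)
```

So the `|\\n` branch in the original pattern only matters if the scraped text contains the two literal characters backslash and n rather than real newlines.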

Assuming full_name in your code is the series (column) of names, this will remove all digits, then also clear all whitespace (spaces and/or newlines) from the left and right, leaving you with just the first and last name:

df_final['Name'] = full_name.replace(r'\d+', '', regex=True).str.strip()

(That's an immediate fix, but depending on how the original data is formatted, I suspect there's probably a way to scrape your data into a dataframe that avoids this ahead of time.)
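To try the one-liner in isolation, here is a self-contained sketch. It assumes `full_name` is a pandas Series of strings; the values below are invented stand-ins modeled on the question's sample, with some stray whitespace added to show what `.str.strip()` does.

```python
import pandas as pd

# Assumed stand-in for the scraped names; the real full_name comes from the scraper.
full_name = pd.Series(["2 Snigdho Hasan", "13 Omar Rezika\n", " 16 Ryan Chabel"])

df_final = pd.DataFrame()
# Drop every run of digits, then trim whitespace (spaces, newlines) from both ends.
df_final["Name"] = full_name.replace(r"\d+", "", regex=True).str.strip()

print(df_final["Name"].tolist())
# ['Snigdho Hasan', 'Omar Rezika', 'Ryan Chabel']
```

Note that `\d+` is not anchored, so it removes digits anywhere in the string, not just the leading number. If a name could legitimately contain digits, a pattern anchored with `^` would be the safer choice.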
