I scraped data into a dataframe that now looks like this:

    Name                Height(inches)
0   2 Snigdho Hasan     65
1   3 Michael Valentin  69
2   4 Andres Vargas     72
3   7 Jasper Diangco    70
4   9 Sayuj Zachariah   74
5   13 Omar Rezika      74
6   14 Gabriel Pjatak   75
7   16 Ryan Chabel      71

I removed the special characters, but I still need to remove the number that precedes each name.

df_final = pd.DataFrame()
df_final['Name'] = full_name
df_final['Name'].replace(r'\s+|\\n', ' ', regex = True, inplace = True)
df_final['Height(inches)'] = height[:min_length]

Any suggestions?

CrazyChucky
  • Do you mean the actual index of the dataframe, or are there digits in the name column? It's hard to tell what's in what column, the way you've displayed it. – CrazyChucky Apr 10 '21 at 14:26
  • The digits in the name column. – Daria Gurova Apr 10 '21 at 14:37
  • Just split the name into a list on " " and then concatenate the parts you want back into a string: `name_arr = name.split(" ")`, then `name = name_arr[1] + " " + name_arr[2]`. – Markwin Apr 10 '21 at 14:47

2 Answers


You could try using regex:

import re
string_1 = '612156 jose mauricio'
re.sub(r"^[\d-]*\s*", '', string_1)

Your output would be:

'jose mauricio'

You could wrap the code above in a function and apply it to your dataframe like this:

def remove_first_numbers(text):
    # The .lstrip() removes any leading whitespace, just in case.
    return re.sub(r"^[\d-]*\s*", '', text).lstrip()

df_final['Name'] = df_final['Name'].apply(remove_first_numbers)
jmauricio
  • You don't need that lambda; `.apply(remove_first_numbers)` would do the same thing. Though in any event, it's best to use vectorized column-based operations when feasible, as they're [much more efficient](https://towardsdatascience.com/efficient-pandas-apply-vs-vectorized-operations-91ca17669e84) than `apply` (which is basically a for loop). (This might not be an issue for small dataframes, but as a bonus, they're often more readable too.) – CrazyChucky Apr 10 '21 at 15:41
  • Thank you, I edited the code following your suggestions. – jmauricio Apr 10 '21 at 20:59
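
The vectorized alternative mentioned in the comment above could look something like this. It is only a sketch: it reuses the pattern from `remove_first_numbers` with pandas' `Series.str.replace`, and the three rows are copied from the question as sample data.

```python
import pandas as pd

# Sample rows copied from the question's dataframe.
df_final = pd.DataFrame({
    "Name": ["2 Snigdho Hasan", "13 Omar Rezika", "16 Ryan Chabel"],
    "Height(inches)": [65, 74, 71],
})

# Same pattern as remove_first_numbers, applied to the whole column at once
# instead of calling a Python function once per row.
df_final["Name"] = df_final["Name"].str.replace(r"^[\d-]*\s*", "", regex=True)

print(df_final["Name"].tolist())
# ['Snigdho Hasan', 'Omar Rezika', 'Ryan Chabel']
```

Because the pattern is anchored with `^`, only digits at the start of each name are removed.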

Well, you're already stripping out whitespace:

df_final['Name'].replace(r'\s+|\\n', ' ', regex = True, inplace = True)
  • To match a newline (\n), you don't need that double backslash as long as you're using a raw string literal (the r''). As written, \\n matches a literal backslash followed by the letter n.
  • Do you really want to replace \n with a space? I'd imagine you probably want it removed entirely. (Your example doesn't show the newlines, so it's hard to tell.)
  • Spaces are not recommended around the = of keyword arguments (this is a PEP 8 convention). Your code will still run just fine if you break it, but other programmers, at least, will have a harder time reading your code.
  • inplace is also not exactly recommended, and may even be deprecated in the future. It seems like it would be more memory efficient, but in reality it often creates a copy under the hood anyway.
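
To make the first and last points concrete, here is a small sketch. The two strings in `names` are hypothetical scraped values (the question doesn't show the raw data): one contains a real newline, the other a literal backslash followed by n.

```python
import pandas as pd

# Hypothetical raw values: a real newline vs. a literal backslash + "n".
names = pd.Series(["2 Snigdho\nHasan", "3 Michael\\nValentin"])

# r"\n" matches an actual newline character ...
has_real_newline = names.str.contains(r"\n", regex=True)
# ... while r"\\n" matches a backslash followed by the letter n.
has_literal_backslash_n = names.str.contains(r"\\n", regex=True)

print(has_real_newline.tolist())         # [True, False]
print(has_literal_backslash_n.tolist())  # [False, True]

# Assign the result back instead of passing inplace=True.
cleaned = names.replace(r"\s+|\\n", " ", regex=True)
print(cleaned.tolist())
# ['2 Snigdho Hasan', '3 Michael Valentin']
```

If the scraped text only ever contains real newlines, `\s+` already covers them and the `|\\n` branch is unnecessary.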

Assuming full_name in your code is the series (column) of names, this will remove all digits, then also clear all whitespace (spaces and/or newlines) from the left and right, leaving you with just the first and last name:

df_final['Name'] = full_name.replace(r'\d+', '', regex=True).str.strip()

(That's an immediate fix, but depending on how the original data is formatted, I suspect there's probably a way to scrape your data into a dataframe that avoids this ahead of time.)
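
One caveat with `\d+` is that it removes digits anywhere in the name, not just the leading number. The sketch below compares it with an anchored pattern, `^\d+\s*`, which is a suggested variation rather than part of the answer above. The first two names come from the question; the third is a made-up value to show the difference.

```python
import pandas as pd

# First two names are from the question; the last one is hypothetical.
full_name = pd.Series(["2 Snigdho Hasan", "14 Gabriel Pjatak", "7 Jasper Diangco 3rd"])

# Removes every run of digits, wherever it appears.
all_digits = full_name.replace(r"\d+", "", regex=True).str.strip()

# Removes only a number at the very start of the string.
leading_only = full_name.replace(r"^\d+\s*", "", regex=True).str.strip()

print(all_digits.tolist())
# ['Snigdho Hasan', 'Gabriel Pjatak', 'Jasper Diangco rd']
print(leading_only.tolist())
# ['Snigdho Hasan', 'Gabriel Pjatak', 'Jasper Diangco 3rd']
```

For the data shown in the question both versions give the same result, so the choice only matters if names can contain digits elsewhere.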

CrazyChucky