0

I have a df that is 2,000 lines long, it is taking me about 5 minutes to map 2,000 IDs to the corresponding names, which is way too long. I'm trying to figure out a way to reduce the mapping time. One option I want to try is mapping the IDs as I make the dictionary- and not storing more than one line at a time in the dictionary.

Here is the process I'm using:

df_IDs=
studentID  grade  age
123        12th   18
432        11th   17
421        11th   16

And I want to replace the 'studentID' column with the student names that I can get from a mapping file (student_directory_df).

I created functions that will make a dictionary and map the IDs to the names:

dicts={}
def search_tool_student(df, column, accession):
    index=np.where(df[column].str.contains(accession, na=False))
    if len(index[0])==0:
        done=""
        pass
    else:
        done=df.iloc[index]
    return(done)

def create_dict(x): 
    result_df = search_tool_student(student_directory_df, 'studentID', x)
    if (len(result_df) == 0): 
        print('bad match found: '+ x)
    else: 
        student_name = result_df['name'].iloc[0]
        dicts[x] = student_name
    return(dicts)

def map_ID(df_IDs):
    studentIDs=df_IDs['studentID'] 
    new_dict=list(map(create_dict, studentIDs))[0]
    df_IDs['studentID']=df_IDs['studentID'].map(new_dict).fillna(df_IDs['studentID'])
    return(df_IDs)

desired output

studentID    grade  age
sally        12th   18
joe          11th   17
sarah        11th   16
youtube
  • 265
  • 1
  • 7
  • For each student in your `df_IDs`, you will do one search through `student_directory_df`, which takes linear time. Therefore, the overall search will take quadratic time (and str.contains is not fast to begin with.) – Nick ODell Aug 12 '22 at 02:23
  • If you have the student IDs and a list of `student ID, name` pairs, convert the second into a data frame and just merge. – ifly6 Aug 12 '22 at 03:13

1 Answers1

2

Pandas works much better when you think of DataFrames not as grids of cells to index into, but rather collections of columns (Series) than you can manipulate. The more you can offload into a single pandas function, the more likely it is to be optimized and more performant.

In this case, you want use Series.map as best you can. Series.map can take a function, dictionary-like mapping, or a Series.

Since you already have the student names and ids in a dataframe, I would recommend something like the following:

# create a series to do the mapping
id_to_name_map = student_directory_df.set_index('studentID')['name']
# pass it to map all at once
df_IDs['student_name'] = df_IDs['studentID'].map(id_to_name_map)

Hope that helps!

user5002062
  • 541
  • 3
  • 8