
I am new to Stack Overflow. Is there a way to speed this code up with vectorization? As I am not so advanced, how could I do it? I am currently working on a CSV dataset which I import using pandas. There are a couple of functions that create labels and visualize the data for biases. The code takes some time to run and I am looking for options to speed it up as much as possible. Thanks.

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import chi2_contingency
from tabulate import tabulate


def create_labels(self, sensitive):
    # Ask the user for a display label for each group in the sensitive column
    sensitive_label = {}
    for i in set(self.X_test[sensitive]):
        text = "Please Enter Label for Group " + str(i) + ": "
        label = input(text)
        sensitive_label[i] = label
    return sensitive_label


def representation(self, sensitive, labels, predictions):
    # Output is going to be a table
    full_table = self.X_test.copy()
    full_table['p'] = predictions
    full_table['t'] = self.y_test
    sens_df = {}
    for i in labels:
        # one table stored for female and one for male
        sens_df[labels[i]] = full_table[full_table[sensitive] == i]
    contigency_p = pd.crosstab(full_table[sensitive], full_table['t'])
    cp, pp, dofp, expectedp = chi2_contingency(contigency_p)  # p value of contingency table
    contigency_pct_p = pd.crosstab(full_table[sensitive], full_table['t'], normalize='index')
    sens_rep = {}
    for i in labels:
        sens_rep[labels[i]] = (self.X_test[sensitive].value_counts() / self.X_test[sensitive].value_counts().sum())[i]
    labl_rep = {}
    for i in labels:
        labl_rep[str(i)] = (self.y_test.value_counts() / self.y_test.value_counts().sum())[i]
    fig = make_subplots(rows=1, cols=2)
    for i in labels:
        fig.add_trace(go.Bar(
            showlegend=False,
            x=[labels[i]],
            y=[sens_rep[labels[i]]]), row=1, col=1)
        fig.add_trace(go.Bar(
            showlegend=False,
            x=[str(i)],
            y=[labl_rep[str(i)]],
            marker_color=['orange', 'blue'][i]), row=1, col=2)
    c, p, dof, expected = chi2_contingency(contigency_p)
    cont_table = tabulate(contigency_pct_p.T, headers=labels.values(), tablefmt='fancy_grid')
    # sens_df: datasets split by the sensitive labels
    return cont_table, sens_df, fig, p
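For what it's worth, the two proportion loops (`sens_rep` and `labl_rep`) are the easiest candidates for vectorization: `value_counts(normalize=True)` returns the proportions directly in one call. A minimal sketch with made-up data (the `'sex'` column, the 0/1 outcome, and the label mapping are stand-ins for the real `X_test`, `y_test`, and `labels`):

```python
import pandas as pd

# Made-up stand-ins for self.X_test, self.y_test and the labels dict
X_test = pd.DataFrame({'sex': [0, 0, 1, 1, 1, 0]})
y_test = pd.Series([1, 0, 1, 1, 0, 0])
labels = {0: 'female', 1: 'male'}

# Loop-free equivalents of the sens_rep / labl_rep dictionaries:
# value_counts(normalize=True) yields proportions, and .rename()
# maps the raw group codes to their display labels in one pass.
sens_rep = X_test['sex'].value_counts(normalize=True).rename(labels).to_dict()
labl_rep = y_test.value_counts(normalize=True).rename(str).to_dict()

print(sens_rep)  # {'female': 0.5, 'male': 0.5}
print(labl_rep)  # {'1': 0.5, '0': 0.5}
```

This replaces a Python-level loop over groups with a single pandas operation, which also avoids recomputing `value_counts()` once per group.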
  • First step is to [profile the code](https://stackoverflow.com/q/582336/1609514) to determine which parts are taking up the most time. Without an executable version of your code with a data sample it's hard to know how to speed up. – Bill Jul 23 '22 at 19:24
  • It seems to me this question is more suited to be asked in the [Code Review Forum](https://codereview.stackexchange.com/). Code Review is a question and answer site for peer programmer code reviews. Please read the relevant guidance related to how to properly ask questions on this site before posting your question. – itprorh66 Jul 23 '22 at 19:45
  • Bill's point is really important here. I'm not sure that this code would be accepted on code review, as posting a large workflow with the question being "please make my full workflow faster" isn't valid anywhere. Identify the slowest part of your workflow, or perhaps just a single line that you think could be vectorized. Read the guide to creating a [mre] in full, and check out this [guide which is specific to pandas](/q/20109391/). Feel free to ask again when you're ready to work on a specific problem with [specific, narrowly-defined goals](//meta.stackoverflow.com/q/412875) :) – Michael Delgado Jul 23 '22 at 22:03
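The profiling step suggested in the comments can be sketched with the standard library's `cProfile`; the `slow_workload` function below is a placeholder for whatever call you want to measure (e.g. the `representation` method with your real arguments):

```python
import cProfile
import io
import pstats

# Placeholder for the code being measured; swap in your own call,
# e.g. model.representation(sensitive, labels, predictions).
def slow_workload():
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_workload()
profiler.disable()

# Report the five entries with the highest cumulative time; those are
# the parts of the workflow worth vectorizing first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

Only once the report points at a specific hot spot does it make sense to ask how to vectorize that one piece.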

0 Answers