
I have been working on Python code that reads a CSV file with about 800 rows and around 17,000 columns. I would like to check each entry in the CSV file and, if the number is bigger or smaller than a certain value, assign a default value. I used pandas and worked with DataFrames, apply, and lambda functions. It takes 172 minutes to go through all the entries in the CSV file. Is that normal? Is there a faster way to do this? I am using Python 2.7. I don't know if it helps, but I am running it on a Windows 10 machine with 32 GB of RAM. Thanks in advance for the help.

The code is attached below.


def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    modified_dataframe = pd.DataFrame()
    for k in col:
        temp_data = some_dataframe.apply(lambda x: check_for_range(x[k]), axis=1).tolist()
        dictionary = {}
        dictionary[str(k)] = temp_data
        temp_frame = pd.DataFrame(dictionary)
        modified_dataframe = pd.concat([modified_dataframe, temp_frame], axis=1)
    return modified_dataframe

def check_for_range(var):
    var = int(var)
    try:
        if var == 0:
            return 0
        if var == 1 or var == 4:
            return 1
        if var == 2 or var == 3 or var == 5 or var == 6:
            return 2
    except:
        print('error')

def get_req_colm(df):
    col = list(df)
    try:
        col.remove('index/Sample count')
        col.remove('index / Sample')
        col.remove('index')
        col.remove('count')
    except:
        pass
    return col

df_after_doing_something = do_something(some_dataframe)
df_after_doing_something.to_csv(output_folder + '\\df_after_doing_something.csv', index=False)

Kishore
  • Your indentation is mixed up. Can you please fix it? – erip May 21 '20 at 13:19
  • Sorry about that. Is it better now? – Kishore May 21 '20 at 13:22
  • There are a whole bunch of issues in here. In `check_for_range`, your try block can never throw (but the thing outside of it will). I'm not sure what the 'col.remove' bit is doing. Why don't you apply the functions across all columns with the `apply`? – erip May 21 '20 at 13:31
  • @erip Can I apply to all the columns at once? I am applying to each column at a time using the for loop right now. – Kishore May 21 '20 at 13:35

1 Answer


Using pandas for CSV data is efficient, but your code is not: it calls apply row by row, once for every column. It should be faster if you try the code given below.

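def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    # Pull the required columns out as a single integer NumPy array
    values = some_dataframe[col].to_numpy().astype(int)
    # Start from zeros, so 0 maps to 0; values outside 0-6 also stay 0 here
    # (check_for_range in the question returns None for them)
    np_array = np.zeros_like(values)
    # Element-wise masks replace Python's `or`, which fails on arrays
    np_array[np.isin(values, [1, 4])] = 1
    np_array[np.isin(values, [2, 3, 5, 6])] = 2
    modified_dataframe = pd.DataFrame(np_array, columns=[str(k) for k in col])
    return modified_dataframe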

def get_req_colm(df):
    # Skip the bookkeeping columns when present. A single try/except around
    # several remove() calls stops at the first name that is missing.
    skip = {'index/Sample count', 'index / Sample', 'index', 'count'}
    return [c for c in df if c not in skip]

Don't forget to import NumPy:

import numpy as np
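Because the values in the question are small non-negative integers (0 to 6), another way to remap every column at once is a lookup table indexed by the value itself. The following is only a minimal sketch under that assumption: the toy frame, its column names, and the variable names are hypothetical stand-ins for the real CSV, and `to_numpy` needs pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
df = pd.DataFrame({
    'index': [10, 11, 12],
    'a': [0, 1, 2],
    'b': [4, 5, 6],
})

# Position i of the table holds the replacement for the value i:
# 0 -> 0, 1 and 4 -> 1, 2, 3, 5 and 6 -> 2
lookup = np.array([0, 1, 2, 2, 1, 2, 2])

# Keep only the data columns, then remap them all in one indexing step
cols = [c for c in df.columns if c != 'index']
values = df[cols].to_numpy().astype(int)
result = pd.DataFrame(lookup[values], columns=cols)
print(result)
```

This assumes every entry lies between 0 and 6. A larger value raises an IndexError and a negative one silently wraps around, so validate the data first if that cannot be guaranteed.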

If this is unclear, work through a NumPy tutorial first. Otherwise, the linked question below should help:

Replacing elements in a numpy array when there are multiple conditions
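For multiple conditions, one commonly used NumPy tool is `np.select`, which takes a list of boolean masks and a matching list of replacement values. Here is a small sketch using the mapping from the question. The sample array and variable names are made up for illustration, and using 0 as the default for unmatched values is an assumption.

```python
import numpy as np

# Made-up sample values; 9 is included to show the default branch
values = np.array([[0, 1, 2, 3],
                   [4, 5, 6, 9]])

# Combine element-wise comparisons with | (not Python's `or`)
conditions = [
    (values == 1) | (values == 4),
    np.isin(values, [2, 3, 5, 6]),
]
choices = [1, 2]

# Entries matching no condition get the default value
mapped = np.select(conditions, choices, default=0)
print(mapped)
```

Each condition is evaluated once over the whole array, so no Python-level loop over rows or columns is needed.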

ajay chawla