I have been working on some Python code that reads a CSV file with 800-odd rows and around 17,000 columns. I need to check each entry in the file against certain values (whether it is bigger or smaller than them) and, if it is, assign a default value. Using pandas DataFrames with apply and lambda functions, it takes 172 minutes to get through every entry. Is that normal, and is there a faster way to do this? I am using Python 2.7 on a Windows 10 machine with 32 GB of RAM, in case that helps. Thanks in advance for the help.
The code is attached below.
import pandas as pd

def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    modified_dataframe = pd.DataFrame()
    for k in col:
        # one row-wise apply over the whole frame per column
        temp_data = some_dataframe.apply(lambda x: check_for_range(x[k]), axis=1).tolist()
        temp_frame = pd.DataFrame({str(k): temp_data})
        modified_dataframe = pd.concat([modified_dataframe, temp_frame], axis=1)
    return modified_dataframe
def check_for_range(var):
    # map the raw values onto 0/1/2; int(var) sits inside the try so a
    # non-numeric entry is caught instead of raising before the except
    try:
        var = int(var)
        if var == 0:
            return 0
        if var == 1 or var == 4:
            return 1
        if var == 2 or var == 3 or var == 5 or var == 6:
            return 2
    except ValueError:
        print('error')
def get_req_colm(df):
    # drop the bookkeeping columns if present; list.remove() raises on
    # the first missing name, so test membership instead of one big try
    col = list(df)
    for name in ('index/Sample count', 'index / Sample', 'index', 'count'):
        if name in col:
            col.remove(name)
    return col
df_after_doing_something = do_something(some_dataframe)
df_after_doing_something.to_csv(output_folder + '\\df_after_doing_something.csv', index=False)
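For what it's worth, I tried to sketch what a vectorised version might look like. This one uses DataFrame.replace with the same value mapping as check_for_range (just a sketch, untested on the full file; the astype(int) conversion is my assumption in case the CSV entries come in as strings):

import pandas as pd

# same mapping that check_for_range implements:
# 0 -> 0, 1 or 4 -> 1, and 2/3/5/6 -> 2
VALUE_MAP = {0: 0, 1: 1, 4: 1, 2: 2, 3: 2, 5: 2, 6: 2}

def do_something_vectorised(some_dataframe):
    col = get_req_colm(some_dataframe)
    # astype(int) mirrors the int(var) call; replace() then maps every
    # entry in one shot instead of calling a Python function per row
    return some_dataframe[col].astype(int).replace(VALUE_MAP)

Would that be the right direction, or is there something even faster?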
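And in case the real rule ends up being a plain threshold check (as I described above: anything bigger or smaller than a value gets a default), I assume numpy.where could do it over the whole frame at once; the lower/upper bounds and the defaults here are made-up placeholders:

import numpy as np
import pandas as pd

def apply_defaults(df, lower, upper, low_default, high_default):
    # the comparisons run on the underlying array in C, not per row
    values = df.values
    values = np.where(values < lower, low_default, values)
    values = np.where(values > upper, high_default, values)
    return pd.DataFrame(values, columns=df.columns, index=df.index)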