I have been working on some Python code that reads a CSV file with 800-odd rows and around 17,000 columns. I need to check each entry in the file against certain values (whether it is bigger or smaller than them) and, if it is, assign a default value. Using pandas DataFrames with apply and lambda functions, it takes 172 minutes to get through every entry. Is that normal, and is there a faster way to do this? I am using Python 2.7 on a Windows 10 machine with 32 GB of RAM, in case that helps. Thanks in advance for the help.
The code is attached below.
import pandas as pd

def do_something(some_dataframe):
    col = get_req_colm(some_dataframe)
    modified_dataframe = pd.DataFrame()
    for k in col:
        # one row-wise apply over the whole frame per column
        temp_data = some_dataframe.apply(lambda x: check_for_range(x[k]), axis=1).tolist()
        temp_frame = pd.DataFrame({str(k): temp_data})
        modified_dataframe = pd.concat([modified_dataframe, temp_frame], axis=1)
    return modified_dataframe
def check_for_range(var):
    # map the raw values onto 0/1/2; int(var) sits inside the try so a
    # non-numeric entry is caught instead of raising before the except
    try:
        var = int(var)
        if var == 0:
            return 0
        if var == 1 or var == 4:
            return 1
        if var == 2 or var == 3 or var == 5 or var == 6:
            return 2
    except ValueError:
        print('error')
def get_req_colm(df):
    # drop the bookkeeping columns if present; list.remove() raises on
    # the first missing name, so test membership instead of one big try
    col = list(df)
    for name in ('index/Sample count', 'index / Sample', 'index', 'count'):
        if name in col:
            col.remove(name)
    return col
df_after_doing_something = do_something(some_dataframe)
df_after_doing_something.to_csv(output_folder + '\\df_after_doing_something.csv', index=False)
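For what it's worth, I tried to sketch what a vectorised version might look like. This one uses DataFrame.replace with the same value mapping as check_for_range (just a sketch, untested on the full file; the astype(int) conversion is my assumption in case the CSV entries come in as strings):

import pandas as pd

# same mapping that check_for_range implements:
# 0 -> 0, 1 or 4 -> 1, and 2/3/5/6 -> 2
VALUE_MAP = {0: 0, 1: 1, 4: 1, 2: 2, 3: 2, 5: 2, 6: 2}

def do_something_vectorised(some_dataframe):
    col = get_req_colm(some_dataframe)
    # astype(int) mirrors the int(var) call; replace() then maps every
    # entry in one shot instead of calling a Python function per row
    return some_dataframe[col].astype(int).replace(VALUE_MAP)

Would that be the right direction, or is there something even faster?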
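And in case the real rule ends up being a plain threshold check (as I described above: anything bigger or smaller than a value gets a default), I assume numpy.where could do it over the whole frame at once; the lower/upper bounds and the defaults here are made-up placeholders:

import numpy as np
import pandas as pd

def apply_defaults(df, lower, upper, low_default, high_default):
    # the comparisons run on the underlying array in C, not per row
    values = df.values
    values = np.where(values < lower, low_default, values)
    values = np.where(values > upper, high_default, values)
    return pd.DataFrame(values, columns=df.columns, index=df.index)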