
Let's say I have an Excel file with 100k rows. My code reads it row by row and performs a computation on each row (including a benchmark of how long that row takes). The code then produces an array of results with 100k rows. My Python code works, but it is inefficient: it takes several days to run, and the benchmark results get worse over time, I suspect because of high memory consumption. Please see my attempt and let me know how to improve it.

My code accumulates everything in results=[] and only writes it out at the end. Also, at the start I store the whole Excel file in worksheet. I think this causes memory issues, since my Excel file has very large text in its cells (not only numbers).

import xlrd
import pandas as pd

ExcelFileName = 'Data.xlsx'
workbook = xlrd.open_workbook(ExcelFileName)
worksheet = workbook.sheet_by_name("Sheet1")  # the sheet we need to read

num_rows = worksheet.nrows  # number of rows
num_cols = worksheet.ncols  # number of columns
results = []

for curr_row in range(1, num_rows):  # skip the header row
    row_data = []
    for curr_col in range(num_cols):
        data = worksheet.cell_value(curr_row, curr_col)  # read the data in the current cell
        row_data.append(data)

    #### do computation on row_data here ####
    ## append this row's result, e.g. results.append(...)

### save the results array in a DataFrame and then write it to Excel
df = pd.DataFrame(results)
writer = pd.ExcelWriter("XX.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name='results')
writer.save()

What I would like is to read the first row from Excel into memory, do the calculation, get the result and save it to Excel, then move on to the second row, and so on, without keeping memory so busy. That way I would not have a results array containing 100k rows, since I would clear it on each loop iteration.

  • Have you looked into ```pd.read_excel```? It's able to read the entire Excel file for you and create a df based on the data; then you can do the calculation on all the rows at the same time. – Ben Pap Apr 25 '19 at 19:11
  • Thanks. I would like to calculate each row individually, if that makes sense. – MWH Apr 25 '19 at 19:17
  • I understand; however, pandas specializes in vectorized functions, which would save you a ton of time. Personally, unless each row needs a different function, it doesn't make sense to use a loop. – Ben Pap Apr 25 '19 at 19:22
  • Thanks, I will look at it. Yes, the problem is that each row is different, so I have to loop over them one by one. – MWH Apr 25 '19 at 19:24

2 Answers


To solve the issue of loading the input file into memory, I would look into using a generator. A generator works by iterating over any iterable, but only returns the next element instead of the entire iterable. In your case, it would yield only the next row from your .xlsx file, instead of keeping the entire file in memory.
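For illustration, here is a minimal sketch of such a generator. It assumes openpyxl's read-only mode rather than xlrd (my assumption, since xlrd loads the whole sheet up front, while openpyxl can stream rows from an .xlsx one at a time); adapt the file and sheet names to your data.

from openpyxl import load_workbook

def iter_excel_rows(path, sheet_name="Sheet1"):
    """Yield one row of cell values at a time instead of loading the whole sheet."""
    wb = load_workbook(path, read_only=True)
    try:
        for row in wb[sheet_name].iter_rows(values_only=True):
            yield row
    finally:
        wb.close()

# Usage: only the current row is held in memory on each iteration.
for row_data in iter_excel_rows("Data.xlsx"):
    pass  # do your per-row computation here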

However, this will not solve the issue of having a very large "results" array. Unfortunately, updating a .csv or .xlsx file as you go will take a very long time, significantly longer than updating the object in memory. There is a trade-off here: you can either use up lots of memory by appending to your "results" array and writing it all to a file at the end, or you can update a file on disk with the results as you go, at the cost of much slower execution.
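As a rough sketch of the "write as you go" side of that trade-off, you could append one line per row to a .csv with the standard library, so results never accumulate in memory. compute_row() below is a placeholder for your own per-row computation and timing, not part of your original code.

import csv
import time
import xlrd

def compute_row(row_data):
    start = time.perf_counter()
    result = len(row_data)  # placeholder computation
    elapsed = time.perf_counter() - start
    return result, elapsed

workbook = xlrd.open_workbook("Data.xlsx")
worksheet = workbook.sheet_by_name("Sheet1")

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["row", "result", "elapsed_seconds"])
    for curr_row in range(1, worksheet.nrows):
        row_data = worksheet.row_values(curr_row)
        writer.writerow([curr_row, *compute_row(row_data)])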

  • Thanks. The issue is that my calculation involves a benchmark, so keeping the entire results array will affect the benchmark results as memory usage goes up (the calculation that produces the results consumes memory too and is not trivial). An accurate benchmark might be more important than overall execution time. – MWH Apr 25 '19 at 19:32

For this kind of operation you are probably better off loading the data directly into a DataFrame. There are several methods for dealing with large files in pandas, detailed here: How to read a 6 GB csv file with pandas. Which method you choose will have a lot to do with the type of computation you need to do; since you seem to be processing one row at a time, using chunks will probably be the way to go.
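A minimal sketch of the chunked approach: read_excel does not take a chunksize argument, so this assumes the workbook has first been exported to Data.csv; process_chunk() is a placeholder for your own computation.

import pandas as pd

def process_chunk(chunk):
    # placeholder: replace with your real per-row computation
    return chunk.apply(lambda row: len(str(row.iloc[0])), axis=1)

results = []
for chunk in pd.read_csv("Data.csv", chunksize=10_000):
    # each chunk is an ordinary DataFrame holding up to 10,000 rows
    results.append(process_chunk(chunk))

pd.concat(results).to_frame("result").to_excel("XX.xlsx", sheet_name="results")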

Pandas has a lot of built-in optimizations for operations on large sets of data, so most of the time you will see better performance working with data in a DataFrame or Series than you would in pure Python. For the best performance, consider vectorizing your function or using the apply method instead of an explicit loop, which lets pandas apply the function to all rows in the most efficient way possible.
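For example, here is a short sketch of the apply route, assuming the per-row logic can be written as a single function of the row; per_row() below is just a placeholder for your real computation.

import pandas as pd

df = pd.read_excel("Data.xlsx", sheet_name="Sheet1")

def per_row(row):
    # row is a pandas Series holding one row's cell values;
    # replace this placeholder with your real computation
    return len(str(row.iloc[0]))

df["result"] = df.apply(per_row, axis=1)
df.to_excel("XX.xlsx", sheet_name="results")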