Optimize data conversion program to avoid memory error

Question

I am very new to Python and have idea how to optimize this program in order to avoid a memory error.

I am trying to read data from two workbooks: raw_data and mapping. I want to convert the raw_data into a new spreadsheet using the mapping document to transform the data. So I load in the workbooks, create data dictionaries out of the mapping data and start converting. However, I run into a memory error.

Is there any way to optimize the code below to avoid this error?

import openpyxl
from openpyxl.utils import get_column_letter

mapping = openpyxl.load_workbook(r'C:...\mapping.xlsx') #load mapping doc
wb = openpyxl.load_workbook(r'C:...\raw_data.xlsx') #load raw data


sheet = wb.active #look at the active sheet in the raw data file
user_map_raw = mapping.get_sheet_by_name('User ID Mapping') #for user ids
item_map_raw = mapping.get_sheet_by_name('Item ID Mapping') #for item ids
...other mappings here

def load(sheet): 

    user_dict = {}
    print "creating user dictionary..."
    for row in range(1, user_map_raw.max_row+1):
        old_name = user_map_raw['A' + str(row)].value #old user name
        new_name = user_map_raw['B' + str(row)].value #new user name
        user_dict[old_name] = new_name #old name is key for the new name

    item_dict = {}
    print "creating item id dictionary..."
    for row in range(1, item_map_raw.max_row+1):
        old_item = item_map_raw['A' + str(row)].value #old item id
        new_item = item_map_raw['B' + str(row)].value #new item id
        item_dict[old_item] = new_item #old item id is key for new item id

    raw = [] #empty list to store data before writing to new file
    for row in range(2, sheet.max_row+1): #loop thru raw data and map
        print "loading row %s" % row
        user_ID = user_dict[sheet['A' + str(row)].value]
        item_type = sheet['B' + str(row)].value
        item_ID = item_dict[sheet['C' + str(row)].value]
        ...other transformations here
        add = [user_ID, item_type, item_ID, ...]
        raw.append(add) #add transformed data to list

    new = openpyxl.Workbook() #create new workbook
    output = new.active #select the active sheet
    for i in range(len(raw)): #loop through transformed data list
        "print writing row %s" %i
        for j in range(len(raw[i])): #write to new sheet
            output[get_column_letter(j+1) + str(i+1)] = raw[i][j]
    new.save('new_doc.xlsx')

load(sheet)

score 1 · Accepted Answer · edited May 23 '17 at 10:29

The primary optimization would be to avoid loading the entire primary workbook into memory as well as avoiding storing the whole result in memory before writing. There are write_only and read_only modes for openpyxl workbooks that can save a lot of memory at runtime by implementing optimized representations with reduced features and support for iterators. Since you are writing a new file instead of editing in place these modes can make a big difference.

wb = openpyxl.load_workbook(r'C:...\raw_data.xlsx', read_only=True) 
sheet = wb.active

# mapping related code...

from openpyxl.writer.write_only import WriteOnlyCell
wb = openpyxl.Workbook(write_only=True) #create new workbook
ws = new.create_sheet()

for row in sheet.iter_rows(row_offset=1):
    for i, cell in enumerate(row):
        if i = 0: #A
            user_ID = WriteOnlyCell(ws, user_dict[cell.value])
        elif i = 1: #B
            item_type = WriteOnlyCell(ws, cell.value)
        elif i = 2: #C
            item_ID = WriteOnlyCell(ws, item_dict[cell.value])
        else:
            break
    ws.append([user_ID, item_type, item_ID])

wb.save('new_doc.xlsx')

Have to iterate over the cells since it's a generator so can't use subscript. Seems clunky, but I'm tired.

For a small savings, if you are using Python 2.x, whenever you use the range function a list is created in memory that's as big as your range, which if you have a very large spreadsheet, could fill up your RAM. In your case you can probably use xrange which generates each iteration dynamically to save some memory.

Hey thanks - this works. My only question: why did you use row_offset = 2? — kjmerf, Feb 27 '17 at 16:18
In the question example you loop through the main sheet skipping a couple rows and starting at row 2: `range(2, sheet.max_row+1):`. Hopefully `row_offset` does the same thing. Hard to tell from the openpyxl docs. You may need to verify you are getting all the rows you want. — systemjack, Feb 27 '17 at 17:18
Right, I think it should be row_offset = 1. But anyway working now. Thanks for all your effort on this. Key takeaways: read the file as read-only and write it as write-only. Use iter_rows to extract cell values — kjmerf, Feb 27 '17 at 17:25

score 1 · Answer 2 · answered Feb 23 '17 at 09:43

1

You can use read-only mode when reading your source files and write-only mode to write the results. This will minimise memory use.

answered Feb 23 '17 at 09:43

Charlie Clark

18,477
4
49
55

Optimize data conversion program to avoid memory error

2 Answers2