I've just had a very similar issue. I had a bunch of xlsx files, each containing a single worksheet with between 2 and 4 million rows.
First, I extracted the relevant XML files from each workbook (using a bash script):
f='<xlsx_filename>'
unzip -p $f xl/worksheets/sheet1.xml > ${f%%.*}.xml
unzip -p $f xl/sharedStrings.xml > ${f%%.*}_strings.xml
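If you would rather stay in Python than shell out to unzip, the same extraction can be done with the standard zipfile module. This is only a sketch of that alternative; it assumes every workbook keeps its data in xl/worksheets/sheet1.xml (as mine did) and writes the same <name>.xml / <name>_strings.xml pair as the bash snippet above:
import glob
import shutil
import zipfile

for xlsx in glob.glob('*.xlsx'):
    stem = xlsx.rsplit('.', 1)[0]
    with zipfile.ZipFile(xlsx) as zf:
        # Stream each member straight to disk so the large sheet XML never
        # has to fit in memory.
        with zf.open('xl/worksheets/sheet1.xml') as src, open(stem + '.xml', 'wb') as dst:
            shutil.copyfileobj(src, dst)
        with zf.open('xl/sharedStrings.xml') as src, open(stem + '_strings.xml', 'wb') as dst:
            shutil.copyfileobj(src, dst)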
This places all the extracted XML files in the working directory. Then, I used Python to convert the XML to CSV. The code below makes use of the ElementTree.iterparse() method. However, it only works if every element gets cleared after it has been processed (see also here):
import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as et

base_directory = '<path/to/files>'
file = '<xml_filename>'

os.chdir(base_directory)

def read_file(base_directory, file):
    ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

    # Load the shared strings table into a list; cells of type 's' only
    # store an index into this table.
    print('Working on strings file.')
    string_it = et.parse(base_directory + '/' + file[:-4] + '_strings.xml').getroot()
    strings = []
    for st in string_it:
        strings.append(st[0].text)

    print('Working on data file.')
    iterate_file = et.iterparse(base_directory + '/' + file, events=['start', 'end'])
    print('Iterator created.')

    rows = []
    curr_column = ''
    curr_column_elem = None
    curr_row_elem = None
    count = 0

    for event, element in iterate_file:
        ### Row element started: store the finished row and begin a new one
        if event == 'start' and element.tag == ns + 'row':
            count += 1
            print(' ' * 20, end='\r')  # blank out the previous progress line
            print(str(count) + ' rows done', end='\r')
            if curr_row_elem is not None:
                rows.append(curr_row_elem)
            curr_row_elem = []
            element.clear()

        if curr_row_elem is not None:
            ### Column element started
            if event == 'start' and element.tag == ns + 'c':
                curr_column_elem = element
                curr_column = ''

            ### Column element ended
            if event == 'end' and element.tag == ns + 'c':
                curr_row_elem.append(curr_column)
                # Clearing processed elements is what keeps memory usage flat.
                element.clear()
                curr_column_elem.clear()

            ### Value element ended
            if event == 'end' and element.tag == ns + 'v':
                ### Replace the shared-string index with the actual string if necessary
                if curr_column_elem.get('t') == 's':
                    curr_column = strings[int(element.text)]
                else:
                    curr_column = element.text

    ### The last row is never followed by another 'row' start, so append it here
    if curr_row_elem is not None:
        rows.append(curr_row_elem)

    df = pd.DataFrame(rows).replace('', np.nan)
    ### Use the first spreadsheet row as the header
    df.columns = df.iloc[0]
    df = df.drop(index=0)

    ### Export
    df.to_csv(file[:-4] + '.csv', index=False)

read_file(base_directory, file)
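To run this over everything produced by the extraction step, a small loop like the following should work; it is only a sketch that assumes the <name>.xml / <name>_strings.xml naming convention from above and skips the strings files, which read_file() loads itself:
import glob
import os

for data_file in sorted(glob.glob(os.path.join(base_directory, '*.xml'))):
    name = os.path.basename(data_file)
    if name.endswith('_strings.xml'):
        continue  # shared-strings files are read by read_file() itself
    read_file(base_directory, name)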
Maybe this helps you or anyone else running into this issue. It is still relatively slow, but it worked a lot better than a plain parse(), which loads the entire tree into memory.