
I need to read a 300 GB xlsx file with ~10^9 rows. I need to get the values from one column. The file consists of 8 columns. I want to do this as fast as possible.

from openpyxl import load_workbook
import datetime

# read_only mode streams rows instead of loading the whole sheet into memory
wb = load_workbook(filename=r"C:\Users\Predator\Downloads\logs_sample.xlsx",
                   read_only=True)
ws = wb.worksheets[0]

count = 0
emails = []
p = datetime.datetime.today()
for row in ws.rows:
    count += 1
    val = row[8].value  # 0-based index; adjust to whichever column holds the emails
    if count >= 200000: break
    emails.append(val)
q = datetime.datetime.today()
res = (q - p).total_seconds()
print("time: {} seconds".format(res))
emails = emails[1:]  # drop the header value

Right now the loop needs ~16 seconds to read 200,000 rows, and the time complexity is O(n). So 10^6 rows would be read in roughly 1.5 minutes. But we have 10^9 rows, and for that we would have to wait 10^3 * 1.5 = 1500 minutes = 25 hours. That's far too long... Please help me solve this problem.
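A possible variant worth timing: openpyxl's iter_rows() can be restricted to a single column and return plain values (values_only requires openpyxl >= 2.6), and writing straight to a file avoids holding 10^9 strings in memory. A minimal sketch, with the email column index (8, 1-based) and the output filename only assumed:

from openpyxl import load_workbook

wb = load_workbook(filename=r"C:\Users\Predator\Downloads\logs_sample.xlsx",
                   read_only=True)
ws = wb.worksheets[0]

# Stream only the wanted column (1-based index; 8 is assumed here)
# and write the values straight to disk instead of a giant list.
with open("emails.txt", "w", encoding="utf-8") as out:
    for (email,) in ws.iter_rows(min_col=8, max_col=8, values_only=True):
        if email is not None:
            out.write(str(email) + "\n")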

  • Very interesting, as the maximum for a single Excel .xlsx worksheet is only 2^20, or 1,048,576 rows, not a billion! From the title I thought you were mistaking a csv file, which is not an Excel file and hence has no such limit. But you do reference an .xlsx. Would love to see this epic workbook. Maybe the rows are spread across multiple sheets? Maybe the file was built in code and not with the Excel.exe program? – Parfait Nov 12 '17 at 15:38

3 Answers


I've just had a very similar issue. I had a bunch of xlsx files containing a single worksheet with between 2 and 4 million rows.

First, I went about extracting the relevant xml files (using a bash script):

f='<xlsx_filename>'
unzip -p $f xl/worksheets/sheet1.xml > ${f%%.*}.xml
unzip -p $f xl/sharedStrings.xml > ${f%%.*}_strings.xml

This places all the xml files in the working directory. Then I used Python to convert the xml to csv. The code makes use of the ElementTree.iterparse() method; however, it only works if every element gets cleared after it has been processed (see also here):

import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as et

base_directory = '<path/to/files>'
file = '<xml_filename>'

os.chdir(base_directory)

def read_file(base_directory, file):

    ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

    print('Working on strings file.')

    string_it = et.parse(base_directory + '/' + file[:-4] + '_strings.xml').getroot()
    strings = []

    for st in string_it:
        strings.append(st[0].text)

    print('Working on data file.')

    iterate_file = et.iterparse(base_directory + '/' + file, events=['start', 'end'])

    print('Iterator created.')

    rows = []
    curr_column = ''
    curr_column_elem = None
    curr_row_elem = None
    count = 0

    for event, element in iterate_file:

        if event == 'start' and element.tag == ns + 'row':

            count += 1

            print('                       ', end='\r')
            print(str(count) + ' rows done', end='\r')

            if curr_row_elem is not None:
                rows.append(curr_row_elem)

            curr_row_elem = []
            element.clear()

        if curr_row_elem is not None:
            ### Column element started
            if event == 'start' and element.tag == ns + 'c':
                curr_column_elem = element
                curr_column = ''

            ### Column element ended
            if event == 'end' and element.tag == ns + 'c':
                curr_row_elem.append(curr_column)
                element.clear()
                curr_column_elem.clear()

            ### Value element ended
            if event == 'end' and element.tag == ns + 'v':
                ### Replace string if necessary
                if curr_column_elem.get('t') == 's':
                    curr_column = strings[int(element.text)]
                else:
                    curr_column = element.text

    ### Append the final row, which the loop above never reaches
    if curr_row_elem is not None:
        rows.append(curr_row_elem)

    df = pd.DataFrame(rows).replace('', np.nan)
    df.columns = df.iloc[0]
    df = df.drop(index=0)
    
    ### Export 
    df.to_csv(file[:-4] + '.csv', index=False)

read_file(base_directory, file)

Maybe this helps you or anyone else running into this issue. It is still relatively slow, but it worked a lot better than a basic "parse" of the whole file.

– daveknave

One possible option would be to read the .xml data inside the .xlsx directly.

An .xlsx file is actually a zip file containing multiple xml files.

All the distinct emails could be in xl/sharedStrings.xml, so you could try to extract them there.

To test (with a smaller file): add '.zip' to the name of your file and view the contents.

Of course, unzipping the whole 300GB file is not an option, so you would have to stream the compressed data (of that single file inside the zip), uncompress parts in memory and extract the data you want.

I don't know Python, so I can't help with a code example.


Also: emails.append(val) will create an array/list with 1 billion items. It might be better to write those values directly to a file instead of storing them in a list (which will have to grow and reallocate memory as it fills up).
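For anyone who wants to try this in Python: a minimal sketch of the streaming idea, assuming the workbook path from the question and that a simple '@' test is enough to pick the email strings out of xl/sharedStrings.xml; it decompresses that single member on the fly and writes matches straight to a file, as suggested above:

import zipfile
import xml.etree.ElementTree as et

NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

# ZipFile.open() returns a file-like object that decompresses the member
# on the fly, so the 300 GB archive is never unpacked as a whole.
with zipfile.ZipFile(r"C:\Users\Predator\Downloads\logs_sample.xlsx") as zf, \
        zf.open("xl/sharedStrings.xml") as strings, \
        open("emails.txt", "w", encoding="utf-8") as out:
    # iterparse() streams the XML; clearing each element keeps memory flat.
    for _, element in et.iterparse(strings):
        # crude filter: assume any shared string containing '@' is an email
        if element.tag == NS + "t" and element.text and "@" in element.text:
            out.write(element.text + "\n")
        element.clear()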

– Danny_ds

To run such a task efficiently you need to use a database. SQLite can help you here.

Using pandas from http://pandas.pydata.org/ and SQLite from http://sqlite.org/.

You can install pandas with pip, or with conda from Continuum.

import pandas as pd
import sqlite3 as sql

# create a connection/db
con = sql.connect('logs_sample.db')

# read your file
df = pd.read_excel("C:\\Users\\Predator\\Downloads\\logs_sample.xlsx")

# send it to the db
df.to_sql('logs_sample', con, if_exists='replace')
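Once the data is in the database, the single column can be pulled back with a plain SQL query instead of re-reading the workbook; a short sketch, where 'email' is only an assumed column name:

import pandas as pd
import sqlite3 as sql

con = sql.connect('logs_sample.db')

# 'email' and 'logs_sample' are the assumed column and table names
emails = pd.read_sql_query("SELECT email FROM logs_sample", con)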

See more at http://pandas.pydata.org

– Mo. Atairu