0

I have an xlsx file with 500,000 rows. I want to copy it into a CSV file, but my program stops after copying only about 65k rows, and it takes too long even to write those 65k lines. Here is the code I am using to copy data from the xlsx file to the CSV file:

import openpyxl
import csv
import time

def csv_from_excel():
    t1 = time.time() * 1000
    # use_iterators streams the sheet instead of loading it all into memory
    workbook = openpyxl.load_workbook(filename='5lac_xlsx.xlsx', optimized_read=True, use_iterators=True)
    sh = workbook.get_sheet_by_name('Sheet1')
    your_csv_file = open('your_csv_file.csv', 'wb')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
    t2 = time.time() * 1000
    print (t2 - t1)

    m1 = time.time() * 1000
    count = 0

    for row_t in sh.iter_rows():
        for cell in row_t:
            try:
                count = count + 1
                wr.writerow([cell[3]])  # cell[3] is the cell's internal value
            except:
                print "error"

    print "count"
    print count

    your_csv_file.close()

    m2 = time.time() * 1000
    print (m2 - m1)

csv_from_excel()
kadamb
  • are you sure your code works? `optimized_read` is not an option AFAIK; you just need to pass `use_iterators = True`. Also, can you give a sample of the data? I created an ad-hoc spreadsheet and had no problems reading and writing > 1,000,000 rows using your code (with some small adjustments) – Brad Feb 21 '14 at 16:24
  • @Brad yes it works though only for 65k rows... and my rows are like "8818888761.0" "8818888762.0" "8818888763.0" "8818888764.0" "8818888765.0" "8818888766.0" "8818888767.0" "8818888768.0" "8818888769.0" "8818888770.0" "8818888771.0" "8818888772.0" "8818888773.0" "8818888774.0" "8818888775.0" "8818888776.0" "8818888777.0" "8818888778.0" "8818888779.0" "8818888780.0" "8818888781.0" "8818888782.0" "8818888783.0" "8818888784.0" "8818888785.0" "8818888786.0" "8818888787.0" "8818888788.0" "8818888789.0" "8818888790.0" "8818888791.0" "8818888792.0" "8818888793.0" "8818888794.0" "8818888795.0" – kadamb Feb 22 '14 at 06:43
  • It's possible that your source file isn't quite right and is reporting the sheet as having 65k rows. If this is the case then it's possible to force `iter_rows` to ignore this. e.g. `for c in ws['A1':'B5000000']` – Charlie Clark Feb 23 '14 at 16:18
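
A minimal sketch of the range-forcing workaround from the last comment, using the slicing syntax suggested there and the same old openpyxl/Python 2 API as in the question; the file name, sheet name, and the A1:D500000 range are placeholders taken from the question and must be adjusted to the real extent of your data:

import openpyxl

# Load the workbook in iterator mode, as in the question
# (newer openpyxl spells this read_only=True).
wb = openpyxl.load_workbook(filename='5lac_xlsx.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name('Sheet1')

# Slice an explicit range instead of trusting the sheet's reported
# dimensions; adjust 'D500000' to the real size of your data.
count = 0
for row in ws['A1':'D500000']:
    count += 1
print count  # should go well past 65,536 if the data is really there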

1 Answer

1

A similar problem showed up when converting JSON to CSV; it was solved by reading the input file in chunks. The example code below uses a small chunk size, while a large file worked with a chunk size of 0x800000.

While the sample code is for JSON, you should be able to adapt it to your Excel file using the same principle, with an Excel reader such as xlrd or one of the other Python Excel packages (a sketch adapted to the Excel file follows the usage example below).

from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    # read the file in buffersize chunks until EOF
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:]
            except ValueError:
                # Not enough data to decode, read more
                break

This function reads the given file object in buffersize-byte chunks and has the decoder parse whole JSON objects out of the buffer. Each parsed object is yielded to the caller.

Use it like this:

with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        # process each decoded object here
        pass
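
For the Excel case itself, the iterator mode the question already uses streams the sheet row by row, which plays the same role as the chunked reads above. Below is a minimal sketch along those lines, reusing only the calls from the question (old openpyxl/Python 2 API; newer openpyxl spells the option `read_only=True`); `cell[3]` is the cell-value field the question indexes, and one CSV row is written per sheet row rather than per cell:

import csv
import openpyxl

def xlsx_to_csv(xlsx_path='5lac_xlsx.xlsx', csv_path='your_csv_file.csv'):
    # iterator mode streams the sheet instead of loading it into memory
    wb = openpyxl.load_workbook(filename=xlsx_path, use_iterators=True)
    ws = wb.get_sheet_by_name('Sheet1')
    out = open(csv_path, 'wb')
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row in ws.iter_rows():
        # one CSV row per spreadsheet row; cell[3] is the cell's value,
        # as in the question's code
        writer.writerow([cell[3] for cell in row])
    out.close()

xlsx_to_csv()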
sabbahillel
  • As Excel files are actually a zip archive, this kind of approach is not possible in client code. However, using `load_workbook(filename, use_iterators=True)` will do much the same. Note that, depending on what you're doing with the data you're reading, this won't solve all your memory problems. – Charlie Clark Feb 24 '14 at 08:48