
I'm stuck on some Python code I'm writing. I'm a beginner, so this may be really easy, but I just can't see it. Any help would be appreciated. Thank you in advance :)

Here is the problem: I have to read data files with a special extension, .fen, into a pandas DataFrame. Each .fen file is inside a zipped .fenx file, which contains the .fen file and a .cfg configuration file.

In the code I've written, I use the zipfile library to unzip the files and then load them into the DataFrame. The code is the following:

import zipfile
import numpy as np
import pandas as pd

def readfenxfile(Directory,File):

    fenxzip = zipfile.ZipFile(Directory+ '\\' + File, 'r')
    fenxzip.extractall()
    fenxzip.close()

    cfgGeneral,cfgDevice,cfgChannels,cfgDtypes=readCfgFile(Directory,File[:-5]+'.CFG')
    #readCfgFile reads the .cfg file and returns some important data.
    #Only cfgDtypes matters here: it describes the data types inside the .fen file, and its field names become the columns of the final DataFrame.
    if cfgChannels is not None:
        dtDtype=eval('np.dtype([' + cfgDtypes + '])')
        dt=np.fromfile(Directory+'\\'+File[:-5]+'.fen',dtype=dtDtype)
        dt=pd.DataFrame(dt)
    else:
        dt=[]

    return dt,cfgChannels,cfgDtypes

Now, the extractall() method saves the unzipped files to the hard drive. The .fenx files can be quite big, so storing them (and deleting them afterwards) is really slow. I would like to do the same thing I do now, but load the .fen and .cfg files into memory instead of onto the hard drive.

I have tried things like fenxzip.read('whateverthenameofthefileis.fen') and other methods such as .open() from the zipfile library, but I can't get what .read() returns into a numpy array in any way I've tried.

I know this can be a difficult question to answer, because you don't have the files to try it and see what happens. But if anyone has any ideas, I would be glad to read them. :) Thank you very much!

edumugi
  • Maybe this other answer will help: http://stackoverflow.com/questions/10908877/extracting-a-zipfile-to-memory#10909016. Once you have the ZipFile in memory, you can use BytesIO, which supports the file interface, and use it in np to get your array. But, as you mention, if your zip file is so big that it takes a long time to uncompress to disk, then I'm not sure doing the same in memory would really be convenient. Your process could end up using so much memory that the kernel decides to kill it. – Alberto Apr 06 '17 at 14:48
  • Back up a bit. Focus on one file. How was it written, and what's the matching read method? Get that working first, then worry about handling the compression and large number of files. – hpaulj Apr 06 '17 at 15:46
  • @mydaemon `myzip = zipfile.ZipFile(io.BytesIO(open(Directory))) readfile=myzip.open(File[:-5]+'.fen')` This is how I tried to implement it with the io.Bytes but I don't really know how to correctly open the file so that BytesIO will take it. – edumugi Apr 06 '17 at 16:02
  • @hpaulj All 3 of these files were 'invented' by another person. I will try to look at the problem with him and maybe he can help me. The thing is that the code I posted has no problem opening them, but I can't find a way of doing the exact same thing in memory instead of on the hard drive – edumugi Apr 06 '17 at 16:05
  • @edumugi, I'll try to answer in a real answer to your last comment. – Alberto Apr 07 '17 at 07:41
  • @mydaemon I finally managed to do it with the tempfile library. I will edit my post to show the solution. Thank you very much for your time :) – edumugi Apr 07 '17 at 11:22
  • @edumugi: if you found your own answer, add it as an _answer_ to your question. – DSM Apr 07 '17 at 11:59

1 Answer


Here is the solution I finally found, in case it is helpful to anyone. It uses the tempfile library to create a temporary file object that is held in memory.

import zipfile
import tempfile
import numpy as np
import pandas as pd

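def readfenxfile(Directory,File,ExtractDirectory):
    #ExtractDirectory is unused in this version.

    fenxzip = zipfile.ZipFile(Directory+ '\\' + File, 'r')

    #The temporary file stays in memory as long as it is smaller than max_size bytes.
    fenfile=tempfile.SpooledTemporaryFile(max_size=10000000000,mode='w+b')
    fenfile.write(fenxzip.read(File[:-5]+'.fen'))
    #readCfgFile now takes the open ZipFile instead of a directory.
    cfgGeneral,cfgDevice,cfgChannels,cfgDtypes=readCfgFile(fenxzip,File[:-5]+'.CFG')

    if cfgChannels is not None: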
        dtDtype=eval('np.dtype([' + cfgDtypes + '])')
        fenfile.seek(0)
        dt=np.fromfile(fenfile,dtype=dtDtype)
        dt=pd.DataFrame(dt)
    else:
        dt=[]
    fenfile.close()
    fenxzip.close()    
    return dt,cfgChannels,cfgDtypes
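
A possible alternative, in case it helps: as far as I understand, np.fromfile needs a real file descriptor, and the tempfile documentation says a SpooledTemporaryFile is written to disk once its fileno() method is called, so the approach above may still touch the hard drive. Below is a minimal sketch that skips the temporary file and passes the bytes from ZipFile.read() straight to np.frombuffer. The helper name read_member_to_frame, the member name demo.fen and the two-field dtype are made up for illustration; the real dtype would come from cfgDtypes.

```python
import io
import zipfile

import numpy as np
import pandas as pd


def read_member_to_frame(zip_source, member_name, dtype):
    """Read one binary member of a zip archive into a DataFrame, in memory."""
    with zipfile.ZipFile(zip_source, 'r') as archive:
        # read() returns the decompressed member as a bytes object
        raw = archive.read(member_name)
    # frombuffer returns a read-only view of the bytes; copy() makes it writable
    records = np.frombuffer(raw, dtype=dtype).copy()
    return pd.DataFrame(records)


# Hypothetical record layout standing in for what cfgDtypes describes.
demo_dtype = np.dtype([('time', '<f8'), ('value', '<i4')])
demo_records = np.array([(0.0, 10), (0.5, 20), (1.0, 30)], dtype=demo_dtype)

# Build a small archive in memory so the sketch runs without real .fenx files.
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('demo.fen', demo_records.tobytes())

demo_zip.seek(0)
frame = read_member_to_frame(demo_zip, 'demo.fen', demo_dtype)
print(frame)
```

With a real file, zip_source would be the path to the .fenx file instead of the BytesIO object. Note that the whole decompressed .fen still has to fit in memory.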
edumugi