
We have a big file_name.tar.gz file here, big in the sense that our machine cannot handle it in one go. It contains three types of files inside it, let us say first_file.unl, second_file.unl, and third_file.unl.

Background about the .unl extension: pd.read_csv is able to read these files successfully without giving any kind of errors.

I am trying the steps below to accomplish the task.

step 1:

import glob

# path points at the directory holding the files
all_files = glob.glob(path + "/*.gz")

The step above lists all three types of files. I then use the code below to process them further.

step 2:

import pandas as pd

li = []

for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)

step 3:

frame = pd.concat(li, axis=0, ignore_index=True)

All three steps work perfectly only if

  1. the data is small enough to fit in our machine's memory, and
  2. there is only one type of file inside the archive.

How do we overcome this problem? Please help.

We are expecting code that can read a particular file type in chunks and create a DataFrame from it.

Also, please advise: apart from the pandas library, are there other approaches or libraries that could handle this more efficiently, considering that our data resides on a Linux server?


2 Answers


You can refer to this link: How do I read a large csv file with pandas?

In general, you can try reading the file in chunks via read_csv's chunksize parameter.
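
For example, a minimal sketch of chunked reading for one of the pipe-separated .unl files from the question (the chunk size and the per-chunk processing are placeholders to adapt):

import pandas as pd

# Read one large pipe-separated file in row chunks instead of all at once;
# chunksize (rows per chunk) is a tuning knob, not a fixed requirement.
for chunk in pd.read_csv("first_file.unl", sep="|", chunksize=100_000):
    # Process each chunk here (filter, aggregate, write out, ...) rather than
    # appending every chunk to a list, or memory use grows again.
    print(len(chunk))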

For better performance, I suggest using Dask or PySpark.
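
For instance, with Dask (a sketch; "some_column" is a placeholder for a real column name):

import dask.dataframe as dd

# Dask builds a lazy, partitioned DataFrame; partitions are loaded on demand,
# so the dataset never has to fit in memory all at once.
ddf = dd.read_csv("first_file*.unl", sep="|")

# Operations stay lazy until .compute() is called.
result = ddf["some_column"].value_counts().compute()
print(result)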

duyvan

Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
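
A minimal sketch of that approach, combined with chunked reading so a single entry never has to fit in memory whole (the archive name and separator come from the question; header_name and the chunk size are placeholder assumptions):

import tarfile
import pandas as pd

header_name = ["col_a", "col_b", "col_c"]  # placeholder column names

chunks = []
with tarfile.open("file_name.tar.gz", "r:gz") as tar:
    member = tar.next()
    while member is not None:
        # Pick out only the file type we want, e.g. first_file.unl
        if member.isfile() and member.name.endswith("first_file.unl"):
            f = tar.extractfile(member)  # file object for this entry
            # Stream the entry in row chunks; reduce or filter each chunk
            # here instead of keeping everything, if memory is still tight.
            for chunk in pd.read_csv(f, sep="|", header=0, names=header_name,
                                     index_col=False, chunksize=100_000):
                chunks.append(chunk)
        member = tar.next()

frame = pd.concat(chunks, ignore_index=True)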

Mark Adler