
We have a big file_name.tar.gz file here, big in the sense that our machine cannot handle it in one go. It contains three types of files inside it, let us say first_file.unl, second_file.unl, and third_file.unl.

Background about the .unl extension: pd.read_csv is able to read these files successfully without giving any kind of errors.

I am trying the steps below to accomplish the task.

step 1:

import glob

# path points at the directory holding the files
all_files = glob.glob(path + "/*.gz")

The step above lists all three types of files. I then use the code below to process them further.

step 2:

import pandas as pd

li = []

for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)

step 3:

frame = pd.concat(li, axis=0, ignore_index=True)

All three steps work perfectly only if

  1. the data is small enough to fit in our machine's memory, and
  2. there is only one type of file inside the archive.

How do we overcome this problem? Please help.

We are expecting code that can read a particular file type in chunks and create a DataFrame from it.

Also, please advise: apart from the pandas library, are there other approaches or libraries that could handle this more efficiently, considering that our data resides on a Linux server?


2 Answers


You can refer to this link: How do I read a large csv file with pandas?

In general, you can try reading the file in chunks via read_csv's chunksize parameter.
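
For example, a minimal sketch of chunked reading for one of the pipe-separated .unl files from the question (the chunk size and the per-chunk processing are placeholders to adapt):

import pandas as pd

# Read one large pipe-separated file in row chunks instead of all at once;
# chunksize (rows per chunk) is a tuning knob, not a fixed requirement.
for chunk in pd.read_csv("first_file.unl", sep="|", chunksize=100_000):
    # Process each chunk here (filter, aggregate, write out, ...) rather than
    # appending every chunk to a list, or memory use grows again.
    print(len(chunk))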

For better performance, I suggest using Dask or PySpark.
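
For instance, with Dask (a sketch; "some_column" is a placeholder for a real column name):

import dask.dataframe as dd

# Dask builds a lazy, partitioned DataFrame; partitions are loaded on demand,
# so the dataset never has to fit in memory all at once.
ddf = dd.read_csv("first_file*.unl", sep="|")

# Operations stay lazy until .compute() is called.
result = ddf["some_column"].value_counts().compute()
print(result)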

duyvan

Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
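
A minimal sketch of that approach, combined with chunked reading so a single entry never has to fit in memory whole (the archive name and separator come from the question; header_name and the chunk size are placeholder assumptions):

import tarfile
import pandas as pd

header_name = ["col_a", "col_b", "col_c"]  # placeholder column names

chunks = []
with tarfile.open("file_name.tar.gz", "r:gz") as tar:
    member = tar.next()
    while member is not None:
        # Pick out only the file type we want, e.g. first_file.unl
        if member.isfile() and member.name.endswith("first_file.unl"):
            f = tar.extractfile(member)  # file object for this entry
            # Stream the entry in row chunks; reduce or filter each chunk
            # here instead of keeping everything, if memory is still tight.
            for chunk in pd.read_csv(f, sep="|", header=0, names=header_name,
                                     index_col=False, chunksize=100_000):
                chunks.append(chunk)
        member = tar.next()

frame = pd.concat(chunks, ignore_index=True)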

Mark Adler