
I am trying to read files from disk and then split them into features and labels:

import os

def generator(data_path):
    x_text = []
    _y = []
    counter = 0
    for root, dirs, files in os.walk(data_path):
        for _file in files:
            if _file.endswith(".txt"):
                # read and strip every line of the file, closing it afterwards
                with open(os.path.join(root, _file), "r", encoding="UTF8", errors="ignore") as f:
                    _contents = [s.strip() for s in f.readlines()]
                x_text = x_text + _contents

                # one-hot label per file: 1st file -> [1,0,0], 2nd -> [0,1,0], ...
                y_examples = [0, 0, 0]
                y_examples[counter] = 1
                y_labels = [list(y_examples) for s in _contents]  # one copy per line
                counter += 1

                _y = _y + y_labels

    return [x_text, _y]

I have about 3.5 GB of data on disk and I can't read it all into memory at once. How can I modify this code to read n files at a time for processing?

for X_batch, y_batch in generator(data_path):
         feed_dict = {X: X_batch, y: y_batch}
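
To make the intent concrete, the sketch below is roughly what I have in mind: a real Python generator that yields the contents of n files at a time instead of everything at once. It is untested, batch_generator and n are just placeholder names, and it still assumes at most three .txt files (three classes) like my code above.

import os

def batch_generator(data_path, n=2):
    # placeholder sketch: yield the lines and one-hot labels of n files at a time
    txt_files = [f for f in sorted(os.listdir(data_path)) if f.endswith(".txt")]
    for start in range(0, len(txt_files), n):
        x_text, _y = [], []
        for counter, _file in enumerate(txt_files[start:start + n], start=start):
            with open(os.path.join(data_path, _file), "r", encoding="UTF8", errors="ignore") as f:
                contents = [s.strip() for s in f]
            x_text += contents
            label = [0, 0, 0]          # still assumes at most 3 files/classes
            label[counter] = 1
            _y += [list(label) for _ in contents]
        yield x_text, _y

With something like that, the loop above would hand the model n files' worth of data per iteration rather than the whole 3.5 GB.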

Is there a more efficient way to read this huge dataset in TensorFlow?
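
From what I have read, the tf.data API is supposed to handle exactly this kind of streaming in TensorFlow 1.x; is something along these lines the right direction? The sketch below is untested, make_dataset is just a placeholder name, and it emits an integer class index per line instead of the one-hot vectors above.

import os
import tensorflow as tf

def make_dataset(data_path, batch_size=64):
    # placeholder sketch: stream each file line by line, pairing every line
    # with the index of the file it came from
    txt_files = [f for f in sorted(os.listdir(data_path)) if f.endswith(".txt")]
    dataset = None
    for class_index, name in enumerate(txt_files):
        lines = tf.data.TextLineDataset(os.path.join(data_path, name))
        labelled = lines.map(lambda line, idx=class_index: (line, idx))
        dataset = labelled if dataset is None else dataset.concatenate(labelled)
    return dataset.batch(batch_size)

dataset = make_dataset(data_path)
X_batch, y_batch = dataset.make_one_shot_iterator().get_next()
# X_batch and y_batch are tensors, so the graph pulls batches from disk as it
# runs instead of going through feed_dict

Would that be the recommended replacement for feed_dict here, or is there a better pattern for text data of this size?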

    https://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory – halfelf Aug 06 '18 at 01:21
  • The question shouldn't be about efficiency. It's about whether the tensorflow model can work with the data in pieces, or must it have all the data in memory at once. – hpaulj Aug 06 '18 at 02:32
  • @hpaulj Can you please guide me on how to approach this problem (perhaps a tutorial)? I am new to tensorflow – Rohit Aug 06 '18 at 04:47
