
I am trying to concatenate two large matrices of numbers. The first one, `features`, is an np.array of shape (1238, 72); the other is loaded from a .json file, as shown on the second line below, and has shape (1238, 768). I need to load, concatenate, re-index, split into folds and save each fold in its own folder. The problem is that I get Killed on the very first step (reading the .json content into `bert`):

import json
import numpy as np

# features, index_shuf, num_folds, bert_dir, data_dir and save_p are defined earlier
with open(bert_dir+"/output4layers.json", "r+") as f:
    # this is the step that gets Killed
    bert = [json.loads(l)['features'][0]['layers'][0]['values'] for l in f.readlines()]

    bert_post_data = np.concatenate((features, bert), axis=1)
    del bert

    bert_post_data = [bert_post_data[i] for i in index_shuf]
    bert_folds = np.array_split(bert_post_data, num_folds)
    for i in range(num_folds):
        print("saving bert fold", str(i), bert_folds[i].shape)
        fold_dir = data_dir+"/folds/"+str(i)
        save_p(fold_dir+"/bert", bert_folds[i])

Is there a way I can do this more memory-efficiently? I mean, there has to be a better way... pandas, the json lib?

Thanks for your time and attention

Lucas Azevedo

2 Answers


Try:

bert = [json.loads(line)['features'][0]['layers'][0]['values'] for line in f]

This way you at least do not read the whole file into memory at once. That said, if the file is huge you will still need to process what you store in `bert` incrementally rather than holding it all as one list.
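
For illustration, a minimal sketch of that idea, assuming the shapes from the question (1238 rows, 768 values per line) and that float32 precision is acceptable: each line is parsed as it is read and written straight into a preallocated array, so no intermediate list of lists is ever built.

import json
import numpy as np

n_rows, n_bert_dims = 1238, 768                      # shapes taken from the question
bert = np.empty((n_rows, n_bert_dims), dtype=np.float32)

with open(bert_dir + "/output4layers.json", "r") as f:
    for i, line in enumerate(f):                     # stream line by line, no readlines()
        bert[i] = json.loads(line)['features'][0]['layers'][0]['values']

bert_post_data = np.concatenate((features.astype(np.float32), bert), axis=1)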

Mr_and_Mrs_D
  • Yes, I'm reading other similar issues and it seems like `readlines()` is a big mistake. Apparently I should do the processing line by line. The problem is that I need to apply the reindexing... how can I possibly do that? – Lucas Azevedo Apr 27 '20 at 13:27
  • What do you mean by "apply the reindexing", @LucasAzevedo? – Mr_and_Mrs_D Apr 27 '20 at 14:06
  • Before the code snippet shown above, I have "shuffled" the labels assigned to this data. The `index_shuf` variable keeps the order used in that shuffling, so I can apply the same shuffling to the matrix resulting from the concatenation of `bert` and `features`. Is there a way of doing this without having to load the whole thing into memory? – Lucas Azevedo Apr 27 '20 at 15:26
  • You can just do `bert_post_data[index_shuf]` (see the sketch after this thread) – Mr_and_Mrs_D Apr 27 '20 at 16:41
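
A short sketch of that last suggestion, assuming `index_shuf` is a permutation of the row indices (an assumption about the surrounding code, not shown above): NumPy fancy indexing keeps the result as a single array, whereas the list comprehension in the question builds an extra Python list of row arrays.

import numpy as np

index_shuf = np.asarray(index_shuf)           # permutation of 0..n_rows-1 (assumed)
bert_post_data = bert_post_data[index_shuf]   # one reindexed copy, no list of rows
bert_folds = np.array_split(bert_post_data, num_folds)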

I found this solution while searching for a similar problem. It was not the most upvoted answer on that particular question, but in my opinion it works better than anything else.

The idea is simple: instead of keeping a list of strings (one for each line in the file), you keep a list of file offsets, one per line. When you want to access a line's content, you just seek to that offset and read it. A `LineSeekableFile` class comes in handy for this.

The only catch is that you need to keep the file object (not the whole file contents!) open for the whole process.

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Then to access it:

b_file = bert_dir+"/output4layers.json"

fin = open(b_file, "rt")
BertSeekFile = LineSeekableFile(fin)

b_line = BertSeekFile[idx] #uses the __getitem__ method

fin.close()
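
To tie this back to the original problem, here is a rough sketch (assuming `features`, `index_shuf`, `num_folds`, `bert_dir`, `data_dir` and `save_p` are defined as in the question) that builds one fold at a time through the seekable file, so only a single fold of concatenated rows is held in memory at once:

import json
import numpy as np

with open(bert_dir + "/output4layers.json", "rt") as fin:
    seekable = LineSeekableFile(fin)
    # split the shuffled row indices into folds, then materialise one fold at a time
    for i, fold_idx in enumerate(np.array_split(np.asarray(index_shuf), num_folds)):
        rows = [np.concatenate((features[j],
                                json.loads(seekable[j])['features'][0]['layers'][0]['values']))
                for j in fold_idx]
        fold = np.stack(rows)
        print("saving bert fold", i, fold.shape)
        save_p(data_dir + "/folds/" + str(i) + "/bert", fold)

This does one seek per row instead of one big read, which is slower but keeps peak memory down to a single fold.
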
Lucas Azevedo