
When I run my FinBERT model in Google Colab, it always crashes the RAM at outputs = model(**inputs)

import wandb
# tokenizer and model are assumed to be loaded beforehand (e.g. a FinBERT
# checkpoint loaded with transformers' AutoTokenizer / AutoModelForSequenceClassification)
import textwrap
# Reads all files at once, but you will have to upload them again
import pandas as pd
import glob
import numpy as np
import torch


all_files = glob.glob("*.csv")
tickerList = []
textList = []
model.eval()
for filename in all_files:
    # Get ticker symbol
    ticker = filename.split('_', 1)[0].replace('.', '').upper()
    #Read file into dataframe
    df = pd.read_csv(filename)
    headlines_array = np.array(df)
    # Data frame will now be a list of text for the tokenizer to process
    text = list(headlines_array[:,0])
    textList.append(text)
    #Checks if we have seen this ticker before 
    if ticker not in tickerList:
      tickerList.append(ticker)

    # Gets data into an acceptable format for our model
    inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt')
    outputs = model(**inputs)  # time consuming and crashes the RAM, so can't put it in the for loop
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    positive = predictions[:, 0].tolist()
    negative = predictions[:, 1].tolist()
    neutral = predictions[:, 2].tolist()

    table = {'Headline': text,
         'Ticker' : ticker,
         "Positive":positive,
         "Negative":negative, 
         "Neutral":neutral}

    df = pd.DataFrame(table, columns = ["Headline", "Ticker", "Positive", "Negative", "Neutral"])
    final_table = wandb.Table(columns=["Sentence", "Ticker", "Positive", "Negative", "Neutral"])

    for headline, pos, neg, neutr in zip(text, predictions[:, 0].tolist(), predictions[:, 1].tolist(), predictions[:, 2].tolist() ): 
      final_table.add_data(headline, ticker, pos, neg, neutr)

Not quite sure what is going wrong, as outputs = model(**inputs) runs fine outside the for loop but does not seem to run even once when I bring it inside the for loop.

  • I assume that some of your datasets are pretty large and therefore you can not perform predictions in one pass? Try to limit the number of rows you feed into your model. – cronoik Apr 02 '22 at 15:26

1 Answer


You do

text = list(headlines_array[:,0])

and then later,

inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt')

So you give the tokenizer a list of texts, and it returns one tensor covering every element of your headlines_array. Unless you split that into batches, the model calculates the predictions for all headlines at once, which can exhaust the memory.
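To see the scale of the problem, it helps to check the shape of what the tokenizer returns for a full file. A minimal sketch (the ProsusAI/finbert checkpoint name here is only an assumption about which FinBERT you use):

from transformers import AutoTokenizer

# assumed checkpoint; any FinBERT-style model behaves the same way
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
text = ["Stocks rally on earnings beat"] * 5000   # pretend one file has 5,000 headlines

inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([5000, seq_len]) -> one huge batch for model(**inputs)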

You can do something like:

def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 16
for batch in chunks(text, batch_size):
    inputs = tokenizer(batch, padding = True, truncation = True, return_tensors='pt')

And then continue with the rest of your code.
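For completeness, here is a rough sketch of how the rest of the loop could look once the forward pass and softmax are moved inside the batch loop (torch.no_grad() is an extra suggestion so no gradient buffers are kept; model, tokenizer, text and chunks are the same objects as above):

batch_size = 16
all_predictions = []
for batch in chunks(text, batch_size):
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    all_predictions.append(torch.nn.functional.softmax(outputs.logits, dim=-1))

predictions = torch.cat(all_predictions)  # shape (num_headlines, 3), same layout as before

Since the per-batch results are concatenated at the end, the table-building part of your code can stay unchanged.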

Note: The chunks function is from How do you split a list into evenly sized chunks?