I am trying to read every line of my input file and store the results in the lists dataset_texts and dataset_labels. Instead, I get only the last line of the file.

The variable text_str holds the text sequence on a line, and the variable labels_str holds the vector that corresponds to the text sequence on the same line. The variable label stores the position of the 1 in that vector. Finally, I want to save these values in the two lists dataset_texts and dataset_labels, but for a reason I cannot work out, only the last line is saved.

Please advise: how can I get lists containing all my lines and their respective positions of the 1 in the vector? This is the code I have so far; I have checked it line by line.

from transformers import BertTokenizer
import torch
import re

training_set_path = '../test.txt'

regexp = r'^(.*)\t(\d+)$'

dataset_texts = list()
dataset_labels = list()

input_file = open(training_set_path, 'rb' )
print("Dataset loaded")

num_labels = 0 
print("Num_labels")
print(num_labels)
#labels_str = []   # added by me 
for line in input_file:
    line = line.decode( errors = 'replace' )
    #print(line)
    if re.match(regexp, line):
      text_str = re.findall( regexp, line )[0][0]  # getting the aa sequence
      print("here text_str")
      print(text_str)
      labels_str = re.findall( regexp, line )[0][1] # getting the corresponding vector
      print("here labels_str")
      print(labels_str)
      label = labels_str.index('1')
      print("here label")
      print(label)
      dataset_texts.append( text_str )
      dataset_labels.append( label )
      num_labels = len(labels_str)
      print("Here length num_labels")
      print(num_labels)
      counter += 1

    # else:
    #   break
input_file.close()
print("______________________________________________________________________")
print("Here dataset_text")
print(dataset_texts)
print("Here dataset_labels")
print(dataset_labels)
output_file = open( logs_path, 'w')
num_labels = len(labels_str)

My output is as follows:

Dataset loaded
Num_labels
0
here text_str
Q Q L R K P A E E L G R E I T H Q L F L L G C G A Q M L K Y A S P P M A Q A W C Q V M L D T R G G V R L S E Q I Q N D L L
here labels_str
1000000000000000000000000000000000000000000000000000000000000
here label
0
Here length num_labels
61
______________________________________________________________________
Here dataset_text
['Q Q L R K P A E E L G R E I T H Q L F L L G C G A Q M L K Y A S P P M A Q A W C Q V M L D T R G G V R L S E Q I Q N D L L']
Here dataset_labels
[0]
Vykov

1 Answer

I believe the issue is with your regex. Change regexp = r'^(.*)\t(\d+)$' to regexp = r'^(.*)\t(\d+)(\r\n|\r|\n)$' so that it matches the newline characters at the end of each line.
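
To illustrate, here is a minimal sketch of how the original pattern behaves with different line endings. It assumes the input file uses Windows-style \r\n line endings, which the question does not confirm, and the sample strings are made up. It also shows that stripping the line ending before matching is one alternative to changing the pattern; the changed pattern requires a line ending, so it would skip a final line that has none.

```python
import re

regexp = r'^(.*)\t(\d+)$'  # the original pattern from the question

# Hypothetical sample lines (not taken from the real input file).
unix_line = "A B C\t100\n"       # ends in \n
windows_line = "A B C\t100\r\n"  # ends in \r\n
last_line = "A B C\t100"         # final line with no line ending

# '$' matches at the very end or just before a single trailing '\n',
# so a '\r' sitting between the digits and the '\n' blocks the match.
matches_unix = re.match(regexp, unix_line) is not None
matches_windows = re.match(regexp, windows_line) is not None
matches_last = re.match(regexp, last_line) is not None

# Stripping the line ending first lets the original pattern match.
matches_stripped = re.match(regexp, windows_line.rstrip('\r\n')) is not None

print(matches_unix, matches_windows, matches_last, matches_stripped)
```

If the file does have \r\n endings, this would explain why only the last line was collected: it is the only line without a trailing \r.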

After fixing the regex, I ran into an error at label = labels_str.index('1'), so you may want to remove that line. You will also need to define counter before the loop, since the loop increments it. The code will also fail if there are no matches, because the statements at the end use variables that are only assigned when a line matches. I would therefore initialise those variables to empty strings before the loop.
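
The points above could be combined roughly as follows. This is only a sketch under assumptions: parse_lines is a hypothetical helper, the sample byte strings are invented, and it strips line endings rather than changing the pattern. It uses str.find, which returns -1 when there is no '1', instead of str.index, which raises ValueError in that case.

```python
import re

regexp = re.compile(r'^(.*)\t(\d+)$')

def parse_lines(lines):
    """Hypothetical helper: collect each text and the position of '1' in its vector."""
    texts, labels = [], []
    num_labels = 0  # defined before the loop, so it exists even with no matches
    counter = 0     # defined before the loop, so 'counter += 1' is valid
    for raw in lines:
        # Decode as in the question, then drop any \r\n or \n line ending.
        line = raw.decode(errors='replace').rstrip('\r\n')
        match = regexp.match(line)
        if match is None:
            continue  # skip lines that are not "text<TAB>digits"
        text_str, labels_str = match.groups()
        position = labels_str.find('1')  # -1 if the vector contains no '1'
        if position == -1:
            continue
        texts.append(text_str)
        labels.append(position)
        num_labels = len(labels_str)
        counter += 1
    return texts, labels, num_labels, counter

# Invented sample input: three valid lines and one line without a label.
sample = [b"Q Q L\t100\r\n", b"R K P\t010\r\n", b"no label here\r\n", b"A E E\t001"]
texts, labels, num_labels, counter = parse_lines(sample)
print(texts, labels, num_labels, counter)
```

With a real file, the same function could be called as parse_lines(input_file) on a file opened in 'rb' mode, since iterating over such a file yields one bytes object per line.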

Hopefully I guessed the format of your input file correctly: some text, followed by a tab, and then some digits.

Sample output:

Here dataset_text
['abasd\tTEST', 'FASDASD\t345678 TEST', 'FASDASD\t345678 TEST', 'FASDASD\t345678 TEST', 'FASDASD\t345678 TEST']
Here dataset_labels
['1234', '4321', '8964', '1234', '1234']
kconsiglio