I have numerous TSV files, each containing two tab-separated columns. The first column contains sentences and the second column contains the polarity of each sentence. I would like to extract the lines which have a polarity of "0".

I wrote this small piece of code, but it does not work and returns 0 sentences.

    for d in directory:
        print(" directory: ", d)
        splits = ['dev1'] #,'test1','train1']

        for s in splits:

            print(" sous-dir : ", s)
            path = os.path.join(indir, d)
            with open(os.path.join(path, s+'.tsv'), 'r', encoding='utf-8') as f_in:
              next(f_in)
              for line in f_in:
                if line.split('\t')[1] == 0:
                  doc = nlp(line.split('\t')[0])

                  line_split = [sent.text for sent in doc.sents]

                  for elt in line_split:
                    sentences_list.append(elt)


    print("nombres total de phrases :", len(sentences_list))


Why is `line.split('\t')[1]` not equal to 0 if `line` is the string "Je suis levant\t0\n"?

Example of a file:

gnfjfklfklf  0
fokgmlmlrfm  1
eoklplrmrml  0
ekemlremeùe  0

I would like to keep the lines which end with "0".

kely789456123
  • Because it's `0\n` (length 2). – CristiFati Jun 09 '20 at 09:57
  • Also, `split` returns strings, while `0` is an int. – deceze Jun 09 '20 at 09:58
  • OK, thank you. Do you know how I can improve this so that I can extract the lines which end with 0? I tried `.endswith` but got the same result. @deceze – kely789456123 Jun 09 '20 at 09:59
  • `line.strip().endswith('0')`…? – deceze Jun 09 '20 at 10:01
  • Use the `strip` function, which removes the line break, and then convert the result to int: `int(line.split('\t')[1].strip())`. https://stackoverflow.com/questions/761804/how-do-i-trim-whitespace-from-a-string – Banana Jun 09 '20 at 10:03
  • Before the if condition, add: `line = line.rstrip()` to get rid of the '\n' at the end of line. Then you can change your check to `if line.split('\t')[1] == '0':` or `if line.endswith('0'):` – DarrylG Jun 09 '20 at 10:06

1 Answer

After splitting, you need to strip the string to remove surrounding whitespace, such as the trailing line break that is read from the file along with each line. For that, Python strings have a `.strip()` method.

You're also comparing a string with an integer. In Python this does not raise an error; `==` simply evaluates to `False`, so the condition never matches. You must either change the code to compare against the string "0" or convert the value read from the file to an integer with `int()`.
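As a quick check of both points, the sample line from the question can be inspected directly. This is only an illustrative sketch; the variable name `field` is not from the original code.

```python
# Sample line taken from the question.
line = "Je suis levant\t0\n"

# The second field still carries the trailing newline.
field = line.split("\t")[1]
print(repr(field))           # '0\n'

# Comparing a str with an int never matches, and raises no error.
print(field == 0)            # False
# The trailing newline also defeats a plain string comparison.
print(field == "0")          # False

# Either of these comparisons succeeds.
print(field.strip() == "0")  # True
print(int(field) == 0)       # True (int() ignores surrounding whitespace)
```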

The condition could be rewritten as:

    if int(line.split('\t')[1].strip()) == 0:

or as:

    if line.split('\t')[1].strip() == "0":
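Put together, the filtering step might look like the sketch below. The function name `zero_polarity_sentences`, the `skip_header` parameter, and the header row in the sample data are assumptions made for illustration (the question only shows a `next(f_in)` call, which suggests a header line); the spaCy sentence splitting from the question is left out to keep the example self-contained.

```python
import io


def zero_polarity_sentences(lines, skip_header=True):
    """Return the text column of every row whose polarity column is "0".

    `lines` is any iterable of tab-separated lines, e.g. an open file.
    """
    kept = []
    rows = iter(lines)
    if skip_header:
        next(rows, None)  # mirrors the next(f_in) call in the question
    for line in rows:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        text, polarity = parts[0], parts[1].strip()
        if polarity == "0":  # compare strings, not str with int
            kept.append(text)
    return kept


# In-memory stand-in for one .tsv file; the header row is assumed.
sample = io.StringIO(
    "sentence\tpolarity\n"
    "gnfjfklfklf\t0\n"
    "fokgmlmlrfm\t1\n"
    "eoklplrmrml\t0\n"
    "ekemlremeùe\t0\n"
)
result = zero_polarity_sentences(sample)
print(result)
```

With a real file, the same function could be called on the handle opened in the question's `with open(...)` block, and each returned text could then be passed to `nlp` as before.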

Banana