I'm trying to remove all blank lines from a large .txt file, but whatever method I use, I always get this traceback:

Traceback (most recent call last):
  File "C:\Users\svp12\PycharmProjects\practiques\main.py", line 53, in <module>
    doc = nlp(texts[line])
IndexError: list index out of range

If I don't remove these blank lines, I get IndexErrors in the two subsequent for loops (or at least I think that's the reason), which is why I'm using try/except like this:

try:
    for word in doc.sentences[0].words:
        noun.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)
except IndexError:
    errors += 1
    pass

I'd like to remove all blank lines so that I don't have to work around IndexErrors like this. Any idea how to fix it?

Here's the whole code:

import io
import stanza
import os


def linecount(filename):
    ffile = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = ffile.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines


errors = 0

with io.open('@_Calvia_2018-01-01_2022-04-01.txt', 'r+', encoding='utf-8') as f:
    text = f.read()

# replacing eos with \n, numbers and symbols
texts = text.replace('eos', '.\n')
texts = texts.replace('0', ' ').replace('1', ' ').replace('2', ' ').replace('3', ' ').replace('4', ' ')\
    .replace('5', ' ').replace('6', ' ').replace('7', ' ').replace('8', ' ').replace('9', ' ').replace(',', ' ')\
    .replace('"', ' ').replace('·', ' ').replace('?', ' ').replace('¿', ' ').replace(':', ' ').replace(';', ' ')\
    .replace('-', ' ').replace('!', ' ').replace('¡', ' ').replace('.', ' ').splitlines()

os.system("sed -i \'/^$/d\' @_Calvia_2018-01-01_2022-04-01.txt")            # removing empty lines to avoid IndexError

nlp = stanza.Pipeline(lang='ca')

nouns = []
lemmas = []
poses = []
xposes = []
heads = []
deprels = []

total_lines = linecount('@_Calvia_2018-01-01_2022-04-01.txt') - 1

for line in range(50):                                                  # range should be total_lines which is 6682
    noun = []
    lemma = []
    pos = []
    xpos = []
    head = []
    deprel = []
    # print('analyzing: '+str(line+1)+' / '+str(len(texts)), end='\r')
    doc = nlp(texts[line])
    try:
        for word in doc.sentences[0].words:
            noun.append(word.text)
            lemma.append(word.lemma)
            pos.append(word.pos)
            xpos.append(word.xpos)
            deprel.append(word.deprel)
    except IndexError:
        errors += 1
        pass
    try:
        for word in doc.sentences[0].words:
            head.extend([lemma[word.head-1] if word.head > 0 else "root"])
    except IndexError:
        errors += 1
        pass
    nouns.append(noun)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)

print(nouns)
print(lemmas)
print(poses)
print(xposes)
print(heads)
print(deprels)

print("errors: " + str(errors))                                                         # weird, seems to be range/2-1

And as a side question, is it worth importing os just for this line (the one that removes the blank lines)?

os.system("sed -i \'/^$/d\' @_Calvia_2018-01-01_2022-04-01.txt")
Jason Aller
  • @nonDucor How could I fix it then? – Questioneer Jul 07 '22 at 08:12
  • Unrelated to your question, but I couldn't help noticing that you call `.replace` *21* (!) times. As a rule of thumb, repetition in programming is bad. If you have to call something two or more times, chances are there is a better (i.e., less redundant) way of doing it. In your case, a regular expression is the way to go: `import re` at the top of your file, then `texts = re.sub(r'[\d,"·?¿:;!¡.-]', ' ', texts)` does exactly the same thing as your 21 calls to `replace`. – fsimonjetz Jul 07 '22 at 15:07
  • What are you trying to achieve with `sentences[0]`? Only process the first sentence? If `sentences` is empty (e.g., because you passed an empty line), this will result in an `IndexError`. – fsimonjetz Jul 07 '22 at 15:34
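The regular expression suggested in the comments can be checked against the chained replace calls on a small sample. This is only a sketch: the helper names and the sample string are made up for illustration, and note that in Python `\d` also matches non-ASCII digits, so the two versions could differ on text containing such characters.

```python
import re

# The 21 characters the question replaces one at a time
CHARS = '0123456789,"·?¿:;-!¡.'


def clean_with_replace(text):
    # Equivalent to the chained .replace(ch, ' ') calls in the question
    for ch in CHARS:
        text = text.replace(ch, ' ')
    return text


def clean_with_regex(text):
    # Single character class, as suggested in the comment
    return re.sub(r'[\d,"·?¿:;!¡.-]', ' ', text)


# Hypothetical sample line
sample = 'Hola, ¿què tal? 2018-01-01: "bé"!'
print(clean_with_replace(sample) == clean_with_regex(sample))
```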

2 Answers

I can't guarantee that this works because I couldn't test it, but it should give you an idea of how you'd approach this task in Python. I'm omitting the head processing/the second loop here, that's for you to figure out.

I'd recommend you throw some prints in there and look at the output, make sure you understand what's going on (especially with different data types) and look at examples of applications using Stanford NLP, watch some tutorials online (from start to finish, no skipping), etc.

import stanza
import re

def clean(line):
    # function that does the text cleaning
    line = line.replace('eos', '.\n')
    line = re.sub(r'[\d,"·?¿:;!¡.-]', ' ', line)
    
    return line.strip()

nlp = stanza.Pipeline(lang='ca')

# instead of individual variables, you could keep the values in a dictionary
# (or just leave them as they are - your call)
values_to_extract = ['text', 'lemma', 'pos', 'xpos', 'deprel']
data = {v:[] for v in values_to_extract}

with open('@_Calvia_2018-01-01_2022-04-01.txt', 'r', encoding='utf-8') as f:
    for line in f:

        # clean the text
        line = clean(line)

        # skip empty lines
        if not line:
            continue
        
        doc = nlp(line)

        # loop over sentences – this will work even if it's an empty list
        for sentence in doc.sentences:

            # append a new list to the dictionary entries
            for v in values_to_extract:
                data[v].append([])

            for word in sentence.words:
                for v in values_to_extract:

                    # extract the attribute (e.g., 
                    # a surface form, a lemma, a pos tag, etc.)
                    attribute = getattr(word, v)

                    # and add it to its slot
                    data[v][-1].append(attribute)

for v in values_to_extract:
    print('Value:', v)
    print(data[v])
    print()
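
For the head step left out above, one possible sketch is shown here. It assumes, as the question's code does, that each word's head is a 1-based index into the words of the same sentence, with 0 marking the root. The function name resolve_heads and the SimpleNamespace objects are hypothetical stand-ins so the example runs without Stanza; with real data you would pass sentence.words instead.

```python
from types import SimpleNamespace


def resolve_heads(words):
    """Return the lemma of each word's head, or "root" for the root word."""
    lemmas = [w.lemma for w in words]
    # head is assumed to be 1-based; 0 means the word is the root
    return [lemmas[w.head - 1] if w.head > 0 else "root" for w in words]


# Hypothetical stand-ins for the words of one parsed sentence
sentence_words = [
    SimpleNamespace(text="gats", lemma="gat", head=2),
    SimpleNamespace(text="dormen", lemma="dormir", head=0),
]

print(resolve_heads(sentence_words))  # ['dormir', 'root']
```

Collecting all lemmas of a sentence first means a head that appears later in the sentence can still be looked up, and an empty sentence simply yields an empty list instead of raising an IndexError.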
fsimonjetz
  • Iterating over the sentences in the doc is correct. There's no guarantee the returned document has exactly 1 sentence in this library. – John Jul 27 '22 at 06:53

The IndexError occurs because texts doesn't have 50 lines. Why do you hardcode 50?
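
A minimal sketch of looping over the list itself rather than a fixed range, so the index can never run past the end. The contents of texts here are made-up sample data, and non_blank is a name introduced only for this example.

```python
# Hypothetical stand-in for the `texts` list built with splitlines()
texts = ["primera frase ", " ", "segona frase ", ""]

# Drop entries that are empty or whitespace-only before processing
non_blank = [t for t in texts if t.strip()]

# Iterate over the list directly; enumerate supplies the progress counter
for number, line in enumerate(non_blank, start=1):
    print(f"analyzing: {number} / {len(non_blank)}: {line!r}")
```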

If you just need to remove single blank lines, you only have to do text = text.replace("\n\n","\n") (a run of several blank lines in a row needs more than one pass).

If you also need to remove lines that contain only whitespace, you can do:

text = '\n'.join(line.rstrip() for line in text.split('\n') if line.strip())
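
A quick check of the two approaches on a small string (the sample text is made up for illustration):

```python
text = "one\n\n\ntwo\n   \nthree\n"

# One replace pass collapses each pair of newlines, so a run of three
# newlines still leaves one empty line, and the whitespace-only line stays
once = text.replace("\n\n", "\n")

# Filtering line by line drops every empty or whitespace-only line
filtered = "\n".join(line.rstrip() for line in text.split("\n") if line.strip())

print(repr(once))      # 'one\n\ntwo\n   \nthree\n'
print(repr(filtered))  # 'one\ntwo\nthree'
```

Since this works on the string already in memory, no call to sed (and no import of os) is needed for this step.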

Axeltherabbit