I am trying to pre-process my text data for a word alignment task.
I have a text file of sentences, with one sentence per line:
a man in an orange hat starring at something .
a boston terrier is running on lush green grass in front of a white fence .
a girl in karate uniform breaking a stick with a front kick .
five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .
people are fixing the roof of a house .
a man in light colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a strapless gown .
I am using Stanza to tokenise the sentences:
!pip install stanza

import stanza

# Download the English models to the project directory
stanza.download("en", model_dir="/content/drive/MyDrive/Internship/")

# mwt has to come directly after tokenize in the processors list
nlp_en = stanza.Pipeline("en", dir="/content/drive/MyDrive/Internship/",
                         processors="tokenize,mwt,pos,lemma")

# Read the whole file into a single string
with open("/content/drive/MyDrive/Internship/EN_sample.txt", "r", encoding="utf8") as f:
    english = f.read()

doc_en = nlp_en(english)

# Build one list of token strings per detected sentence
en_token = []
for sentence in doc_en.sentences:
    en_token.append([token.text for token in sentence.tokens])
My expected output is:
[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."], ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", "."],
["five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", "."],
["people", "are", "fixing", "the", "roof", "of", "a", "house", "."],
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]]
Essentially, a list of lists, with each sentence in its own list and its words tokenised.
However, the output that I get is this:
[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."], ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", ".", "five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", ".", "people", "are", "fixing", "the", "roof", "of", "a", "house", "."],
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]]
Sentences 3, 4 and 5 have been merged into a single list, so Stanza appears to be ignoring sentence boundaries in certain instances.
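A quick check makes the mismatch explicit (the counts below match the output shown above):

# Compare the number of non-empty input lines with the number of
# sentences Stanza detected
with open("/content/drive/MyDrive/Internship/EN_sample.txt", "r", encoding="utf8") as f:
    num_lines = sum(1 for line in f if line.strip())

print(num_lines)              # 6 lines in the file
print(len(doc_en.sentences))  # 4 sentences detected by Stanza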
Would anyone know how to remedy this?
Since the sentences are separated by newline characters, would it be possible to simply force a new list at every newline and then perform word tokenisation on each line separately? If so, how would I do that?
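Something along these lines is what I have in mind; it is only an untested sketch that reuses the nlp_en pipeline from above and processes one line at a time:

en_token = []
with open("/content/drive/MyDrive/Internship/EN_sample.txt", "r", encoding="utf8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Run the pipeline on a single line so sentences from
        # different lines can never be merged
        doc = nlp_en(line)
        for sentence in doc.sentences:
            en_token.append([token.text for token in sentence.tokens])

I am not sure whether running the pipeline once per line like this is reasonable, or whether there is a proper way to make Stanza respect the line breaks.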
Thank you in advance for any help and advice.