
I am trying to pre-process my text data for a word alignment task.

I have a text file of sentences. Each sentence is on a new line:

a man in an orange hat starring at something . 
a boston terrier is running on lush green grass in front of a white fence . 
a girl in karate uniform breaking a stick with a front kick . 
five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background . 
people are fixing the roof of a house . 
a man in light colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a strapless gown .

I am using Stanza to tokenise the sentences:

!pip install stanza

import stanza

stanza.download("en", model_dir="/content/drive/MyDrive/Internship/")
nlp_en = stanza.Pipeline("en", dir="/content/drive/MyDrive/Internship/", processors="tokenize,mwt,pos,lemma")

# Read the whole file in as a single string and run the pipeline over it.
with open("/content/drive/MyDrive/Internship/EN_sample.txt", "r", encoding="utf8") as f:
    english = f.read()

doc_en = nlp_en(english)

# Collect the tokens of each sentence into a list of its own.
en_token = []
for sentence in doc_en.sentences:
    list_of_tokens = [token.text for token in sentence.tokens]
    en_token.append(list_of_tokens)

My expected output is:

[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."],  ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],  
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", "."], 
["five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", "."], 
["people", "are", "fixing", "the", "roof", "of", "a", "house", "."], 
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]] 

Essentially, a list of lists, with each sentence in its own list and its words tokenised.

However, the output that I get is this:

[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."],  ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],  
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", ".", "five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", ".", "people", "are", "fixing", "the", "roof", "of", "a", "house", "."], 
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]]

Stanza appears to be ignoring sentence boundaries in certain instances.

Would anyone know how to remedy this?

Since the sentences are separated by newline characters, would it be possible to simply force a new list at every newline and then perform word tokenisation? If yes, how would I do that?

Thank you in advance for any help and advice.

1 Answer


This is from the documentation for the tokenize_no_ssplit option: "Assume the sentences are split by two continuous newlines (\n\n)."

https://stanfordnlp.github.io/stanza/tokenize.html

"tokenize_no_ssplit" is the flag to disable it. A possible fix would be modify your text by adding the newlines.
