udpipe_annotate() in R labels the same word differently if followed by punctuation

Question

I'm doing a standard topic modelling task on nouns in newspaper articles, using udpipe to annotate the article content. With udpipe_annotate() I noticed that a word and the punctuation mark following it were sometimes kept together as one token and labelled upos = NOUN. So when I run the topic model function, LDA() from the topicmodels package, the most common words for a topic might include both 'product' and 'product.', the latter including the punctuation mark. They should be treated as the same word. How can I remedy this and remove the punctuation?

Another issue is that a word followed by a punctuation mark was sometimes labelled upos = PUNCT. E.g. 'energy' and 'energy,' were labelled differently. So I would have to include PUNCT in the analysis, and even then I run into the same problem as above, with the algorithm treating these as two different words. Is this a problem with the udpipe annotation, or is there an easy fix?

EDIT: Adding a code example using the opening sentences of the Wikipedia article on Norway in Norwegian:

library(udpipe)

# Three short documents, one per element
text <- c('Norge, offisielt Kongeriket Norge, er et nordisk, europeisk land og en selvstendig stat vest på Den skandinaviske halvøy. Geografisk sett er landet langt og smalt.', 'På den langstrakte kysten mot Nord-Atlanteren befinner Norges vidkjente fjorder seg.', 'Kongeriket Norge omfatter hovedlandet (fastlandet med tilliggende øyer innenfor grunnlinjen), Jan Mayen og Svalbard.')
id <- 1:3
df <- data.frame(text, id)

# Download and load the Norwegian Bokmaal model, then annotate
ud_model <- udpipe_download_model(language = "norwegian-bokmaal")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = df$text, doc_id = df$id)
x_df <- as.data.frame(x)

Example of the problematic output (the other tags, such as ADJ and VERB, look fine, I think):

head(x_df[x_df$upos=='NOUN',5:8], 5)

OUTPUT:

token_id	token	lemma	upos
1	Norge,	norge,	NOUN
4	Norge,	norge,	NOUN
9	land	land	NOUN
13	stat	stat	NOUN
18	halvøy.	halvøy.	NOUN

The tokens with token_id 1, 4, and 18 are not correct.

head(x_df[x_df$upos=='PUNCT',5:8])

OUTPUT:

token_id	token	lemma	upos
7	nordisk,	$nordisk,	PUNCT
10	grunnlinjen),	$grunnlinjen),	PUNCT

Here, udpipe is finding the punctuation but it also includes the preceding word.

EDIT2: The problem does not occur for me with the French or English language models. Nor does it seem to appear on the norwegian-nynorsk version.

Normally the punctuation should have been stripped off the words and be available as a separate token with upos PUNCT. You will have to add an example text and the code where it goes wrong, because right now we can only guess. – phiver, Jul 25 '22 at 08:33
I have now added an example text and code and it appears that the problem is specific to the norwegian-bokmaal udpipe language model. If the problem persists it is pretty detrimental to any analysis as it basically mischaracterises at least one word for each sentence, possibly more if there are commas etc. — Hal, Jul 25 '22 at 10:24

score 1 · Accepted Answer · answered Jul 25 '22 at 12:34

It looks like there is an issue with the norwegian-bokmaal UD 2.5 model. The UD treebank for Norwegian Bokmaal is already at version 2.10.

It works correctly if you use either the norwegian-nynorsk model or the norwegian-bokmaal UD 2.4 model.

# switch to older model
ud_model <- udpipe_download_model(language = "norwegian-bokmaal", 
                                  udpipe_model_repo = "jwijffels/udpipe.models.ud.2.4")

# nynorsk works as well
ud_model <- udpipe_download_model(language = "norwegian-nynorsk")

You can, of course, get version 2.10, but then you have to train your udpipe model yourself. More info about this in the Model Building vignette.
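Whichever model you switch to, it may be worth checking the annotation for tokens that still have punctuation glued to them. The following is a minimal base-R sketch, not part of udpipe: the helpers has_glued_punct() and strip_trailing_punct() are hypothetical names, and the toy data frame only mimics a few columns of the annotation output, with rows copied from the question.

```r
# Rows copied from the question's output; the columns mimic part of
# as.data.frame(udpipe_annotate(...)). No model is needed to run this.
anno <- data.frame(
  token = c("Norge,", "land", "stat", "halvøy.", "nordisk,"),
  upos  = c("NOUN", "NOUN", "NOUN", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a token contains a letter or digit
# and also ends in punctuation, i.e. the tokenizer did not split it.
has_glued_punct <- function(token) {
  grepl("[[:alnum:]]", token) & grepl("[[:punct:]]+$", token)
}

# Hypothetical helper: drop trailing punctuation from a token.
strip_trailing_punct <- function(token) {
  sub("[[:punct:]]+$", "", token)
}

anno$glued <- has_glued_punct(anno$token)
anno$token_clean <- strip_trailing_punct(anno$token)

# Share of affected tokens; close to 0 is what you would hope for
# from a model that tokenizes correctly.
print(mean(anno$glued))
print(anno)
```

Stripping the punctuation afterwards is only a stopgap: a token such as 'nordisk,' keeps its wrong PUNCT tag, so a correctly working model is still the real fix. The check will also flag legitimate tokens that end in a period, such as abbreviations, so treat a small non-zero share with some caution.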

answered Jul 25 '22 at 12:34

phiver

Thanks, both of your solutions solve the technical issue. I think, however, that using nynorsk on a bokmaal corpus is not a good idea. These are two different writing standards applied to the same language. There is a lot of overlap which will be coded correctly, but unless the nynorsk standard is very liberal in the language forms it accepts it will not be without considerable errors. – Hal Jul 25 '22 at 13:46
Yeah, in that case bokmaal is the better choice. Always interesting to see how a language evolves, and also that both are considered written Norwegian, with bokmaal at about 85-90%, but neither is spoken Norwegian, as everyone speaks a local dialect. :-) – phiver Jul 25 '22 at 14:00
