Gensim Lemmatization Remove Postag b'

Question

I am trying to lemmatize documents with the code below. Lemmatization works, but it produces byte strings, so the next part of the code fails with a "can't concat str to bytes" error. I then wrapped the tokens in str(), as shown in the code below. The output is now the following (I am using Python 3.7, 64-bit):

AttributeError                            Traceback (most recent call last)
<ipython-input-223-cb505389f802> in <module>
      1 #Build a Vocabulary
----> 2 model.build_vocab(train_demo_corpus)

~\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    727         """
    728         total_words, corpus_count = self.vocabulary.scan_vocab(
--> 729             documents, self.docvecs, progress_per=progress_per, trim_rule=trim_rule)
    730         self.corpus_count = corpus_count
    731         report_values = self.vocabulary.prepare_vocab(

~\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, docvecs, progress_per, trim_rule)
    807         for document_no, document in enumerate(documents):
    808             if not checked_string_types:
--> 809                 if isinstance(document.words, string_types):
    810                     logger.warning(
    811                         "Each 'words' should be a list of words (usually unicode strings). "

AttributeError: 'str' object has no attribute 'words'

Here is my code:

train_demo_corpus = list(lemmat(lee_train_demo_file))

def lemmat(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.lemmatize(line)
            if tokens_only:
                yield str(tokens)
            else:
                # For training data, add tags
                yield str(gensim.models.doc2vec.TaggedDocument(tokens, [i]))

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_demo_corpus)

Best regards,

'b' denotes a byte string. Take a look at https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal for more info. You can Google for a lot of info on how to work with this. — bivouac0, Nov 10 '19 at 15:41
Yes, as @bivouac0 notes, the `b'` you're seeing isn't part of the string, but an indicator to you, the programmer, of its type. If you're printing a raw Python object (like the `TaggedDocument`), it's proper for it to appear. On the other hand, if you try printing such a string like the first word of the first document directly – `print (train_demo_corpus[0].words[0])` – it shouldn't appear. So if there's some other problem with it, please add more details as to why it's a problem. — gojomo, Nov 10 '19 at 18:14
hi @gojomo, let me specify the problem: in the next step I got "TypeError: can't concat str to bytes" when the following codes run: model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40) model.build_vocab(train_demo_corpus) ... My question is why the codes given in my question produce binary output? How can I decode them as str? — Oguzhan Alasehir, Nov 10 '19 at 20:08
I'd need to see the whole error stack (with lines of code & line numbers) to understand that next error you've received. You could edit your question to add it, so you have more space & formatting options than in these comments. (Also, as such encoding/string-type issues vary a bit between Python 2.x & Python 3.x, please mention which you're using.) — gojomo, Nov 10 '19 at 20:58
It would have been better to add the new info, rather than changing completely what the question is about. But, the current problem with your code is that your `lemmat()` function is yielding strings. `Doc2Vec` requires each item in its training-corpus to be a `TaggedDocument`-shaped object, with `words` and `tags` properties. Change your `yield` line to simply `yield TaggedDocument(tokens, [i])` and you won't get the current error. — gojomo, Nov 10 '19 at 21:21
Hi @gojomo, actually, "else" part is for training corpus. Therefore, as you can see, it is very similar to your suggestion. As I have tried to explain in the body of question, the code produces byte string and the code "model.build_vocab(train_demo_corpus)" gives "TypeError: can't concat str to bytes." That's why I have used "str()". Btw, when I have used "tokens = gensim.utils.simple_preprocess(line)" instead of "tokens = gensim.utils.lemmatize(line..." everything is fine. — Oguzhan Alasehir, Nov 11 '19 at 20:37
There's no `TypeError` described, with full error message and stack, in your current question text, so it's hard to help debug that. The error which is shown, `AttributeError: 'str' object has no attribute 'words'`, would be completely solved by not applying `str()` over the `TaggedDocument`. So it's unclear what problem you still have, if any. — gojomo, Nov 12 '19 at 19:36

score 0 · Answer 1 · answered Apr 08 '21 at 11:41

The lemmatize function in gensim.utils encodes each lemma into a byte string via lemma.encode('utf8'). You can remove the .encode('utf8') call in your local gensim files, create your own copy of lemmatize without the encoding, or add the following to your code:

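tokens = gensim.utils.lemmatize(line)
# Join the byte tokens, decode once, then split back into a list of str;
# assign the result so the decoded tokens are actually used
tokens = b' '.join(tokens).decode('utf-8').split(' ')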
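
As a variation on the snippet above, here is a minimal sketch that decodes each token individually and can also drop the part-of-speech suffix that the title asks about. It assumes lemmatize returns byte tokens shaped like b'word/POS'. The helper name decode_lemmas and the sample input are made up for illustration and are not part of gensim.

```python
from typing import Iterable, List


def decode_lemmas(tokens: Iterable[bytes], keep_pos: bool = False) -> List[str]:
    """Decode byte-string lemmas to str, optionally dropping the '/POS' suffix.

    Hypothetical helper: assumes each token looks like b'word/POS'.
    """
    words = []
    for token in tokens:
        # Accept str as well, so already-decoded tokens pass through unchanged.
        text = token.decode("utf-8") if isinstance(token, bytes) else token
        if not keep_pos:
            # Split on the last '/', so only the trailing tag is removed.
            text = text.rsplit("/", 1)[0]
        words.append(text)
    return words


# Illustrative input only, shaped like the assumed lemmatize() output.
sample = [b"world/NN", b"be/VB", b"go/VB"]
print(decode_lemmas(sample))
print(decode_lemmas(sample, keep_pos=True))
```

The returned list of str can be passed to TaggedDocument(words, [i]) directly, without wrapping the document in str(). Decoding token by token also returns an empty list for an empty input, whereas the join/split form returns [''].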
