I am running Python code on Databricks to clean text. Some of the text contains values like "环境ä¸å¥½ä½¿ç”¨", which I want to remove.

Here is the code:

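import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# `words` is not defined in the question; it is assumed here to be a set of
# known English words, for example set(nltk.corpus.words.words()).

def docs_preprocessor(docs):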
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        # print(docs[idx])
        docs[idx] = " ".join(w.lower() for w in nltk.wordpunct_tokenize(docs[idx]) if w.lower() in words or not w.isalpha())
        docs[idx] = ' '.join(s for s in docs[idx].split() if not any(c.isdigit() for c in s))
        # print(docs[idx])
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
        # print(docs[idx])
        # docs[idx] = docs[idx].lower()  # Convert to lowercase.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    docs = [[token.strip("_") for token in doc ] for doc in docs]
    # Remove words of three characters or fewer.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
    docs = [" ".join(doc) for doc in docs]

    return docs

But I am getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)

I tried fixing this using this link: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

But it didn't work.
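
For context, this message typically appears in Python 2 when a byte string containing non-ASCII bytes is implicitly decoded with the default `ascii` codec, for example when a `str` is combined with a `unicode` object. The sketch below reproduces the same kind of failure with an explicit decode. The sample bytes are an assumption (UTF-8 for the first two characters of the sample text), since the real input encoding is not stated here, and `try_ascii_decode` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# Assumption: the data arrives as UTF-8 encoded bytes (not confirmed above).
raw = u"\u73af\u5883".encode("utf-8")  # bytes for the first two sample characters


def try_ascii_decode(data):
    """Hypothetical helper: return the error raised by an ASCII decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = try_ascii_decode(raw)
print(type(error).__name__)  # UnicodeDecodeError

# Decoding with the codec that matches the data succeeds.
print(raw.decode("utf-8") == u"\u73af\u5883")  # True
```

If the real data uses a different encoding, the `utf-8` argument would need to change accordingly.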

I checked the Python version in Databricks:

from platform import python_version

print(python_version())

2.7.12

Mark Tolonen
MAC
  • Python 2 reaches its end of life in a couple of days. Is there a good reason for you to still use it? – Note: Python 3 won't magically solve all your Unicode errors. But it forces you to clearly separate text (Unicode) and byte strings, and it might help you avoid garbled text as shown in the first place, or at least allow you to cleanly remove it. – lenz Dec 19 '19 at 09:22
  • It's strange, but if you look at the print statement, the code follows Python 3 syntax. I don't know exactly what the problem is. – MAC Dec 19 '19 at 09:27
  • `print(...)` is legal Python 2 syntax, and with `from __future__ import print_function` you can even call `print(..., sep='\t')`. The point is that Python 2 allows you to do stuff like `'å' + u'å'` through implicit coercion, which often doesn't do what you actually want to happen, and it's a pain to debug. – lenz Dec 19 '19 at 10:38
  • Any ideas how to update to Python 3.6 in Databricks, if that solves the problem? – MAC Dec 19 '19 at 16:51
  • I don't know what Databricks is, so no, sorry... Whether it will solve your problem or not: see my first comment. – lenz Dec 19 '19 at 23:31
  • The traceback is incomplete and we have no way of knowing the actual encoding of your input data. Please [edit] to provide the full traceback and a [mre] with data in a well-defined representation; see [the Stack Overflow `character-encoding` tag info page](http://stackoverflow.com/tags/character-encoding/info) for guidance. – tripleee Dec 20 '19 at 08:45
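
One possible way to drop tokens like the garbled sample, sketched under assumptions: decode the input explicitly (UTF-8 is assumed here because the actual encoding is unknown), then keep only tokens made entirely of ASCII characters. The helper name `drop_non_ascii_tokens` is hypothetical and is not part of NLTK or of the code in the question.

```python
# -*- coding: utf-8 -*-
def drop_non_ascii_tokens(text, encoding="utf-8"):
    """Hypothetical helper: keep whitespace-separated tokens that are pure ASCII."""
    # Decode explicitly if a byte string was passed (assumed encoding: UTF-8).
    if isinstance(text, bytes):
        text = text.decode(encoding, errors="replace")
    kept = [tok for tok in text.split() if all(ord(ch) < 128 for ch in tok)]
    return u" ".join(kept)


# Sample with one garbled token in the middle.
sample = u"good product \u73af\u5883\u00e4\u00b8 works fine"
print(drop_non_ascii_tokens(sample))  # good product works fine
```

This also discards legitimate non-English words such as "café", so it only fits a pipeline that is meant to keep plain English tokens.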

0 Answers