When processing a pandas DataFrame with a for loop, I often run into errors. Once the error is fixed, I have to restart the for loop from the beginning of the DataFrame. How can I resume the loop from the position where the error occurred, instead of running it all over again? For example:

senti = []
for i in dfs['ssentence']:
   senti.append(get_baidu_senti(i))

In the code above, I'm trying to run sentiment analysis through an API and store the results in a list. However, the API only accepts GBK-encoded input, whereas my data is encoded in UTF-8, so it often fails with errors like this:

UnicodeEncodeError: 'gbk' codec can't encode character '\u30fb' in position 14: illegal multibyte sequence

So I have to delete offending characters such as '\u30fb' manually and restart the for loop. By that point the list "senti" already holds a lot of data, and I don't want to throw it away. How can I improve the for loop?

Helix Herry
  • Why not just specify an error handler when encoding? – Martijn Pieters Feb 12 '19 at 14:22
  • @MartijnPieters I have to post data in GBK format. How do I write the error handler when encoding? The only thing I can think of is deleting the offending character. – Helix Herry Feb 12 '19 at 14:33
  • That's what the error handler can do for you: `.encode('gbk', 'ignore')` skips codepoints that can't be encoded. – Martijn Pieters Feb 12 '19 at 14:51
  • And are you certain that your data is encoded as UTF8? You appear to have string values, so Unicode values already decoded from bytes (UTF-8 implies that you have binary, encoded data). – Martijn Pieters Feb 12 '19 at 14:53

1 Answer

If your API requires encoding to GBK, then just encode to that codec using an error handler other than 'strict' (the default).

'ignore' will drop any codepoints that can't be encoded to GBK:

dfs['ssentence_encoded'] = dfs['ssentence'].str.encode('gbk', 'ignore')

See the Error Handlers section of Python's codecs documentation.

If you need to pass in strings, but only strings that can safely be encoded to GBK, then I'd create a translation map suitable for the str.translate() method:

class InvalidForEncodingMap(dict):
    def __init__(self, encoding):
        self._encoding = encoding
        self._negative = set()
    def __missing__(self, codepoint):
        if codepoint in self._negative:
            raise LookupError(codepoint)
        if chr(codepoint).encode(self._encoding, 'ignore'):
            # can be mapped, record as a negative and raise
            self._negative.add(codepoint)
            raise LookupError(codepoint)
        # map to None to remove
        self[codepoint] = None
        return None

only_gbk = InvalidForEncodingMap('gbk')
dfs['ssentence_gbk_safe'] = dfs['ssentence'].str.translate(only_gbk)

The InvalidForEncodingMap class lazily creates entries as codepoints are looked up, so only codepoints that are actually present in your data are processed. I'd still keep the map instance around if you need to use it more than once, so that the cache it builds up can be reused.

Martijn Pieters