Python / NiFi: ExecuteScript (Python) to convert UTF-16 text files to UTF-8

Question

I have my ExecuteScript processor, and I'm trying to convert any files that come through to utf-8, if they are initially utf-16.

Thus far:

flowFileList = session.get(100)
if not flowFileList.isEmpty():
  for flowFile in flowFileList: 
     # Process each FlowFile here:
     flowFileList.decode("utf-16").encode("utf-8")

I feel like this should be a fairly easy operation, as defined in these answers: here, here, and here.

This raises an error saying that the object has no attribute 'decode'.

If this is a dumb question, feel free to say so. Thanks

Cookbook for NiFi ExecuteScript: Cookbook

Andy · Accepted Answer · 2018-12-10T23:47:22.250

The issue is that you are calling decode on the flowFileList object (the list), not on the individual flowfiles.

In addition, you’ll need to actually access the flowfile content and then set the content with the new encoding. Right now you are treating the flowfile object as if it is a string, but it is not. I’m away from my computer but will have working example code later.
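To illustrate the distinction outside NiFi, here is a minimal plain-Python sketch. It assumes the content is already available as raw bytes; `contents` and `utf16_to_utf8` are hypothetical names used only for illustration, and no NiFi objects are involved.

```python
# Stand-in for flowfile content: raw UTF-16 bytes (with a byte-order mark).
utf16_bytes = u"caf\u00e9".encode("utf-16")

def utf16_to_utf8(raw_bytes):
    """Decode UTF-16 bytes to text, then re-encode the text as UTF-8 bytes."""
    return raw_bytes.decode("utf-16").encode("utf-8")

# Hypothetical stand-in for the list returned by session.get(100).
contents = [utf16_bytes, utf16_bytes]

# The list itself has no decode(); only the byte strings inside it do.
list_has_decode = hasattr(contents, "decode")

# Convert each item individually, which is what the loop body should do.
converted = [utf16_to_utf8(item) for item in contents]
```

In NiFi the bytes are not available directly on the flowfile object; they have to be read from and written back through the session, which is why a stream callback is needed.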

Update

I will provide working Python code to demonstrate this, but why can't you just use the ConvertCharacterSet processor? This accepts an input character set and output character set.

Here is working code which will convert incoming flowfile content from UTF-16 to UTF-8. You should route content that is already UTF-8 around this processor, or add code that identifies it and passes it through unchanged. You may also be interested in following NIFI-4550 - Add InferCharacterSet processor for the same behavior.
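One possible way to identify UTF-16 input is to look for a byte-order mark (BOM) at the start of the content. The sketch below is plain Python and rests on an assumption: that the UTF-16 files actually carry a BOM. Files without one will be passed through untouched, and a UTF-32 little-endian BOM begins with the same two bytes, so treat this as a heuristic. The function names are hypothetical.

```python
import codecs

def looks_like_utf16(raw_bytes):
    """Heuristic: True if the content starts with a UTF-16 byte-order mark."""
    return raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

def to_utf8_if_utf16(raw_bytes):
    """Convert BOM-marked UTF-16 content to UTF-8; return anything else unchanged."""
    if not looks_like_utf16(raw_bytes):
        return raw_bytes  # no-op for content that is not recognisably UTF-16
    # The "utf-16" codec reads the BOM to pick the byte order and strips it.
    return raw_bytes.decode("utf-16").encode("utf-8")
```

The same check could be made inside a stream callback before decoding, so that content which is already UTF-8 is written back as-is.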

import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_16)
        outputStream.write(bytearray(text.encode('utf-8')))
# end class

flowFileList = session.get(100)
if not flowFileList.isEmpty():
    for flowFile in flowFileList:
        flowFile = session.write(flowFile, PyStreamCallback())
        flowFile = session.putAttribute(flowFile, 'script_character_set', 'UTF-8')
        session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end

edited Dec 10 '18 at 23:47

answered Dec 10 '18 at 22:46

Andy

I'm woefully ignorant when it comes to Python, unfortunately. I appreciate your help, and this is a great learning opportunity. I will test tomorrow – papelr Dec 11 '18 at 01:42
Long story short, if you know what incoming content is UTF-16 and what isn't, route just the UTF-16 to a `ConvertCharacterSet` processor with explicit input and output character sets configured. If you don't, you'll have to use code to determine the character set and then selectively convert it using the code above. – Andy Dec 11 '18 at 01:44
To answer why `ConvertCharacterSet` did not work - it was returning something totally beyond the pale, hence `ExecuteScript` – papelr Dec 11 '18 at 14:42
It's throwing an error at line 18- in the for loop, `flowfile = session.write(flowFile,PyStreamCallback()`, saying that `TypeError: write(): 1st arg can't be configured to byte[]`. Something to do with the class? I think – papelr Dec 11 '18 at 16:36
Interestingly enough, I deleted the `bytearray` before text.encode, and that moved the file through. BUT, just like `ConvertCharacterSet`, it returned random Chinese characters – papelr Dec 11 '18 at 17:10
I believe this is because you're running it against input that is not `UTF-16` encoded. If the input is already `UTF-8`, for example, and you try to decode it using `UTF-16` and then re-encode it in `UTF-8`, it will look like that. You need to run this processing only on input that is specifically `UTF-16`. – Andy Dec 13 '18 at 19:45
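The effect described in the last comment can be reproduced in plain Python. This is a small sketch, assuming the input is really UTF-8 and is decoded as big-endian UTF-16 (the byte order Java's `UTF-16` charset falls back to when no byte-order mark is present); the sample text is made up.

```python
# Content that is already UTF-8 (here, plain ASCII letters).
utf8_bytes = u"helloworld".encode("utf-8")

# Misinterpreting it as UTF-16 pairs up the bytes: each two ASCII letters
# become one 16-bit code unit, and those values fall in the CJK range.
misread = utf8_bytes.decode("utf-16-be")

# Re-encoding as UTF-8 faithfully preserves the wrong characters.
round_tripped = misread.encode("utf-8")

# Check that every resulting character is a CJK Unified Ideograph.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)
```

Ten ASCII letters turn into five ideographs, which matches the "random Chinese characters" symptom when the input was not UTF-16 to begin with.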
