
How do I create a TensorFloat16Bit when manually tensorizing the data?

We tensorized our data based on this Microsoft example, where we convert pixel values from the 0-255 range to 0-1 and change the RGBA channel order.

        ...
        std::vector<int64_t> shape = { 1, channels, height , width };
        float* pCPUTensor;
        uint32_t uCapacity;

        // The channels of image stored in buffer is in order of BGRA-BGRA-BGRA-BGRA. 
        // Then we transform it to the order of BBBBB....GGGGG....RRRR....AAAA(dropped) 
        TensorFloat tf = TensorFloat::Create(shape);
        com_ptr<ITensorNative> itn = tf.as<ITensorNative>();
        CHECK_HRESULT(itn->GetBuffer(reinterpret_cast<BYTE**>(&pCPUTensor), &uCapacity));

        // 2. Transform the data in buffer to a vector of float
        if (BitmapPixelFormat::Bgra8 == pixelFormat)
        {
            for (UINT32 i = 0; i < size; i += 4)
            {
                // suppose the model expects BGR image.
                // index 0 is B, 1 is G, 2 is R, 3 is alpha(dropped).
                UINT32 pixelInd = i / 4;
                pCPUTensor[pixelInd] = (float)pData[i];
                pCPUTensor[(height * width) + pixelInd] = (float)pData[i + 1];
                pCPUTensor[(height * width * 2) + pixelInd] = (float)pData[i + 2];
            }
        }

ref: https://github.com/microsoft/Windows-Machine-Learning/blob/2179a1dd5af24dff4cc2ec0fc4232b9bd3722721/Samples/CustomTensorization/CustomTensorization/TensorConvertor.cpp#L59-L77

I just converted our .onnx model to float16 to check whether it would improve inference performance when the available hardware supports float16. However, the binding fails, and the suggestion here is to pass a TensorFloat16Bit instead.

If I simply swap TensorFloat for TensorFloat16Bit, I get an access violation exception at `pCPUTensor[(height * width * 2) + pixelInd] = (float)pData[i + 2];` because pCPUTensor is half the size it was before. It seems like I should reinterpret_cast to uint16_t** or something along those lines, so that pCPUTensor has the same number of elements as it did with TensorFloat, but then I get further errors that the argument can only be uint8_t** or BYTE**.

Any ideas on how I can modify this code so I can get a custom TensorFloat16Bit?

Anna Maule

1 Answer


Try the factory methods on TensorFloat16Bit.

However, you will need to convert your data to float16:

https://stackoverflow.com/a/60047308/11998382

Also, I might recommend doing the conversion within the ONNX model instead.

Tom Huntington
  • Thank you, that is what we ended up doing a couple of weeks ago. I ran into the same Stack Overflow post about this exact float16 conversion. Our 32-bit implementation was using the `Create()` factory method, so I used that instead. Thank you for your answer. – Anna Maule Feb 01 '23 at 16:12
  • @AnnaMaule did you run into any performance issues? My performance on WinML is dreadful compared to pytorch... – Tom Huntington Feb 01 '23 at 22:00
  • Yes, I ran the same model in the PyTorch/Python framework, and then after converting that model to .onnx and running it inside WinML with GPU acceleration end-to-end, there was a significant performance hit; it was very disappointing to see such a discrepancy between the two frameworks. – Anna Maule Feb 02 '23 at 19:09
  • @AnnaMaule it's the dynamic axes. Take them out and it's just as fast. Very disappointing. I made a [github issue](https://github.com/microsoft/Windows-Machine-Learning/issues/513#issue-1567315863) about this – Tom Huntington Feb 02 '23 at 21:15
  • @AnnaMaule they've fixed it for the next release https://github.com/microsoft/onnxruntime/issues/14550 – Tom Huntington Feb 03 '23 at 00:13
  • Thank you so much for sharing your GitHub issues. I will give this a try and see how much it can improve the performance of the models I am looking at. – Anna Maule Feb 03 '23 at 20:47
  • @AnnaMaule you don't have to re-export your models from PyTorch. You can use this [api](https://learn.microsoft.com/en-us/uwp/api/windows.ai.machinelearning.learningmodelsessionoptions.overridenameddimension?view=winrt-22621) instead (see the sketch after these comments) – Tom Huntington Feb 03 '23 at 22:40
  • Sorry for the delay. I tried the API you mentioned, and unfortunately it did not help with the performance improvements I was hoping to see. – Anna Maule Mar 06 '23 at 20:13
  • I noticed you are using a Python framework, while I am seeing a big discrepancy in model inference performance between a Python implementation and a C++/.NET implementation. – Anna Maule Mar 06 '23 at 20:18
  • I'm using C++ WinML; I just used Python to get profiling. The inference times were exactly the same for me – Tom Huntington Mar 06 '23 at 20:58
  • Maybe you're running one on the CPU and one on the GPU? – Tom Huntington Mar 06 '23 at 20:59
  • Maybe one's using an older onnxruntime version – Tom Huntington Mar 06 '23 at 21:01
  • No, both are GPU accelerated, and we are using the latest onnxruntime version for the C++ implementation. – Anna Maule Mar 07 '23 at 19:33