I'm working on an app that uses the video feed from the DJI Mavic 2 and runs it through a machine learning model to identify objects.
I managed to get my app to preview the feed from the drone using this sample DJI project, but I'm having a lot of trouble trying to get the video data into a format that's usable by the Vision framework.
I used this example from Apple as a guide to create my model (which is working!), but it looks like I need to create a VNImageRequestHandler object, which is initialized with a CVPixelBuffer (in Apple's example, pulled out of a CMSampleBuffer), in order to use Vision.
Any idea how to make this conversion? Is there a better way to do this?
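For reference, the requests array used in the snippet below is built the usual Vision way; here's a minimal sketch, where ObjectDetector is a placeholder for whatever class Xcode generates from the .mlmodel file:

import CoreML
import Vision

var requests = [VNRequest]()

func setupVisionRequests() throws {
    // Wrap the Core ML model for use with Vision. "ObjectDetector" is hypothetical.
    let visionModel = try VNCoreMLModel(for: ObjectDetector().model)
    let recognition = VNCoreMLRequest(model: visionModel) { request, _ in
        // Object-detection models produce VNRecognizedObjectObservation results.
        for observation in request.results ?? [] {
            print(observation)
        }
    }
    recognition.imageCropAndScaleOption = .scaleFill
    requests = [recognition]
}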
class DJICameraViewController: UIViewController, DJIVideoFeedListener, DJISDKManagerDelegate, DJICameraDelegate, VideoFrameProcessor {

    // ...

    func videoFeed(_ videoFeed: DJIVideoFeed, didUpdateVideoData rawData: Data) {
        let videoData = rawData as NSData
        let videoBuffer = UnsafeMutablePointer<UInt8>.allocate(capacity: videoData.length)
        videoData.getBytes(videoBuffer, length: videoData.length)
        DJIVideoPreviewer.instance().push(videoBuffer, length: Int32(videoData.length))
    }

    // MARK: VideoFrameProcessor Protocol Implementation

    func videoProcessorEnabled() -> Bool {
        // This is never called
        return true
    }

    func videoProcessFrame(_ frame: UnsafeMutablePointer<VideoFrameYUV>!) {
        // This is never called
        let pixelBuffer = frame.pointee.cv_pixelbuffer_fastupload as! CVPixelBuffer
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                        orientation: exifOrientationFromDeviceOrientation(),
                                                        options: [:])
        do {
            try imageRequestHandler.perform(self.requests)
        } catch {
            print(error)
        }
    }
} // End of DJICameraViewController class
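For what it's worth, I suspect the protocol methods are never called because the view controller is never registered with DJIVideoPreviewer as a frame processor and hardware decoding isn't turned on. Here's a sketch of the setup I believe is needed (registFrameProcessor, enableHardwareDecode, and the videoFeeder call reflect my reading of DJIWidget and the DJI SDK sample, so treat the names as assumptions):

func setupVideoPreviewer() {
    // fpvPreviewView is assumed to be the preview UIView from the sample project.
    DJIVideoPreviewer.instance().setView(fpvPreviewView)
    // Hardware decoding is what populates cv_pixelbuffer_fastupload on each frame.
    DJIVideoPreviewer.instance().enableHardwareDecode = true
    // Register self so videoProcessorEnabled() / videoProcessFrame(_:) are invoked.
    DJIVideoPreviewer.instance().registFrameProcessor(self)
    DJIVideoPreviewer.instance().start()
    // Listen to the drone's primary video feed so videoFeed(_:didUpdateVideoData:) fires.
    DJISDKManager.videoFeeder()?.primaryVideoFeed.add(self, with: nil)
}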
EDIT: From what I've gathered from DJI's (spotty) documentation, it looks like the video feed is H.264-compressed. They claim DJIWidget includes helper methods for decompression, but I haven't had success in figuring out how to use them correctly because there is no documentation surrounding their use.
EDIT 2: Here's the issue I created on GitHub for the DJIWidget framework
EDIT 3: Updated the code snippet with the additional VideoFrameProcessor methods, removing the old code from the videoFeed method.
EDIT 4: Details about how to extract the pixel buffer successfully and utilize it can be found in this comment from GitHub
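To summarize that approach as I understand it: with hardware decoding enabled, the decoded frame already carries a CVPixelBuffer in cv_pixelbuffer_fastupload, but it arrives as a raw pointer, so it needs to be bridged with unsafeBitCast rather than the as! cast from the snippet above. Roughly:

func videoProcessFrame(_ frame: UnsafeMutablePointer<VideoFrameYUV>!) {
    // cv_pixelbuffer_fastupload is only populated when hardware decoding is enabled.
    guard let fastUpload = frame.pointee.cv_pixelbuffer_fastupload else { return }
    // It comes through as an UnsafeMutableRawPointer, so bit-cast it to a CVPixelBuffer.
    let pixelBuffer = unsafeBitCast(fastUpload, to: CVPixelBuffer.self)

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: exifOrientationFromDeviceOrientation(),
                                        options: [:])
    do {
        try handler.perform(self.requests)
    } catch {
        print(error)
    }
}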
EDIT 5: It's been years since I worked on this, but since there is still some activity here, here's a relevant gist I created to help others. I can't remember the specifics around how/why this was relevant, but hopefully it makes sense!