As you state, you have several options for this. Whichever you regard as "best" will depend on your specific needs.
Probably your simplest non-open-source route would be to use Core Image. Getting the best performance out of Core Image video filtering still takes a little work, because you need to make sure the processing stays on the GPU rather than round-tripping through the CPU.
In a benchmark application within my GPUImage framework, I have code that uses Core Image in an optimized manner: I set up AV Foundation video capture, create a CIImage from each pixel buffer, and render through a Core Image context that targets an OpenGL ES context, with its properties (working color space, etc.) set for fast rendering. The settings I use there are ones suggested by the Core Image team when I talked to them about this.
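To make that concrete, here's a rough Swift sketch of that kind of setup. The filter choice and the delegate wiring are illustrative rather than lifted from my benchmark, and the GL drawable setup (a GLKView or similar, with its framebuffer bound before drawing) is omitted:

```swift
import AVFoundation
import CoreImage
import OpenGLES

final class CoreImageVideoFilter: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    // Rendering into an OpenGL ES context keeps each frame on the GPU.
    let eaglContext = EAGLContext(api: .openGLES2)!

    // Disabling color management (workingColorSpace: NSNull()) is one of the
    // settings that speeds up per-frame rendering.
    lazy var ciContext = CIContext(eaglContext: eaglContext,
                                   options: [.workingColorSpace: NSNull()])

    let filter = CIFilter(name: "CISepiaTone")!

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Wrap the camera's pixel buffer without copying it.
        let image = CIImage(cvPixelBuffer: pixelBuffer)
        filter.setValue(image, forKey: kCIInputImageKey)
        guard let filtered = filter.outputImage else { return }

        // Draw straight into the GL-backed context; no CPU readback occurs.
        ciContext.draw(filtered, in: filtered.extent, from: filtered.extent)
    }
}
```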
Going the raw OpenGL ES route is something I talk about here (with a sample application linked there), but it does take some setup. It can give you a little more flexibility than Core Image, because you can write completely custom shaders to manipulate images in ways that you might not be able to in Core Image. It used to be that this was faster than Core Image, but there's effectively no performance gap nowadays.
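As a small taste of that flexibility, a trivial color-inverting fragment shader looks like this (shown as a Swift string literal; the varying and uniform names here follow GPUImage's conventions, but they're whatever your vertex shader and host code establish):

```swift
// A minimal custom fragment shader: sample the input texture and invert it.
// You compile and link this yourself when building a raw OpenGL ES pipeline.
let invertFragmentShader = """
varying highp vec2 textureCoordinate;
uniform sampler2D inputImageTexture;

void main()
{
    lowp vec4 color = texture2D(inputImageTexture, textureCoordinate);
    gl_FragColor = vec4(1.0 - color.rgb, color.a);
}
"""
```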
However, building your own OpenGL ES video processing pipeline isn't simple, and it involves a bunch of boilerplate code. That's why I wrote GPUImage, which I, along with others, have spent a lot of time tuning for performance and ease of use. If you're concerned about not understanding how this all works, read through the GPUImageVideoCamera class code within that framework. That's what pulls frames from the camera and starts the video processing operation. It's a little more complex than my benchmark application, because in most cases it takes in YUV planar frames from the camera and converts those to RGBA in shaders, instead of grabbing raw BGRA frames. The latter is a little simpler, but there are performance and memory optimizations to be had with the former.
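For comparison with the boilerplate above, here's roughly what a whole live filtering chain looks like when driven from Swift through the bridged Objective-C API (the method names are Swift's imports of the Objective-C headers, so double-check them against your version of the framework):

```swift
import UIKit
import AVFoundation
import GPUImage

// Live camera chain: camera -> filter -> view.
// GPUImageVideoCamera wraps the AV Foundation capture session and the
// YUV-to-RGBA shader conversion described above.
let camera = GPUImageVideoCamera(sessionPreset: AVCaptureSession.Preset.vga640x480.rawValue,
                                 cameraPosition: .back)
camera.outputImageOrientation = .portrait

let sepiaFilter = GPUImageSepiaFilter()
let filteredView = GPUImageView(frame: UIScreen.main.bounds)

camera.addTarget(sepiaFilter)
sepiaFilter.addTarget(filteredView)
camera.startCameraCapture()
```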
All of the above covers live video, but prerecorded video is handled in much the same way, only with a different AV Foundation input type. My GPUImageMovie class has code within it to take in prerecorded movies and process individual frames from them; those frames end up in the same place in the pipeline as frames you would have captured from a camera.
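A sketch of that offline path, again with bridged method names that are worth verifying against the framework's own sample applications, and with placeholder URLs:

```swift
import UIKit
import GPUImage

// Offline movie filtering: movie -> filter -> writer.
// The URLs are placeholders, and error handling is omitted for brevity.
let inputURL = URL(fileURLWithPath: "input.m4v")
let outputURL = URL(fileURLWithPath: "output.m4v")

let movieFile = GPUImageMovie(url: inputURL)
let pixellateFilter = GPUImagePixellateFilter()
let movieWriter = GPUImageMovieWriter(movieURL: outputURL,
                                      size: CGSize(width: 640, height: 480))

movieFile.addTarget(pixellateFilter)
pixellateFilter.addTarget(movieWriter)

// Tear down the chain once the last frame has been written out.
movieWriter.completionBlock = {
    pixellateFilter.removeTarget(movieWriter)
    movieWriter.finishRecording()
}

movieWriter.startRecording()
movieFile.startProcessing()
```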