I am more of an OpenGL guy, so I cannot help you much with the technical side, but I can give you a direction.
In general, a 3D camera has:
- Translation: where the camera is (x, y, z)
- Rotation: the angle of the camera around each axis
What you want to do is related to the translation part only:
- Monitor the user input and detect when the user presses one of the arrow keys.
- While an arrow key is down, keep changing the camera translation. If the user presses the up key, add a constant value to the corresponding camera translation component (x in this example; which axis counts as "forward" depends on how your scene is set up) on each iteration of your main game loop until the key is released. If the user presses the down key, subtract that value instead of adding it.
- When the key is released, your code needs to stop adding that value to the camera translation.
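The steps above could be sketched roughly like this. It is only an illustration of the idea, not DirectX or Win32 code: `Camera`, `KeyState`, and `updateCamera` are made-up names, and how you fill in `KeyState` depends on your input handling.

```cpp
// Hypothetical types for illustration; not part of DirectX or Win32.
struct Camera {
    float x = 0.0f, y = 0.0f, z = 0.0f;  // translation only, rotation left out
};

struct KeyState {
    bool up = false;    // true while the up arrow is held down
    bool down = false;  // true while the down arrow is held down
};

// Call once per iteration of the main game loop.
// x is used as in the example above; swap in whichever axis is
// "forward" in your own setup.
void updateCamera(Camera& cam, const KeyState& keys, float step) {
    if (keys.up)   cam.x += step;  // key held: keep moving forward
    if (keys.down) cam.x -= step;  // key held: keep moving backward
    // Neither key held: the translation is left untouched.
}
```

Your input code would set `keys.up` to true on key-down and back to false on key-up, and the main loop would call `updateCamera(cam, keys, step)` every iteration.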
Let's assume that your game runs at 60Hz, and you add, say, 1/60 units to the camera translation each iteration for each direction the user wants to go. If the user holds the up arrow key for 2 seconds, that is 120 iterations, so the camera will have moved forward 2 units.
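To check the arithmetic, here is a small sketch that just counts loop iterations. `distanceMoved` is a made-up helper, and deriving the step from a speed and the loop rate is my own assumption so that the same function also covers rates other than 60Hz.

```cpp
// Hypothetical helper: how far the camera moves while a key is held for
// `seconds`, if the loop runs `hz` times per second and adds `speed / hz`
// units on every iteration.
float distanceMoved(float seconds, int hz, float speed) {
    const float step = speed / static_cast<float>(hz);  // 1/60 for speed 1 at 60Hz
    const int iterations = static_cast<int>(seconds * hz);
    float position = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        position += step;  // one main-loop iteration with the key held
    }
    return position;
}
```

With `distanceMoved(2.0f, 60, 1.0f)` the loop runs 120 times and the result is 2 units, give or take floating-point rounding. If your loop does not run at a fixed rate, a hard-coded 1/60 step would make the camera speed depend on the frame rate, so you would probably want to scale the step by the measured frame time instead.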
That is the "theory" in general. For the technical side of your problem, I can only point you to web pages I found that may be useful:
DirectX camera movement - I'm guessing this article covers a lot more than you need, but it looked pretty good and I think you should read it anyway. If you are short on time, you can skip to the View Transformation part.
Input handling - nothing much to say, regular Win32 input handling. If you are not familiar with Win32 input handling, I think you should take an hour or two to learn that first.
Alright, that's it. Hope I helped.