The trade-off is fairly simple: a LUT spends extra memory in the hope of reducing the instruction count enough to save some time. Whether that's effective depends heavily on the details of the processor -- caching in particular.
For Newton-Raphson, you rewrite X/Y as X * (1/Y) and use the iteration to find 1/Y -- for the reciprocal, that iteration is r = r * (2 - Y*r), which roughly doubles the number of correct bits each pass. At least in my experience, it's rarely useful if you need full precision -- its primary strength is in letting you get to (say) 16-bit precision more quickly.
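To make that concrete, here's a minimal C++ sketch (the function name and seed value are mine, nothing standard). In a real implementation the seed would typically come from a small LUT indexed by the leading significand bits of Y -- which is where this ties back to the LUT trade-off above:

```cpp
#include <cstdio>

// Hypothetical sketch: divide x/y via a Newton-Raphson reciprocal.
// Newton's iteration for f(r) = 1/r - y is r = r * (2 - y*r); the
// relative error squares each pass, so an ~8-bit seed gives ~16 bits
// after one iteration and ~32 after two.
double nr_reciprocal(double y, double seed, int iterations) {
    double r = seed;
    for (int i = 0; i < iterations; ++i)
        r = r * (2.0 - y * r);   // each pass roughly doubles the correct bits
    return r;
}

int main() {
    double y = 7.0;
    // Crude seed, good to a few bits; a LUT on y's top significand bits
    // would normally supply something like 8 correct bits instead.
    double r = nr_reciprocal(y, 0.14, 3);
    printf("1/7  ~= %.15f\n", r);          // ~0.142857142857143
    printf("22/7 ~= %.15f\n", 22.0 * r);   // the division done as a multiply
}
```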
The usual method for division is a bit-by-bit method. Although that particular answer deals with integers, for floating point you do essentially the same thing, except that along with it you subtract the exponents. A floating point number is basically A * 2^N, where A is the significand and N is the exponent. So, to divide two numbers (A * 2^N) / (B * 2^M), you carry out the division as (A/B) * 2^(N-M), with A and B treated as (essentially) integers. The only real difference is that with floating point you normally want to round rather than truncate the result. That basically just means carrying the division out to (at least) one extra bit of precision, then rounding up if that extra bit is a one.
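As an illustration, here's a rough C++ rendering of that bit-by-bit (restoring) division with one extra quotient bit for rounding. The name and widths are my own choices, and it implements the simple round-up-on-guard-bit rule described above, not full IEEE round-to-nearest-even:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of bit-at-a-time (restoring) division, rounded to nearest by
// computing one extra "guard" bit as described above.
uint32_t divide_rounded(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint64_t num = (uint64_t)n << 1;   // pre-shift: the loop's last bit becomes the guard bit
    uint64_t rem = 0, q = 0;
    for (int i = 32; i >= 0; --i) {    // 33 iterations: 32 result bits + 1 guard bit
        rem = (rem << 1) | ((num >> i) & 1);  // bring down the next dividend bit
        q <<= 1;
        if (rem >= d) {                // does the divisor "go into" the remainder?
            rem -= d;
            q |= 1;
        }
    }
    return (uint32_t)((q >> 1) + (q & 1));  // round up if the guard bit is one
}
```

For example, divide_rounded(7, 2) gives 4 where truncating division gives 3.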
The most common method using lookup tables is SRT division. This is most often done in hardware, so I'd probably Google for something like "Verilog SRT" or "VHDL SRT". Rendering it in C++ shouldn't be terribly difficult though. Where the method I outlined in the linked answer produces one bit per iteration, SRT can be written to produce 2, 4, etc. If memory serves, the size of the table grows quadratically with the number of bits produced per iteration though, so you rarely see much more than 4 bits per iteration in practice.
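For a feel of how it works in software, here's a radix-2 SRT sketch in C++ (my own illustration, not production code). Radix 2 gets away with comparing against the constant ±1/2, so no real table is needed; it's at radix 4 and above that digit selection becomes a table indexed by a few bits of the remainder and divisor, which is what drives the table growth mentioned above:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Radix-2 SRT sketch: computes floor(a * 2^32 / b) for 0 <= a < b.
// The SRT trick is to pick each quotient digit from the redundant set
// {-1, 0, +1} using only a cheap look at the top of the partial
// remainder; the redundancy tolerates a slightly wrong guess, which a
// later digit (and a final fix-up) corrects.
uint64_t srt_divide(uint64_t a, uint64_t b) {
    assert(b != 0 && a < b && b < (1ULL << 62));
    // Normalize so b's top bit is bit 61; shifting a identically leaves
    // the quotient unchanged. In fixed point (unit = 2^62), b now
    // represents a divisor d with 1/2 <= d < 1.
    while (!(b & (1ULL << 61))) { a <<= 1; b <<= 1; }

    int64_t r = (int64_t)a;          // partial remainder, may go negative
    int64_t q = 0;                   // quotient, folded back to binary
    const int64_t half = 1LL << 61;  // the constant 1/2 in this fixed point

    for (int i = 0; i < 32; ++i) {
        r <<= 1;                     // move to the next quotient position
        int digit;
        if (r >= half)      digit = 1;   // clearly big: subtract the divisor
        else if (r < -half) digit = -1;  // clearly negative: add it back
        else                digit = 0;   // too close to call: do nothing
        r -= (int64_t)digit * (int64_t)b;
        q = 2 * q + digit;           // convert redundant digits on the fly
    }
    if (r < 0) --q;                  // final fix-up if we guessed high
    return (uint64_t)q;
}

int main() {
    // First 32 fraction bits of 1/3: expect 0x55555555.
    printf("%#llx\n", (unsigned long long)srt_divide(1, 3));
}
```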