The answer is right there in the question: between 1 and 3 cycles, depending on things. Even on something as relatively simple as the Cortex-M4 there are enough factors that it's not necessarily possible (or useful) to specify a hard-and-fast rule. However, that's not to say we can't do a bit of reasoning given the available information:
depending on the alignment and width of the target instruction
Instruction fetches are 32 bits wide, so it's fairly safe to assume that the 3-cycle worst case involves a halfword-aligned 32-bit target instruction, which needs 2 instruction fetches before the whole instruction can be decoded. Chances are, then, that a 16-bit target instruction, or a word-aligned 32-bit one, covered by a single instruction fetch would be reached in one fewer cycle.
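Just to make that classification concrete, here's a rough C sketch (the helper name `target_needs_two_fetches` is mine, not anything official): it peeks at the first halfword of the target instruction, applies the Thumb-2 encoding rule that a first halfword of 0xE800 or above means a 32-bit instruction, and flags the halfword-aligned 32-bit case that would straddle two fetch words. It assumes the code is readable at its execution address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rough sketch: decide whether a branch target needs one or two 32-bit
 * instruction fetches. A Thumb-2 instruction is 32 bits wide when the top
 * five bits of its first halfword are 0b11101, 0b11110 or 0b11111, i.e.
 * the halfword is 0xE800 or above; anything else is a 16-bit instruction.
 */
static bool target_needs_two_fetches(uintptr_t target)
{
    target &= ~(uintptr_t)1;                         /* strip the Thumb bit   */

    uint16_t first_hw = *(const uint16_t *)target;   /* first halfword        */
    bool is_32bit = (first_hw & 0xF800u) >= 0xE800u;
    bool halfword_aligned = (target & 2u) != 0;      /* offset 2 within word  */

    /* Only a halfword-aligned 32-bit instruction straddles two fetch words,
     * so only that combination should need the extra fetch cycle.           */
    return is_32bit && halfword_aligned;
}
```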
and whether the processor manages to speculate the address early
Given the above, it seems reasonable that the difference between a successful branch-target prefetch and an unsuccessful one accounts for the other of the 2 cycles between best case and worst case. There doesn't seem to be much information available about the branch predictor, but I'd assume it's a simple static predictor in the decode stage of the pipeline, in which case register branches (including PC writes) and conditional forward branches are probably not predicted, while unconditional immediate branches and conditional backward branches are.
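To put those categories in more familiar terms, here's a made-up C function annotated with the kind of branch each construct typically compiles to in Thumb-2 - the exact instructions depend on the compiler and optimisation level, so treat the comments as typical rather than guaranteed.

```c
#include <stdint.h>

/* Illustrative only: which branch category each C construct usually lands in. */
static int32_t sum(const int32_t *data, uint32_t n, int32_t (*fixup)(int32_t))
{
    int32_t total = 0;

    for (uint32_t i = 0; i < n; i++) {    /* loop back-edge: conditional backward
                                             branch, likely predicted           */
        int32_t v = data[i];

        if (v < 0)                        /* conditional forward branch,
                                             likely not predicted               */
            v = fixup(v);                 /* call through a register (BLX Rm),
                                             likely not predicted               */

        total += v;                       /* fall-through: no branch            */
    }

    return total;                         /* return is a PC write via a register
                                             (BX LR), likely not predicted      */
}
```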
Now, this is just educated guessing - I don't know the secrets of ARM's microarchitectures, so there may be more subtleties than I've imagined here, but it's already complicated enough. I doubt anyone would care to pick through disassembled code, cross-referencing against all the possible branch/target combinations, just to account for 2 cycles here and there - if you really need to know how many cycles a piece of code executes in, then the best thing to do is just execute it and count the cycles.
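On a Cortex-M4 the usual way to do that counting is the DWT cycle counter. The sketch below assumes the CMSIS-Core definitions (normally pulled in via your device header rather than core_cm4.h directly) and that your particular part actually implements the DWT - it's optional, though most M4 devices have it.

```c
#include <stdint.h>
#include "core_cm4.h"   /* CMSIS-Core definitions; usually included via the device header */

/* Count how many cycles a snippet takes using the DWT cycle counter. */
static uint32_t measure_cycles(void (*snippet)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable trace/DWT      */
    DWT->CYCCNT = 0;                                 /* reset the counter     */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting cycles */

    uint32_t start = DWT->CYCCNT;
    snippet();                                       /* code under test       */
    uint32_t end = DWT->CYCCNT;

    return end - start;   /* includes the call and the two counter reads */
}
```

Run it once with an empty snippet first if you want to subtract the call/read overhead from the result.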