Comparing audio signals is a nontrivial task.
Audio is just a sequence of values (numbers), where the index is the "time" and the value is the loudness of the sound (amplitude).
If you compare audio data like two arrays (sequences), element by element, iterating through the index, you would need luck to get anything reasonable. Instead, you need some transformation of the array that gives you aggregated information about the sequence of numbers as a whole (for example, the spectrum of the signal).
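There are mathematical tools for this task, for example the well-known Fourier transform that you mentioned and the statistical tool of correlation (autocorrelation for a signal against itself, cross-correlation for two different signals), which measures how alike two sequences of numbers are.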
The correlation method can be relatively simple: you iterate over the two arrays of data and calculate their correlation. But you pay for that simplicity with requirements on the input: the signals need to be of decent quality (or prepared/normalized first) and should have similar duration. The value of the resulting normalized correlation shows how much the two sequences differ: a value near 0 means they are completely different and a value near 1 means they are almost the same.
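A minimal sketch of the correlation idea described above, assuming both signals are already plain lists of samples with the same length and sample rate. The function name `normalized_correlation` and the test tones are made up for illustration. It computes a Pearson-style coefficient, which strictly speaking ranges from -1 to 1 (a value near -1 means the same shape with inverted sign).

```python
import math

def normalized_correlation(a, b):
    """Pearson-style correlation of two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Remove the mean so a constant offset does not affect the result.
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denominator == 0:
        return 0.0  # a constant signal has no variation to correlate
    return numerator / denominator

# Hypothetical test signals: 200 samples each.
tone = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
quieter = [0.5 * s for s in tone]                                   # same tone, half the volume
other = [math.sin(2 * math.pi * 13 * i / 200) for i in range(200)]  # a different frequency

same_score = normalized_correlation(tone, quieter)   # close to 1
diff_score = normalized_correlation(tone, other)     # close to 0
```

Note that this compares the signals sample by sample, so a time shift between otherwise identical recordings lowers the score. That is part of the price for simplicity mentioned above.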
Implementing the Fourier transform (in practice the fast Fourier transform, FFT) is not a problem either: you can take a well-described algorithm and implement it yourself in any language without using third-party libraries. It does the job very well.
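To show how little code this takes, here is a sketch of one well-known variant, the recursive radix-2 Cooley-Tukey FFT, using only the Python standard library. It assumes the input length is a power of two. It is meant as an illustration, not as a tuned implementation.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; input length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Usage: one full cosine cycle over 8 samples puts its energy in bins 1 and 7.
samples = [math.cos(2 * math.pi * i / 8) for i in range(8)]
magnitudes = [abs(c) for c in fft(samples)]
```

For real recordings you would typically pad or cut the signal to a power-of-two length first, since this sketch rejects anything else.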
The FT gives you the spectrum of the signal, i.e. another set of values: a set of amplitudes per frequency (roughly, the array index is now the frequency instead of the time, as it was for the raw input signal). You can then compare the two spectra almost like two arrays, iterating over the index (frequency), and decide on their similarity: calculate the deltas and check whether they fall within some acceptance interval (or use more rigorous statistical methods, e.g. a correlation function).
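The comparison of spectra could be sketched as follows. To keep the example self-contained it uses a direct O(n^2) DFT instead of an FFT; the resulting numbers are the same, only slower to compute. The names `magnitude_spectrum` and `spectra_similar` and the default tolerance of 0.05 are assumptions for illustration, and a real acceptance interval has to be tuned to your data.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Amplitude per frequency bin for the first n // 2 bins (direct DFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) / n)  # scale so the result does not grow with n
    return spectrum

def spectra_similar(a, b, tolerance=0.05):
    """True if every per-bin amplitude delta falls within the tolerance."""
    spec_a = magnitude_spectrum(a)
    spec_b = magnitude_spectrum(b)
    if len(spec_a) != len(spec_b):
        raise ValueError("signals must have the same length")
    return all(abs(x - y) <= tolerance for x, y in zip(spec_a, spec_b))

# Hypothetical test signals: 64 samples each.
n = 64
tone = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
shifted = [math.sin(2 * math.pi * 4 * i / n + 1.0) for i in range(n)]  # same tone, shifted in time
other = [math.sin(2 * math.pi * 9 * i / n) for i in range(n)]          # a different frequency

match_shifted = spectra_similar(tone, shifted)
match_other = spectra_similar(tone, other)
```

The amplitude spectrum ignores phase, so the time-shifted tone still matches even though an element-by-element comparison of the raw samples would not, while the tone at a different frequency is rejected.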
As for a noisy signal, the noise is usually subtracted from the given data set (but for that you need to know what type of noise you are dealing with).
All of this belongs to the field of signal processing, and if you're working on such a project you will need to learn more about it.
Bonus: a book for example