How to Average Complex Responses
IMPORTANT: The results below are from adding noise to both input and output. After understanding the system, and the models involved in the statistics, better, I now realise that was wrong. New results with noise on the output only clearly show Correln as best (with Cartesian OK). I plan to post those results here, but can’t commit to a date.
Background
I’m going to post a fragment of some internal work here. I think it’s interesting, and I wanted to share the results with others, but I also need to give some disclaimers:

This is part of an internal discussion, to help find consensus between a number of different people.

It’s only a small part of the work, focusing on a purely practical approach.
The last point is important to me. I’m a firm believer that there’s generally a right way to do statistics; that given a model for the processes involved you can use maximum likelihood or Bayesian statistics to find the right answer to a problem. But that doesn’t mean that a practical exploration of the issues isn’t also useful.
Introduction
Below I describe some results from a variety of different algorithms to calculate the transfer function for a system.
The general process for the tests was as follows:

Generate an input signal (a time series).

Filter the input signal with a high pass filter to generate an output signal.

Optionally, add noise to the two signals.

Calculate the filter response by comparing the input and output.

Plot and compare the results.
Input Signals
Two different input signals were used, each sampled at 40Hz, with a total duration of 4m10s (10,000 samples). First, a band-limited white-noise signal, a simple series of normally distributed random numbers:
Second, to simulate a repeated step, a square wave of 19 cycles:
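The two test signals can be sketched in Python (the post doesn’t say what language the original work used; function names and the seed here are illustrative):

```python
import random

FS = 40.0     # sample frequency (Hz)
N = 10_000    # 4m10s at 40Hz

def white_noise(n=N, seed=0):
    """Band-limited white noise: normally distributed samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def square_wave(n=N, cycles=19):
    """Square wave with the given number of full cycles over n samples."""
    period = n / cycles
    return [1.0 if (i % period) < period / 2 else -1.0 for i in range(n)]
```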
High Pass Filter
The filter used in the test was generated by http://www.cepd.com/calculators/highpass_dig_first.htm with the sample frequency as above and a knee at 1Hz. This gave coefficients
a0 = 1, a1 = 0.854636, b0 = 0.927318, b1 = 0.927318
The theoretical response for this filter was evaluated in the normal way, substituting exp(jwT) for z. This response is shown in the plots as a solid line and agrees well with the majority of points.
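A minimal sketch of that evaluation, using the coefficients above. The sign convention is an assumption on my part: I take the recurrence to be y[n] = a1·y[n−1] + b0·x[n] − b1·x[n−1] (i.e. the listed b1 enters with a minus sign), since that gives zero gain at DC and unity gain at Nyquist, as a high-pass should:

```python
import cmath, math

FS = 40.0                                    # sample frequency (Hz)
A1, B0, B1 = 0.854636, 0.927318, 0.927318    # coefficients as listed

def response(f):
    """Theoretical response H(z) at z = exp(jwT), w = 2*pi*f, T = 1/FS.
    Assumes H(z) = (B0 - B1/z) / (1 - A1/z); the minus sign on B1 is
    my assumption (it makes the filter high-pass)."""
    z = cmath.exp(1j * 2 * math.pi * f / FS)
    return (B0 - B1 / z) / (1 - A1 / z)
```

As a sanity check, |H| is 0 at DC, approximately 0.71 (−3dB) at the 1Hz knee, and 1 at the 20Hz Nyquist frequency.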
Additional Noise
To simulate “real” data, additional noise was added to the two signals (input and filtered output). This was normally distributed with an amplitude of 5% of the original signal.
In all cases the normally distributed noise is generated from the Box-Muller transform and a library PRNG. Neither of these is top-quality (Box-Muller is limited to roughly 6 sigma deviations and the PRNG, random(), is a non-linear additive feedback design), but they should be sufficient for the analysis here (in my experience, PRNG limitations are significant only when measuring error distributions).
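A sketch of the noise generation, in Python rather than the original C. Interpreting “5% of the original signal” as 5% of the signal’s RMS amplitude is my assumption; the helper names are illustrative:

```python
import math, random

def box_muller(rng):
    """One N(0,1) draw via the basic Box-Muller transform.  The
    sqrt(-2 ln u) term caps the largest possible deviation: with a
    coarse uniform source (e.g. 31-bit random()) that cap is around
    6-7 sigma, which is the limitation mentioned in the text."""
    u1 = rng.random()
    while u1 == 0.0:            # avoid log(0)
        u1 = rng.random()
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def add_noise(signal, fraction=0.05, seed=1):
    """Add normally distributed noise scaled to a fraction of the
    signal's RMS amplitude (my reading of the 5% figure)."""
    rng = random.Random(seed)
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return [x + fraction * rms * box_muller(rng) for x in signal]
```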
Response Calculations
The input and output signals were first processed using Welch’s method: a sliding window of size one-tenth the total duration was used, advancing by half a window width, to give a total of 19 separate sub-spectra, each overlapping its neighbours by 50% on either side.
For each window position, the input and output segments were de-medianed and multiplied by a Hann function. The FFTs of the two pre-processed segments were then calculated. The filter response can then be inferred by comparing the Fourier spectra of the input and output windows. In simple terms, the response at a given frequency is the output divided by the input.
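The per-window pre-processing can be sketched as below. Assumptions: “de-medianed” means subtracting each window’s sample median, and a plain O(n²) DFT stands in for the FFT to keep the sketch dependency-free (for 1,000-sample windows you would use a real FFT):

```python
import cmath, math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft(x):
    """Plain O(n^2) DFT; a stand-in for an FFT in this sketch."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
            for j in range(n)]

def welch_segments(x, width, step):
    """De-medianed, Hann-weighted, Fourier-transformed windows of x.
    With width = len(x)//10 and step = width//2 this yields the 19
    overlapping sub-spectra described in the text."""
    w = hann(width)
    out = []
    for start in range(0, len(x) - width + 1, step):
        seg = x[start:start + width]
        med = sorted(seg)[width // 2]          # de-median
        out.append(dft([(s - med) * wi for s, wi in zip(seg, w)]))
    return out
```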
The four processing algorithms used differ in exactly how this inferred response is averaged:
Cartesian
This method co-adds the complex response for each window and then divides by the total number of windows. It is a simple average, calculated using the usual arithmetic for complex operations (in effect, treating real and imaginary parts separately). Note that this is an “early” calculation: the response is calculated from each point in each window, and then averaged (cf Late below).
R(f) = < O(f) / I(f) >
where R is the response, as a function of frequency, O is the FFT of the output, I is the FFT of the input, and <> denotes the average over all Welch windows.
Polar
This is similar to Cartesian, but co-adds the amplitude and phase separately. Since the amplitude is the magnitude of the signal, we would expect the final result to be an overestimate in the presence of noise (just as the mean of −1 and 1 is zero, but the mean of their magnitudes is 1). The final phase is largely meaningless because phase is not “defined on a line” (the mean of 0 and 360 degrees should be 0, not the arithmetic 180).
amp(R(f)) = < amp(O(f)) / amp(I(f)) >
pha(R(f)) = < pha(O(f)) - pha(I(f)) >
Late
This approach delays the calculation of the response, choosing instead to average the input and output spectra separately across windows, and then calculate the response from the two averaged spectra. Otherwise, the analysis is similar to Cartesian.
R(f) = <O(f)> / <I(f)>
Correln
This approach delays the calculation of the response (similar to Late), but finds the average of modified input and output signals. Both the input and output signals are multiplied by the complex conjugate of the input at the given frequency before averaging.
R(f) = <O(f) I*(f)> / <I(f) I*(f)>
Note that the denominator in the expression above is the input power spectral density.
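The four averaging schemes can be sketched together. Here the spectra are per-window lists of complex bins (names illustrative); note that the Polar sketch averages phase differences, which still suffers the wrap-around problem described above:

```python
import cmath

def average_response(in_spectra, out_spectra, method):
    """Combine per-window input/output spectra (lists of equal-length
    complex lists) into one response estimate per frequency bin."""
    n = len(in_spectra)              # number of Welch windows
    pairs = list(zip(in_spectra, out_spectra))
    bins = range(len(in_spectra[0]))
    if method == "cartesian":        # R = < O/I >
        return [sum(O[f] / I[f] for I, O in pairs) / n for f in bins]
    if method == "polar":            # amplitude and phase averaged separately
        return [cmath.rect(
                    sum(abs(O[f]) / abs(I[f]) for I, O in pairs) / n,
                    sum(cmath.phase(O[f]) - cmath.phase(I[f])
                        for I, O in pairs) / n)
                for f in bins]
    if method == "late":             # R = <O> / <I>
        return [sum(O[f] for O in out_spectra) /
                sum(I[f] for I in in_spectra)
                for f in bins]
    if method == "correln":          # R = <O I*> / <I I*>
        return [sum(O[f] * I[f].conjugate() for I, O in pairs) /
                sum(abs(I[f]) ** 2 for I in in_spectra)
                for f in bins]
    raise ValueError(method)
```

In the noiseless case, where each output spectrum is exactly the input times the response, all four methods agree; they differ only in how noise propagates through the average.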
Results
The results for the white noise signal, with no additional noise, are shown above. The first thing to note is that, in general, the majority of points agree with the expected values (solid line) except for the Polar phase (as expected).
However, there are some additional points worth noting. At the very lowest frequency, all estimations are incorrect (I do not understand why). More generally, over a range of the lowest frequencies, the Correln approach is noticeably more reliable. This bears out a possible justification for that approach: “I believe this is to reduce the effect of data segments where bins of F[x] are excessively small”  nibot at dsp.stackexchange.com. Finally, Late and, to a lesser extent, Polar, are particularly unreliable at low frequencies.
The results for the step signal, with no additional noise, are shown above and confirm the conclusions from the white noise data. The main difference is additional noise / quantisation from the lack of information in the data (noise from numerical rounding is probably significant at low frequencies).
Again, Late is particularly poor and Polar gives meaningless phase results.
Additional Noise
The plots for signals with additional noise are analysed below. For these plots I have used different colours for the different algorithms, as this helps identify the “spread” of values; this was not done for the no-noise case because it became too confusing. The symbols used remain consistent across all plots.
The plot above shows the inferred filter amplitude for the white noise signal with additional noise. All algorithms cluster symmetrically around the theoretical line except Polar, which is one-sided (overestimating, as expected). There is a clear ordering of quality (least noise): Correln is best, followed by Cartesian, Polar and, finally, Late.
The next plot, above, shows the inferred filter amplitude for the step signal with additional noise. Interestingly, here there is more bias: Polar is biased too high (as before), but Correln is biased too low. Late and Cartesian are unbiased, with Cartesian having the smaller spread.
In the phase analysis I have excluded Polar.
The plot for phase gives similar conclusions to the amplitude analysis above: all are symmetrical; Correln is preferred over Cartesian, and Late is much more noisy.
The step-signal phase plot is similar to the white noise case, but noisier. No bias is seen (cf the amplitude analysis, where Correln was biased).
Conclusions
Of the four algorithms, Late is consistently noisiest, while Polar is biased with amplitude and gives meaningless phase.
Correln and Cartesian have lower spread, but Correln was biased in the noisy step data. Cartesian is consistently unbiased, but noisier than Correln, particularly when there is little signal.
For reliable, unbiased results Cartesian appears to be the best solution. It is one of the lower noise performers and remains unbiased.
Credits
With grateful thanks to nibot and Phonon at dsp.stackexchange.com.