Supplementary Materials
Additional Considerations
Matrix Equations
The purpose of this translation was originally to motivate the writing of the scikit to which it is attached. As such, the calculations shown for linear regressions in the paper can be greatly simplified using the power of existing linear algebra solvers. The implementations discussed in this section are targeted at numpy and scipy in particular. However, the formulation shown here will most likely carry over to a number of other modern linear least-squares packages, especially since many software packages, including numpy and scipy, share LAPACK as a backend for performing the actual computations.
With a linear algebra package such as LAPACK, it is sufficient to compute the coefficient matrix M and ordinate vector p, without performing any further operations on them. A linear regression is generally solved as something equivalent to

params = (M^T M)^{-1} M^T p

where params is the vector of regression coefficients. The formulae in the paper show how to compute the elements of M^T M and M^T p, which is usually done more efficiently by existing packages when given the raw M and p. The names M and p were chosen to avoid conflict with the symbols that are used heavily, with different meanings, throughout the paper and the software package documentation.
The following sections show how to construct such simplified solutions to the equations in the paper. Solutions are described briefly, and presented concretely with Python code. The solutions here are for conceptual reference only. They are not a complete or exact reflection of how things are done in the scikit itself. The scikit makes an effort to optimize for speed over legibility, which does not suit the purpose of this exposition.
In the code snippets below, x and y are assumed to be one-dimensional numpy arrays, both with an equal number of elements, n. Numpy is implicitly imported under the conventional name np. The solution to each regression can be obtained by running one of the following two function calls:
params, *_ = scipy.linalg.lstsq(M, p)
params, *_ = numpy.linalg.lstsq(M, p)
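As a quick illustration of the equivalence claimed above, the following sketch (with made-up data for a straight-line fit; scipy is imported explicitly here) checks that the explicit normal-equations solution matches lstsq:

import numpy as np
from scipy.linalg import lstsq

# Made-up data for a straight-line fit y = a + b * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

M = np.stack((np.ones_like(x), x), axis=1)    # Coefficient matrix
p = y                                         # Ordinate vector

params, *_ = lstsq(M, p)                      # Left entirely to the solver
explicit = np.linalg.solve(M.T @ M, M.T @ p)  # Normal equations, for comparison only
print(np.allclose(params, explicit))          # True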
Gaussian PDF
Algorithm originally presented in 3. Example: Illustration of the Gaussian Probability Density Function and summarized here.
M is a matrix with the cumulative sums S and T as the columns. In numpy terms:
d = 0.5 * np.diff(x)                     # Half-widths of the trapezoids
S = np.cumsum((y[1:] + y[:-1]) * d)      # Cumulative trapezoidal integral of y
S = np.insert(S, 0, 0)                   # The sum starts at zero
xy = x * y
T = np.cumsum((xy[1:] + xy[:-1]) * d)    # Cumulative trapezoidal integral of x * y
T = np.insert(T, 0, 0)
M = np.stack((S, T), axis=1)
p is the vector of measured y values decreased by its first element. In numpy terms:
p = y - y[0]
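As a hypothetical end-to-end illustration, the following sketch generates clean samples of a Gaussian PDF and recovers μ and σ from the two regressed coefficients. The mapping A = μ/σ², B = -1/σ² used at the end is derived here from the integral equation rather than quoted from the paper, so treat it as an assumption:

import numpy as np
from scipy.linalg import lstsq

mu, sigma = 2.0, 0.5                         # "True" parameters for this sketch
x = np.linspace(0.0, 4.0, 101)
y = np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

d = 0.5 * np.diff(x)
S = np.insert(np.cumsum((y[1:] + y[:-1]) * d), 0, 0)
xy = x * y
T = np.insert(np.cumsum((xy[1:] + xy[:-1]) * d), 0, 0)

M = np.stack((S, T), axis=1)
p = y - y[0]

(A, B), *_ = lstsq(M, p)
sigma1 = np.sqrt(-1.0 / B)                   # Assumed mapping: B = -1 / sigma**2
mu1 = -A / B                                 # Assumed mapping: A = mu / sigma**2
print(mu1, sigma1)                           # Close to (2.0, 0.5)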
Gaussian CDF
Algorithm originally presented in Appendix 2: Linear Regression of the Gaussian Cumulative Distribution Function and summarized here.
M is a matrix with just the x values and ones as the columns. In numpy terms:
M = np.stack((x, np.ones_like(x)), axis=1)
p is a more complicated function of y in this case:
p = scipy.special.erfinv(2 * y - 1)
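A similar hypothetical sketch for the CDF case: since erfinv(2y - 1) = (x - μ) / (σ √2), the fitted slope and intercept map back to σ and μ as shown in the last two lines (again a mapping derived here, not quoted from the paper):

import numpy as np
from scipy.linalg import lstsq
from scipy.special import erf, erfinv

mu, sigma = 2.0, 0.5                         # "True" parameters for this sketch
x = np.linspace(0.0, 4.0, 101)
y = 0.5 * (1.0 + erf((x - mu) / (sigma * np.sqrt(2.0))))

M = np.stack((x, np.ones_like(x)), axis=1)
p = erfinv(2 * y - 1)

(A, B), *_ = lstsq(M, p)
sigma1 = 1.0 / (A * np.sqrt(2.0))            # Slope A = 1 / (sigma * sqrt(2))
mu1 = -B / A                                 # Intercept B = -mu / (sigma * sqrt(2))
print(mu1, sigma1)                           # Close to (2.0, 0.5)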
Exponential
Algorithm originally presented in 2. Regression of Functions of the Form y(x) = a + b exp(c x) and summarized here.
In the first regression, performed to obtain the parameter c, we have for M1:
d = 0.5 * np.diff(x)
S = np.cumsum((y[1:] + y[:-1]) * d)
S = np.insert(S, 0, 0)
M1 = np.stack((x - x[0], S), axis=1)
p1 is the vector of measured y values decreased by its first element. In numpy terms:
p1 = y - y[0]
Once we find a fixed value for c (as the variable c1), we can fit the remaining parameters with M2 constructed from a column of ones and exp(c1 * x). In numpy terms:
M2 = np.stack((np.ones_like(x), np.exp(c1 * x)), axis=1)
p2 is just the raw y values for this regression. In numpy terms:
p2 = y
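The two stages can be chained as in the following hypothetical sketch. The only non-obvious step is reading c off the first regression: integrating y' = c (y - a) gives y - y(x1) = -c a (x - x1) + c S, so c is the coefficient of the S column (a derivation made here, not quoted from the paper):

import numpy as np
from scipy.linalg import lstsq

a, b, c = 1.0, 2.0, -0.5                     # "True" parameters for this sketch
x = np.linspace(0.0, 5.0, 101)
y = a + b * np.exp(c * x)

d = 0.5 * np.diff(x)
S = np.insert(np.cumsum((y[1:] + y[:-1]) * d), 0, 0)
M1 = np.stack((x - x[0], S), axis=1)
p1 = y - y[0]
(_, c1), *_ = lstsq(M1, p1)                  # c1 is the coefficient of the S column

M2 = np.stack((np.ones_like(x), np.exp(c1 * x)), axis=1)
p2 = y
(a1, b1), *_ = lstsq(M2, p2)
print(a1, b1, c1)                            # Close to (1.0, 2.0, -0.5)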
Weibull CDF
Algorithm originally presented in 3. Regression of the Three-Parameter Weibull Cumulative Distribution Function and summarized here.
This is just a transposed and transformed version of the Exponential case.
The input x and y are first swapped, and then the new x values are transformed according to (34). In numpy terms:
x, y = np.log(-np.log(1.0 - y)), x
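A brief hypothetical sketch of the substitution: after the swap, the transformed data satisfy y = a + b exp(c x) with a = μ, b = β and c = 1/α (a relation derived here from the substitution; the paper gives the equivalent mapping), so the Exponential machinery above applies unchanged:

import numpy as np
from scipy.linalg import lstsq

alpha, beta, mu = 2.0, 1.5, 0.5              # "True" Weibull parameters for this sketch
t = np.linspace(0.7, 5.0, 201)
F = 1.0 - np.exp(-((t - mu) / beta)**alpha)  # Weibull CDF samples

x, y = np.log(-np.log(1.0 - F)), t           # Swap and transform, as above

d = 0.5 * np.diff(x)
S = np.insert(np.cumsum((y[1:] + y[:-1]) * d), 0, 0)
M1 = np.stack((x - x[0], S), axis=1)
(_, c1), *_ = lstsq(M1, y - y[0])

M2 = np.stack((np.ones_like(x), np.exp(c1 * x)), axis=1)
(a1, b1), *_ = lstsq(M2, y)
print(a1, b1, 1.0 / c1)                      # Roughly (mu, beta, alpha) for clean data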
Sinusoid with Known Frequency
Algorithm originally presented in 2. Case Where ω is Known A-Priori, in equation (42). The algorithm is not summarized in detail since it deals with a trivial case that is not particularly useful in practice.
The matrix M is a concatenation of the final coefficients in (41). Assuming that the angular frequency ω from the equations is given by the Python variable omega, we have:
t = omega * x
M = np.stack((np.ones_like(x), np.sin(t), np.cos(t)), axis=1)
p is just the vector of raw y values for this regression. In numpy terms:
p = y
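Since the model is exactly linear in the three unknowns once ω is known, a hypothetical sketch recovers them directly:

import numpy as np
from scipy.linalg import lstsq

a, b, c = 0.5, 1.0, -0.3                     # Offset, sine and cosine amplitudes
omega = 2.0 * np.pi                          # Known angular frequency
x = np.linspace(0.0, 2.0, 101)
y = a + b * np.sin(omega * x) + c * np.cos(omega * x)

t = omega * x
M = np.stack((np.ones_like(x), np.sin(t), np.cos(t)), axis=1)
params, *_ = lstsq(M, y)
print(params)                                # Close to [0.5, 1.0, -0.3]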
Integral-Only Sinusoidal Regression Method
Algorithm originally presented in 3. Linearization Through an Integral Equation, in equation (51). This equation constitutes the first step in the complete process summarized in Appendix 1: Summary of Sinusoidal Regression Algorithm and detailed in Appendix 2: Detailed Procedure for Sinusoidal Regression.
M is a matrix constructed from the second-order cumulative sum SS and powers of x:
d = 0.5 * np.diff(x)
S = np.cumsum((y[1:] + y[:-1]) * d)
S = np.insert(S, 0, 0)
SS = np.cumsum((S[1:] + S[:-1]) * d)
SS = np.insert(SS, 0, 0)
M = np.stack((SS, x**2, x, np.ones_like(x)), axis=1)
p is just the vector of raw y values for this regression. In numpy terms:
p = y
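The useful output of this regression is the frequency estimate: since y'' = -ω² (y - a), double integration makes the coefficient of the SS column equal to -ω² (a derivation made here rather than quoted from the paper). A hypothetical sketch:

import numpy as np
from scipy.linalg import lstsq

omega_e = 2.0 * np.pi                        # "True" angular frequency
x = np.linspace(0.0, 1.0, 200)
y = 0.5 + 1.0 * np.sin(omega_e * x) - 0.3 * np.cos(omega_e * x)

d = 0.5 * np.diff(x)
S = np.insert(np.cumsum((y[1:] + y[:-1]) * d), 0, 0)
SS = np.insert(np.cumsum((S[1:] + S[:-1]) * d), 0, 0)
M = np.stack((SS, x**2, x, np.ones_like(x)), axis=1)

params, *_ = lstsq(M, y)
omega1 = np.sqrt(-params[0])                 # Assumed mapping: the SS coefficient is -omega**2
print(omega1 / omega_e)                      # Close to 1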
Further Optimization of Sinusoidal Phase and Frequency
The additional optimization step is presented in 5. Further Optimizations Based on Estimates of a and ρ, in equation (57). This equation constitutes the second step in the complete process summarized in Appendix 1: Summary of Sinusoidal Regression Algorithm and detailed in Appendix 2: Detailed Procedure for Sinusoidal Regression.
M for a purely linear regression is just x and a column of ones concatenated column-wise:
M = np.stack((x, np.ones_like(x)), axis=1)
p is less trivial, since it is the vector θ shown in (56). This computation assumes that we already have estimates for a, b, c and ω stored in the variables a1, b1, c1 and omega1 from the previous step in the algorithm:
rho1 = np.hypot(c1, b1)                          # Amplitude estimate from b1 and c1
phi1 = np.arctan2(c1, b1)                        # Phase estimate from b1 and c1
f = y - a1
dis = rho1**2 - f**2
m = (dis >= 0)                                   # Mask of points where the square root is real
Phi = np.arctan(np.divide(f, np.sqrt(dis, where=m), where=m), where=m)
np.copysign(np.pi / 2.0, f, where=~m, out=Phi)   # Pathological points are clamped to +/- pi/2
kk = np.round((omega1 * x + phi1) / np.pi)
theta = (-1)**kk * Phi + np.pi * kk
p = theta
Due to limitations in numpy, kk must be a floating-point datatype for the expression (-1)**kk to work properly. Any negative values in an integer kk would raise ValueError: Integers to negative integer powers are not allowed.
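Since θ is (approximately) ω x + φ, solving this regression directly yields refined frequency and phase estimates. A minimal sketch, continuing from the M and p above (the names omega2 and phi2 are illustrative only):

from scipy.linalg import lstsq

# Column order of M is (x, ones), so the solution is (slope, intercept).
(omega2, phi2), *_ = lstsq(M, p)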
Final Optimization of Sinusoid Once Frequency is Found
The final optimization step is presented in 6. Final Steps and Results, in equation (59). This equation constitutes the third step in the complete process summarized in Appendix 1: Summary of Sinusoidal Regression Algorithm and detailed in Appendix 2: Detailed Procedure for Sinusoidal Regression.
We assume that the optimized value of ω is stored in the variable omega2 in order to compute M:
t = omega2 * x
M = np.stack((np.ones_like(x), np.sin(t), np.cos(t)), axis=-1)
p is just the vector of raw y values:
p = y
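The three parameters recovered here are the offset and the sine and cosine amplitudes, from which an amplitude and phase can be assembled the same way as in the earlier step. A minimal sketch, continuing from the M and p above (the names a3, b3, c3, rho3 and phi3 are illustrative only):

import numpy as np
from scipy.linalg import lstsq

(a3, b3, c3), *_ = lstsq(M, p)
rho3 = np.hypot(c3, b3)                      # Amplitude, as in the earlier step
phi3 = np.arctan2(c3, b3)                    # Phase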
Computation and Plotting of Randomized Figures
Some of the figures in the paper contain too many data points to place conveniently in a table. In such cases, the translation attempts to reproduce the results with randomly generated data following the paper’s instructions. The number of points used in such cases is sufficient to allow all the results to match within acceptable thresholds.
A reference implementation of the code used to generate the various randomized figures and tables is shown below, along with brief explanations. The figures are regenerated every time the documentation is built, so results will always vary slightly.
Cumulative Distributions and Medians
In measuring the accuracy of the sinusoidal regression algorithm, a number of figures showing cumulative distributions of a ratio of angular frequencies are used for demonstration: Fig. 9, Fig. 10, reei-sin-h-plot, reei-sin-i-plot and reei-sin-k-plot. The code to simulate the fits, with a given number of points per period (n in the code) and N simulated data sets, goes something like this:
omega_e = 2.0 * np.pi
x = np.random.rand(N, n) # One simulation per row
x.sort(axis=1)
y = np.sin(omega_e * x)
omega = [fit(*k) for k in zip(x, y)]
Here fit is a function that accepts the x and y vectors and returns an estimate for ω. The exact function will differ depending on which of the figures is being reproduced. For sufficiently large N, there is a good likelihood that the values of ω will contain NaN due to pathological input data (where an intermediate square root is imaginary), so we filter the list:
ratios = np.array([f / omega_e for f in omega if not np.isnan(f)])
The loss of one or two elements for very large N will not affect the PDF and CDF much:
pdf, bins = np.histogram(ratios, bins=500, density=True)
pdf *= np.diff(bins) # Convert density into mass
bins = 0.5 * (bins[1:] + bins[:-1]) # Use bin centers
cdf = np.cumsum(pdf)
The median value of the distribution is obtained through linear interpolation:
med = np.interp(0.5, cdf, bins)
Extended Applications
Some additional common functions and modifications that can be added to the suite described in the paper are presented here.
Three-Parameter Gaussian
In the real world, data often comes out as an unnormalized Gaussian with three parameters rather than two, the third parameter being the amplitude or normalization. The normalization constant of the Gaussian PDF is chosen so that the total area under the curve is unity:

f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)
The amplitude of the PDF at the peak is therefore 1 / (σ √(2π)). An unnormalized Gaussian can be conveniently parametrized either in terms of its total area K or its peak amplitude A, in addition to μ and σ. We have the relationship:

K = A \sigma \sqrt{2 \pi}

Either is a valid choice for the third parameter. The scikit uses the amplitude A because it is easier to visualize, and it makes the computation just a little easier. We therefore wish to use the methodology presented in the original paper to find the parameters A, μ and σ which optimize the fit to some set of data points (x_k, y_k) of our function:

f(x) = A \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)
The formulation of the integral equation (3) is still applicable because the amplitude parameter is not resolvable from those equations. We are able to solve for the approximations of μ and σ as before; the multiplicative constant is absorbed without changing the form of the linear system.
Optimizing A will require an additional step. Fortunately, with μ and σ fixed, the minimization of residuals in terms of A is already a linear problem. Writing g_k = \exp\left(-\frac{(x_k - \mu)^2}{2 \sigma^2}\right) for brevity, the sum of squared errors is:

E(A) = \sum_k \left(y_k - A g_k\right)^2

Setting the partial derivative of the sum of squared errors with respect to A to zero:

\frac{\partial E}{\partial A} = -2 \sum_k g_k \left(y_k - A g_k\right) = 0

Since we are solving for only one unknown, a single equation is produced:

A = \frac{\sum_k g_k y_k}{\sum_k g_k^2}
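In numpy terms, a minimal sketch of this amplitude step, assuming that mu1 and sigma1 hold the estimates of μ and σ obtained from the integral-equation regression (the variable names are illustrative, not taken from the scikit):

g = np.exp(-0.5 * ((x - mu1) / sigma1)**2)   # Gaussian shape with the fixed mu1, sigma1
A1 = np.dot(g, y) / np.dot(g, g)             # A = sum(g * y) / sum(g * g)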
Errata
The following list of errata was compiled during the translation. All items link back to the corresponding location in the translated paper as footnotes.
[errata-reei-1]: The original list of function conventions read […]. The portion […].

[errata-reei-2]: The original equations (2) and (3) read […]. Taking the integral […].

[errata-reei-3]: The original equation (7) had […]. This is inconsistent with the explicit note at the beginning of the section, and with equations (4), (5) and (6).

[errata-reei-4]: The original expression for […] read […]. The correction is due to the error in the equations (3) and (2) [errata-reei-2]. Notice that […].

[errata-reei-5]: An extension of [errata-reei-3] to the summary section. The right-most matrix had […].

[errata-reei-6]: An extension of [errata-reei-2] to the summary section. The formula for […].

[errata-reei-7]: The original paper lists equation [9]. There is no equation [9] in the paper, and contextually, it makes sense that the reference is in fact to equation [1].

[errata-reei-8]: The equation numbers have been moved back by two. The numbering in the appendix of the original paper starts with 11; however, numbers 9 and 10 do not appear after the last numbered equation (which was (8)). There are four unlabeled equations in Appendix 1, not two.

[errata-reei-9]: The table was extracted from the original figure and added as a separate entity. A caption was added to reflect the caption of Table 1.

[errata-reei-10]: The figure is listed as 11 in the original paper. I am pretty sure that is an outright typo.

[errata-reei-11]: The original paper has the figure number listed as 1 here, but it should be 2.

[errata-reei-12]: The equation was moved to a separate line, for consistency with all the other numbered equations in the paper.

[errata-reei-13]: The version of (50) in the original paper has an unbalanced closing parenthesis after […].

[errata-reei-14]: It is a bit unclear how the […].

[errata-reei-15]: The figure shown here generates identical results to the one in the original paper, but it is not strictly correct. For each value of […]:

omega_e = 2.0 * np.pi
x = np.linspace(0.0, 1.0, p)
y = np.sin(omega_e * x)

This is not quite correct. The actual x-values should be generated with […].

[errata-reei-16]: The original figure shows the “exact” […].

[errata-reei-17]: The original list of points was written as […].