The Inverse Problem

Consider the following linear system of equations

\[{\bf A}{\rm x}={\rm y}\]

where ${\bf A}$ is a square design matrix, ${\rm x}$ is a vector of input parameters, and ${\rm y}$ is a vector of responses. To estimate unknown inputs from the responses, the matrix inverse can be used

\[{\rm x}={\bf A}^{-1}{\rm y}\]

However, if a random measurement error $\epsilon$ is superimposed on ${\rm y}$, i.e. $b_{i}=y_{i}+\epsilon_{i}$, the estimate ${\rm \hat{x}}$ from the matrix inverse

\[{\rm \hat{x}}={\bf A}^{-1}{\rm b}\]

becomes dominated by contributions from data error for large systems.
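
The amplification of data error can be seen on a small scale with any ill-conditioned matrix. The sketch below uses a hypothetical 4×4 Hilbert matrix (not the 100×100 test system introduced later) and shows a 0.1% perturbation of the response producing a far larger relative error in the estimate:

```julia
using LinearAlgebra

# Hypothetical 4×4 Hilbert matrix; ill-conditioned with cond(A) ≈ 1.6e4.
A = [1 / (i + j - 1) for i in 1:4, j in 1:4]
x = ones(4)
y = A * x

# Superimpose a 0.1% measurement error on the response vector.
b = y .+ 1e-3 .* [1, -1, 1, -1] .* y

x̂ = A \ b                          # naive inversion of the noisy data
relerr = norm(x̂ - x) / norm(x)     # relative error far exceeds the 0.1% data error
```

The condition number bounds the error amplification; for the 100×100 system below the condition number, and hence the amplification, is larger by many orders of magnitude.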

Example

Note

The example system is a test problem for regularization methods taken from MatrixDepot.jl; it is the same system used in Hansen (2000).

Consider the following example system of 100 equations. The matrix $\rm{\bf{A}}$ is 100×100.

A
100×100 Matrix{Float64}:
 4.71979e-13  4.7207e-11   7.1812e-10   …  0.000123953  3.10037e-5
 4.7207e-11   3.44407e-10  2.18592e-9      0.00027885   0.000123953
 7.1812e-10   2.18592e-9   7.39566e-9      0.000495319  0.000278567
 4.72173e-9   9.94389e-9   2.33509e-8      0.000772739  0.000494292
 1.98422e-8   3.4489e-8    6.52639e-8      0.00111017   0.000770268
 6.32328e-8   9.79521e-8   1.61791e-7   …  0.00150629   0.00110528
 1.67027e-7   2.39805e-7   3.61585e-7      0.00195933   0.00149769
 3.85205e-7   5.24189e-7   7.41078e-7      0.00246701   0.00194536
 8.01584e-7   1.04834e-6   1.41335e-6      0.00302642   0.00244557
 1.53875e-6   1.95195e-6   2.53782e-6      0.00363393   0.00299494
 ⋮                                      ⋱               
 0.00244557   0.00302642   0.00366901      1.04834e-6   8.01584e-7
 0.00194536   0.00246701   0.00305023      5.24189e-7   3.85205e-7
 0.00149769   0.00195933   0.00248243      2.39805e-7   1.67027e-7
 0.00110528   0.00150629   0.00196869      9.79521e-8   6.32328e-8
 0.000770268  0.00111017   0.00151147   …  3.4489e-8    1.98422e-8
 0.000494292  0.000772739  0.00111262      9.94389e-9   4.72173e-9
 0.000278567  0.000495319  0.000773564     2.18592e-9   7.1812e-10
 0.000123953  0.00027885   0.000495319     3.44407e-10  4.7207e-11
 3.10037e-5   0.000123953  0.000278567     4.7207e-11   4.71979e-13

The vector $\rm{x}$ of input variables has 100 elements.

x
100-element Vector{Float64}:
 0.1079137578052813
 0.12297053360334534
 0.13957600811265403
 0.15779962367244207
 0.17769967237809117
 0.19932086589081324
 0.22269187777914254
 0.2478229164774714
 0.2747033932325347
 0.3032997543151104
 ⋮
 0.43775867610162883
 0.35986269701602813
 0.2923532021212666
 0.23471831462134501
 0.18623171118659587
 0.14602510434716662
 0.11315366183414295
 0.08665170515974066
 0.06557729627915371

Computing estimates of $\rm{x}$ from both $\rm{y}$ and $\rm{b}$ using the pseudo-inverse pinv shows that the error in $\rm{b}$ makes the inversion unusable.

random(n) = @> readdlm("random.txt") vec x -> x[1:n] # hide


y = A * x
b = y + 0.1y .* random(100)
x = pinv(A) * y
x̂ = pinv(A) * b
(Figure: estimates plotted against index n. The noisy estimate x̂ = pinv(A)*b oscillates between roughly ±3×10¹¹, while the noise-free recovery x = pinv(A)*y and the response vectors y and b remain of order unity.)

Tikhonov Regularization

Tikhonov regularization is a means to filter this noise by solving the minimization problem

\[{\rm {\rm x_{\lambda}}}=\arg\min\left\{ \left\lVert {\bf {\rm {\bf A}{\rm x}-{\rm b}}}\right\rVert _{2}^{2}+\lambda^{2}\left\lVert {\rm {\bf L}({\rm x}-{\rm x_{0}})}\right\rVert _{2}^{2}\right\} \]

where ${\rm x_{\lambda}}$ is the regularized estimate of ${\rm x}$, $\left\lVert \cdot\right\rVert _{2}$ is the Euclidean norm, ${\rm {\bf L}}$ is the Tikhonov filter matrix, $\lambda$ is the regularization parameter, and ${\rm x_{0}}$ is a vector of an a priori guess of the solution. The initial guess can be taken to be ${\rm x_{0}}=0$ if no a priori information is known. The matrix ${\rm {\bf A}}$ does not need to be square.

For $\lambda=0$ the Tikhonov problem reverts to the ordinary least squares solution. If ${\rm {\bf A}}$ is square and $\lambda=0$, the least-squares solution is ${\rm \hat{x}}={\rm {\bf A}}^{-1}{\rm b}$. For large $\lambda$ the solution reverts to the initial guess, i.e. $\lim_{\lambda\rightarrow\infty}{\rm x_{\lambda}}={\rm x_{0}}$. Therefore, the regularization parameter $\lambda$ interpolates between the initial guess and the noisy ordinary least squares solution. The filter matrix ${\rm {\bf L}}$ imposes additional smoothness constraints on the solution. The simplest choice is the identity matrix, ${\rm {\bf L}}={\rm {\bf I}}$.
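
The two limits are easy to verify numerically. A minimal sketch with a hypothetical 2×2 system, ${\rm {\bf L}}={\rm {\bf I}}$, and ${\rm x_{0}}=0$:

```julia
using LinearAlgebra

# Hypothetical well-conditioned 2×2 system to check the limiting behavior.
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
xλ(λ) = (A'A + λ^2 * I) \ (A' * b)   # Tikhonov solution with L = I, x₀ = 0

xλ(0.0) ≈ A \ b        # λ = 0 reproduces the ordinary inverse solution
norm(xλ(1e6)) < 1e-9   # large λ pulls the solution toward x₀ = 0
```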

The formal solution to the Tikhonov problem is given by

\[{\rm x_{\lambda}}=\left({\rm {\bf A}^{T}}{\rm {\bf A}}+\lambda^{2}{\rm {\bf L}^{T}{\rm {\bf L}}}\right)^{-1}\left({\rm {\bf A}^{T}}{\rm b}+\lambda^{2}{\rm {\bf L}^{T}{\rm {\bf L}}{\rm x_{0}}}\right)\]

The equation is readily derived by writing $f=\left\lVert {\bf {\rm {\bf A}{\rm x}-{\rm b}}}\right\rVert _{2}^{2}+\lambda^{2}\left\lVert {\rm {\bf L}({\rm x}-{\rm x_{0}})}\right\rVert _{2}^{2}$, taking $\frac{df}{d{\rm x}}=0$, and solving for ${\rm x}$. Use http://www.matrixcalculus.org/ to validate symbolic matrix derivatives.
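
Carrying out the intermediate step, the gradient of $f$ with respect to ${\rm x}$ is

\[\frac{df}{d{\rm x}}=2{\rm {\bf A}^{T}}\left({\rm {\bf A}}{\rm x}-{\rm b}\right)+2\lambda^{2}{\rm {\bf L}^{T}}{\rm {\bf L}}\left({\rm x}-{\rm x_{0}}\right)=0\]

Rearranging gives $\left({\rm {\bf A}^{T}}{\rm {\bf A}}+\lambda^{2}{\rm {\bf L}^{T}{\bf L}}\right){\rm x}={\rm {\bf A}^{T}}{\rm b}+\lambda^{2}{\rm {\bf L}^{T}{\bf L}}{\rm x_{0}}$, which inverts to the formal solution.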

Example

Here is a simple regularized inversion for the same system using ${\rm {\bf L}}={\rm {\bf I}}$ and ${\rm x_{0}}=0$. The regularized solution is

\[{\rm x_{\lambda}}=\left({\rm {\bf A}^{T}}{\rm {\bf A}}+\lambda^{2}{\rm {\bf I}}\right)^{-1}{\rm {\bf A}^{T}}{\rm b}\]

The regularized inverse can be computed directly for an assumed value of $\lambda = 0.11$.

random(n) = @> readdlm("random.txt") vec x -> x[1:n] #hide

y = A * x
b = y + 0.1y .* random(100)
Iₙ = Matrix{Float64}(I, 100, 100)
λ = 0.11
xλ = inv(A'A + λ^2.0 * Iₙ) * A' * b
(Figure: the regularized solution xλ plotted with the true x against index n, and the response vectors y and b in a second panel.)

The solution is not perfect, but it suppresses the random error and is a reasonable approximation of the true $\rm{x}$.

Optimal Regularization Parameter

The choice of the optimal regularization parameter is not obvious. If $\lambda$ is too small, the solution is dominated by noise. If $\lambda$ is too large, the solution is over-smoothed and no longer approximates the correct solution.

random(n) = @> readdlm("random.txt") vec x -> x[1:n] # hide


y = A * x
b = y + 0.1y .* random(100)
Iₙ = Matrix{Float64}(I, 100, 100)
f(λ) = inv(A'A + λ^2.0 * Iₙ) * A' * b
xλ1 = f(0.01)
xλ2 = f(0.1)
xλ3 = f(10.0)
(Figure: the true x and the regularized solutions xλ1, xλ2, xλ3 plotted against index n; the under-regularized xλ1 oscillates with amplitude of roughly ±18. A second panel shows the response vectors y and b.)

$\lambda = 0.1$ provides an acceptable solution. $\lambda = 0.01$ is noisy (under-regularized) and $\lambda = 10.0$ is incorrect (over-regularized). There are several objective methods to find the optimal regularization parameter. The general procedure to identify the optimal $\lambda$ is to compute ${\rm x_{\lambda}}$ for a range of regularization parameters over the interval [$\lambda_1$, $\lambda_2$] and then apply some evaluation criterion that objectively evaluates the quality of the solution. This package implements two of these, the L-curve method and generalized cross validation.

L-Curve Method

The L-curve method balances the size of the residual norm $L_{1}=\left\lVert {\bf A}{\rm x_{\lambda}}-{\rm b}\right\rVert _{2}$ against the size of the solution norm $L_{2}=\left\lVert {\rm {\bf L}}({\rm x_{\lambda}}-{\rm x_{0}})\right\rVert _{2}$ for $\lambda\in[\lambda_{1},\lambda_{2}]$. The L-curve is a plot of $\log L_{2}$ vs. $\log L_{1}$. The following example illustrates the L-curve without specifying an a priori input.

random(n) = @> readdlm("random.txt") vec x -> x[1:n] # hide


y = A * x
b = y + 0.1y .* random(100)
Iₙ = Matrix{Float64}(I, 100, 100)
f(λ) = inv(A'A + λ^2.0 * Iₙ) * A' * b
L1(λ) = norm(A * f(λ) - b)
L2(λ) = norm(Iₙ * f(λ))
λs = exp10.(range(log10(1e-5), stop = log10(10), length = 100))
residual, solution = L1.(λs), L2.(λs)
(Figure: the L-curve; solution norm ‖𝐋*(x-x0)‖ vs. residual norm ‖𝐀*x-b‖ on logarithmic axes, with markers at λ = 1e-04, 1e-03, 1e-02, 1e-01, 1e+00, 1e+01.)

The optimal $\lambda_{opt}$ is at the corner of the L-curve. In this example $\lambda_{opt} \approx 0.1$, which yielded the acceptable solution earlier. Finding the corner can be automated by searching for the maximum of the curvature of the L-curve (Hansen, 2000). The implementation is discussed in the L-Curve Algorithm section.

The solve function in RegularizationTools can be used to find λopt through the L-curve algorithm, searching over the predefined interval [$\lambda_1$, $\lambda_2$].

random(n) = @> readdlm("random.txt") vec x -> x[1:n] # hide
using RegularizationTools

y = A * x
b = y + 0.1y .* random(100)
solution = @> setupRegularizationProblem(A, 0) solve(b, alg = :L_curve, λ₁ = 0.01, λ₂ = 1.0)
λopt = solution.λ
0.20119675940652682

Generalized Cross Validation

If the general form of the problem

\[{\rm {\rm x_{\lambda}}}=\arg\min\left\{ \left\lVert {\bf {\rm {\bf A}{\rm x}-{\rm b}}}\right\rVert _{2}^{2}+\lambda^{2}\left\lVert {\rm {\bf L}({\rm x}-{\rm x_{0}})}\right\rVert _{2}^{2}\right\}\]

has the smoothing matrix ${\rm {\bf L}}={\rm {\bf I}}$, the problem is said to be in standard form. The general-form problem can be transformed into standard form (see Transformation to Standard Form for the algorithm). If the problem is in standard form and ${\rm x_{0}}=0$, the GCV estimate of $\lambda$ is (Golub et al., 1979):

\[V(\lambda)=\frac{n\left\lVert \left({\bf I}-{\bf A}_{\lambda}\right){\rm b}\right\rVert _{2}^{2}}{\mathrm{tr}\left({\bf I}-{\bf A}_{\lambda}\right)^{2}}\]

where ${\bf A_{\lambda}}={\rm {\bf A}}\left({\bf A}^{\rm T} {\bf A}+\lambda^{2}{\rm {\bf I}}\right)^{-1}{\bf A}^{\rm T}$ is the influence matrix, $\mathrm{tr}$ is the matrix trace, and $n$ is the length of $\rm{b}$. Note that ${\rm x_{\lambda}}=\left({\rm {\bf A}^{T}}{\rm {\bf A}}+\lambda^{2}{\rm {\bf I}}\right)^{-1}{\rm {\bf A}^{T}}{\rm b}$. Therefore ${\rm {\bf A}}{\rm x_{\lambda}}={\bf A_{\lambda}}{\rm b}$ and $\left\lVert \left({\bf I}-{\bf A_{\lambda}}\right){\rm b}\right\rVert _{2}=\left\lVert {\bf A}{\rm x_{\lambda}}-{\rm b}\right\rVert _{2}$. The optimal $\lambda_{opt}$ coincides with the global minimum of $V(\lambda)$.
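
The influence-matrix identity is easy to confirm numerically. A minimal sketch with a hypothetical 5×5 matrix (${\rm {\bf L}}={\rm {\bf I}}$, ${\rm x_{0}}=0$):

```julia
using LinearAlgebra

# Hypothetical 5×5 system to verify that A*xλ equals Aλ*b.
A = [1.0 / (i + j) for i in 1:5, j in 1:5]
b = ones(5)
λ = 0.1

Aλ = A * inv(A'A + λ^2 * I) * A'     # influence matrix
xλ = (A'A + λ^2 * I) \ (A' * b)      # Tikhonov solution

A * xλ ≈ Aλ * b                      # identity holds, so the residual norms coincide
```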

The following example evaluates $V(\lambda)$ over a range of $\lambda$.

random(n) = @> readdlm("random.txt") vec x -> x[201:200+n] # hide
y = A * x
b = y + 0.1y .* random(100)
Iₙ = Matrix{Float64}(I, 100, 100)
Aλ(λ) = A*inv(A'A + λ^2.0*Iₙ)*A'
gcv(λ) = 100*norm((Iₙ - Aλ(λ))*b)^2.0/tr(Iₙ - Aλ(λ))^2.0
λs = exp10.(range(log10(1e-3), stop = log10(1), length = 100))
V = map(gcv, λs)
(Figure: the GCV function V(λ) plotted against λ on a logarithmic axis from 0.001 to 1.)

The GCV curve has a steep part for large $\lambda$ and a shallow part for small $\lambda$. The minimum occurs near $\lambda = 0.1$. The solve function in RegularizationTools can be used to find λopt through the GCV approach, searching over the predefined interval [$\lambda_1$, $\lambda_2$].

random(n) = @> readdlm("random.txt") vec x -> x[201:200+n] # hide

using RegularizationTools

y = A * x
b = y + 0.1y .* random(100)
solution = @> setupRegularizationProblem(A, 0) solve(b, alg = :gcv_svd, λ₁ = 0.01, λ₂ = 1.0)
λopt = solution.λ
0.10413405249371731

Note that the objective estimates of λopt from the L-curve and GCV criteria are close.

Transformation to Standard Form

The general form, couched in terms of $\{\rm\bf{A}, \rm b, \rm x, \rm x_0\}$ and $\rm{\bf{L}}$,

\[{\rm {\rm x_{\lambda}}}=\arg\min\left\{ \left\lVert {\bf {\rm {\bf A}{\rm x}-{\rm b}}}\right\rVert _{2}^{2}+\lambda^{2}\left\lVert {\rm {\bf L}({\rm x}-{\rm x_{0}})}\right\rVert _{2}^{2}\right\}\]

is transformed to standard form couched in terms of $\{\rm \bf {\bar A}, \rm \bar b, \rm \bar x, \rm \bar{x}_0\}$ and $\rm{\bf{I}}$

\[{\rm \bar{x}_{\lambda}}=\arg\min\left\{ \left\lVert {\bf \bar{A}}{\rm \bar{x}}-{\rm \bar{b}}\right\rVert _{2}^{2}+\lambda^{2}\left\lVert {\rm {\bf I}}({\rm \bar{x}}-{\rm \bar{x}_{0}})\right\rVert _{2}^{2}\right\}\]

as introduced by Eldén (1977) using notation from Hansen (1998, Chapter 2.3.1). The algorithm computes the explicit transformation using two QR factorizations. The matrices needed for the explicit conversion depend on $\rm \bf A$ and $\rm \bf L$ and are computed and cached in setupRegularizationProblem.
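
For the special case of a square, invertible ${\rm {\bf L}}$, the transformation reduces to ${\rm {\bf \bar A}}={\rm {\bf A}}{\rm {\bf L}}^{-1}$ with ${\rm \bar x}={\rm {\bf L}}{\rm x}$; the general rank-deficient case requires the two QR factorizations described above. A minimal sketch of this special case with a hypothetical 2×2 system:

```julia
using LinearAlgebra

# Special case: L square and invertible, so Ā = A*inv(L) and x̄ = L*x.
A = [2.0 1.0; 1.0 3.0]
L = [1.0 -1.0; 0.0 1.0]                    # invertible smoothing matrix
b = [1.0, 2.0]
λ = 0.5

Ā = A / L                                   # A * inv(L)
x̄λ = inv(Ā'Ā + λ^2 * I) * (Ā' * b)         # solve in standard form (x̄₀ = 0)
xλ = L \ x̄λ                                 # transform back to the general form

xdirect = inv(A'A + λ^2 * L'L) * (A' * b)   # direct general-form solution
xλ ≈ xdirect                                # the two solutions agree
```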

The λopt search is performed in standard form, and the solution is computed in standard form. The resulting solution is then transformed back to the general form using the same cached matrices.

Solving the Standard Equations

The solution ${\rm \bar x_{\lambda}}$ for the transformed standard equation is

\[{\rm \bar x_{\lambda}}=\left({\rm {\bf \bar A}^{T}}{\rm {\bf \bar A}}+\lambda^{2}{\rm {\bf I}}\right)^{-1} \left( {\rm {\bf \bar A}^{T}}{\rm \bar b} + λ^2 \rm \bar {x}_0 \right)\]

${\rm \bar x_{\lambda}}$ is found using the Cholesky decomposition. This is the algorithm used in the MultivariateStats package and is generally faster than the QR approach (Lira et al., 2016). In case the Cholesky factorization fails, the default Julia equation solver is used.
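
The solve step can be sketched as follows; `solve_standard` and its arguments are illustrative names, not the package API:

```julia
using LinearAlgebra

# Sketch: solve the standard-form normal equations with a Cholesky
# factorization, falling back to the generic solver if it fails.
function solve_standard(Ā, b̄, λ, x̄₀ = zeros(size(Ā, 2)))
    M = Ā'Ā + λ^2 * I
    rhs = Ā' * b̄ + λ^2 * x̄₀
    x̄λ = try
        cholesky(Hermitian(M)) \ rhs   # fast path: M is symmetric positive definite
    catch
        M \ rhs                        # fallback: default Julia solver
    end
    return x̄λ
end
```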

L-Curve Algorithm

The curvature $\kappa(\lambda)$ of the L-curve is computed using Eq. (18) in Hansen (2000). The expression requires calculation of the solution and residual norms, as well as the first derivative of the solution norm. The derivative is calculated using finite differences from the Calculus package. The corner of the L-curve occurs where the curvature is maximized. Finally, $λ_{opt}$ is found by minimizing $-\kappa(\lambda)$ (see L-Curve Method) on a bounded interval using Brent's method, as implemented in the Optim package.
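
The idea can be illustrated with a finite-difference sketch of the curvature maximization; the package itself uses the closed-form expression and the Calculus/Optim packages, and the matrix and names below are hypothetical:

```julia
using LinearAlgebra

# Sketch: curvature of the L-curve parameterized by t = log10(λ); plane-curve
# curvature is invariant under reparameterization, so the corner is unchanged.
A = [1.0 / (i + j - 1) for i in 1:8, j in 1:8]   # hypothetical ill-conditioned matrix
x = ones(8)
b = A * x .+ 1e-4 .* sin.(1:8)                   # perturbed response

xλ(t) = (A'A + (10.0^t)^2 * I) \ (A' * b)
ξ(t) = log(norm(A * xλ(t) - b))                  # log residual norm
η(t) = log(norm(xλ(t)))                          # log solution norm

function κ(t; h = 0.01)
    ξ′ = (ξ(t + h) - ξ(t - h)) / 2h              # central finite differences
    η′ = (η(t + h) - η(t - h)) / 2h
    ξ″ = (ξ(t + h) - 2ξ(t) + ξ(t - h)) / h^2
    η″ = (η(t + h) - 2η(t) + η(t - h)) / h^2
    (ξ′ * η″ - ξ″ * η′) / (ξ′^2 + η′^2)^(3 / 2)
end

ts = range(-6, stop = 0, length = 121)
λopt = 10.0^ts[argmax(κ.(ts))]                   # corner = maximum curvature
```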

Generalized Cross Validation Algorithm

In standard form, the GCV function is

\[V(\lambda)=\frac{n\left\lVert \left({\bf I}-{\bf \bar{A}}_{\lambda}\right){\rm \bar{b}}\right\rVert _{2}^{2}}{\mathrm{tr}\left({\bf I}-{\bf \bar{A}}_{\lambda}\right)^{2}}\]

The slowest part of the GCV calculation is the evaluation of the trace. The GCV estimate is computed either using the singular value decomposition (SVD) algorithm (Golub et al., 1979) (gcv_svd) or the explicit calculation of the trace term (gcv_tr). The SVD of the design matrix in standard form $\rm \bf{\bar A}$ is calculated and cached in setupRegularizationProblem. When an initial guess is included, the denominator is computed using the SVD estimate, while the numerator is computed via $\left\lVert {\bf \bar A}{\rm \bar x_{\lambda}}-{\rm \bar b}\right\rVert _{2}^2$, where $\rm \bar x_\lambda$ is obtained using the Cholesky decomposition algorithm solving for the Tikhonov solution in standard form with an initial guess. Note that this is an approximation, because the trace term in the denominator does not account for the initial guess. Comparison with the L-curve method suggests that this approximation does not affect the quality of the regularized solution. Finally, $λ_{opt}$ is found by minimizing $V(\lambda)$ on a bounded interval using Brent's method, as implemented in the Optim package.
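
For a square matrix with SVD ${\bf A}={\bf U}{\bf \Sigma}{\bf V}^{\rm T}$, both the numerator and the trace reduce to sums over the singular values, so each evaluation of $V(\lambda)$ costs $O(n)$ once the SVD is cached. A sketch of this reduction; the 6×6 matrix is hypothetical and `gcv_svd` below is a local illustration, not the package function:

```julia
using LinearAlgebra

# Hypothetical 6×6 system; evaluate the GCV function from a cached SVD.
A = [1.0 / (i + j) for i in 1:6, j in 1:6]
b = ones(6)
n = size(A, 1)
F = svd(A)          # computed once and reused for every λ
β = F.U' * b

function gcv_svd(λ)
    f = F.S .^ 2 ./ (F.S .^ 2 .+ λ^2)      # Tikhonov filter factors
    num = n * sum(((1 .- f) .* β) .^ 2)    # n‖(I - Aλ)b‖²
    den = (n - sum(f))^2                   # tr(I - Aλ)²
    num / den
end
```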