Modern Optimization Methods (in English) > Chapter 4. Trust Region Methods > 4.5 Convergence
Hi, everyone.
Welcome to modern optimization methods.
In this part,
we will introduce the last part of chapter four,
the convergence of the trust region method.
This part consists of three sections.
First,
we will discuss the reduction obtained by the Cauchy point.
Then we will discuss the global convergence.
Finally, we will discuss some extensions.
Now let's first take a look at the Cauchy point.
So recall that in previous parts,
we have derived the Cauchy point.
So it's basically
a kind of steepest descent method
with a special step length.
And we have the following result:
the Cauchy point satisfies the condition below.
If we take a look at the condition,
it basically says that the decrease in the model function
provided by the Cauchy point
is bounded below by the right-hand side of the inequality.
The right-hand side involves the norm of the gradient,
as well as a minimum of two terms.
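Although the slide itself is not reproduced here, the standard form of this Cauchy-point bound from the trust region literature (with $m_k$ the model function, $g_k$ the gradient, $B_k$ the model Hessian, and $\Delta_k$ the radius) reads:

```latex
m_k(0) - m_k(p_k^C) \;\ge\; \frac{1}{2}\,\|g_k\|\,
\min\!\left(\Delta_k,\ \frac{\|g_k\|}{\|B_k\|}\right)
```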
So why is this Cauchy reduction important?
This result sets up a bridge
between the trust region method
and its global convergence.
We will show this result on the next page.
So with the sufficient reduction by the Cauchy point,
we can prove the following theorem.
Any point pk
obtained by approximately solving the subproblem,
provided it satisfies the following condition,
namely that its model reduction is
at least a fixed proportion of the Cauchy reduction,
also achieves a sufficient reduction.
That is, the reduction provided by this pk
is also bounded below
by the term here.
The term is similar
to the lower bound for the Cauchy reduction,
differing only in the constant c1.
It means that
once we have set up the bridge
of the Cauchy reduction,
any trust region iteration
that achieves at least a fraction of the Cauchy reduction
inherits the same kind of lower bound.
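In symbols, the condition and the conclusion just described take the following standard form (with a constant $c_1 \in (0,1]$, notation as before):

```latex
m_k(0) - m_k(p_k) \;\ge\; c_1\,\bigl(m_k(0) - m_k(p_k^C)\bigr)
\quad\Longrightarrow\quad
m_k(0) - m_k(p_k) \;\ge\; \frac{c_1}{2}\,\|g_k\|\,
\min\!\left(\Delta_k,\ \frac{\|g_k\|}{\|B_k\|}\right)
```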
The proof is actually very simple.
By assumption, the model reduction of the current step pk
is at least a proportion c1 of the reduction at the Cauchy point,
and the Cauchy point itself provides the reduction bounded as above.
Chaining the first and the last parts of this inequality
gives the final result as required.
So the Cauchy point is a bridge,
playing a role similar to the line search setting:
there, the Zoutendijk theorem is the
bridge between the line search
strategy and the convergence result.
Now we have some remarks
about the dogleg method
and the two-dimensional subspace minimization method.
As we may recall,
the dogleg method
and the two-dimensional subspace minimization method
always provide a model reduction
at least as good as the Cauchy point.
Therefore,
they also satisfy the
lower bound mentioned on the previous page,
and thus enjoy
the same sufficient reduction property.
So for these two ways of solving the subproblem,
the required model reduction is always satisfied.
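As a concrete illustration, here is a minimal sketch, not taken from the lecture slides, of how the Cauchy point can be computed for a given gradient `g`, model Hessian `B`, and radius `delta`; the function name and interface are my own.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point of the trust-region subproblem
        min_p  g @ p + 0.5 * p @ B @ p   s.t.  ||p|| <= delta.
    It minimizes the model along the steepest descent direction -g,
    restricted to the trust region."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        # Model is non-convex along -g: step all the way to the boundary.
        tau = 1.0
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g
```

For any `g`, `B`, `delta`, the returned step stays inside the trust region and achieves the sufficient model reduction discussed above.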
Now let's present the first global convergence
theorem for trust region method.
In this part,
we consider the case that η is zero.
Recall in trust region algorithms,
we have a parameter η.
This η is chosen
in the interval from zero (inclusive)
up to 1/4 (exclusive).
In this case,
we only consider η equal to zero.
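To recall the role of η in the trust region algorithm, here is a hedged sketch (the function name is mine, not from the slides): a step is accepted when the ratio of the actual reduction to the reduction predicted by the model exceeds η.

```python
def accept_step(f_old, f_new, model_reduction, eta=0.0):
    """Trust-region acceptance test: rho_k is the ratio of the actual
    reduction f_old - f_new to the reduction predicted by the model.
    The step is accepted when rho_k exceeds eta, with eta in [0, 1/4)."""
    rho = (f_old - f_new) / model_reduction
    return rho > eta
```

With η = 0, as in this theorem, any step that decreases the objective at all is accepted.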
And we have the following result.
The result, stated
at the bottom of the theorem,
basically says that
the iterates xk
have a weak global convergence property.
Notice that
the statement is a lim inf:
it means that
we can find a subsequence
of the gradient norms
whose limit is zero.
So it is a weak
global convergence result.
Now let's take a look at
the settings as well as the assumptions.
The settings are given above.
We assume that in the model function,
the Hessian approximation Bk
is uniformly bounded:
the norm of Bk
is bounded above by a constant β.
Meanwhile,
we assume that
the function f is bounded below on the level set.
This is reasonable because we are looking for
the minimum of this function,
so it should be bounded below.
We also assume that the function f
is Lipschitz continuously differentiable
in the neighborhood S(R0).
This S(R0) consists of points
within distance R0 of some y in the level set,
and the gradient of f
is Lipschitz continuous
in this small local area.
So it is a weaker form of Lipschitz continuity,
namely local Lipschitz continuity.
Further,
we assume that
all approximate solutions of the subproblem
satisfy the inequalities.
The inequalities from the previous page say that
the solution of the
trust region subproblem provides
a sufficient reduction in the model function.
In addition,
we assume that the length of the step pk
is bounded above by a constant multiple
of the trust region radius.
Then we have the weak convergence result.
So it seems that we need quite a few assumptions and settings.
However,
all of them are necessary
to get the weak global convergence result.
We will omit the proof of this theorem.
And we go on to discuss the second case
of the global convergence.
The second case of the global convergence
is actually the case that η
lies strictly between zero and 1/4.
For this case,
we actually need another way
to prove the result.
So we put it in a separate theorem.
The settings and assumptions
are the same as before.
And also the result
here is the same as before.
We still get weak global convergence result.
In a word,
for the trust region method,
under certain conditions,
we have a global convergence result.
However,
it is only weak global convergence,
because only a subsequence
of the gradient norms converges to zero.
Now,
in terms of the local convergence result,
as we may see,
locally speaking,
the trust region method
is very similar to Newton's method
or quasi-Newton methods,
because in the local sense,
the trust region method also builds a quadratic
approximation of the problem,
and Newton's method and quasi-Newton methods
also build quadratic approximations in a local area.
So from this point of view,
the behavior
of the trust region method
in the local sense should be similar
to that of Newton-type methods.
Actually,
theoretically speaking,
we can indeed prove this asymptotically
Newton-like behavior.
The settings
are summarized on this page.
We have some assumptions on f:
f is twice continuously differentiable
in a neighborhood of the optimal solution.
At the optimal solution,
we require that
the second-order sufficient conditions are satisfied,
which means the gradient is zero
and the Hessian matrix is positive definite.
We also assume that
the sequence converges to x*.
This is a very standard assumption as well.
For all k sufficiently large,
the trust region method
solves the subproblem
with Bk equal to the exact Hessian matrix,
and we choose the step pk
such that
the Cauchy-point
sufficient reduction in the model function
is satisfied.
In this case,
we also require that
the steps are
asymptotically similar to the Newton step.
In mathematical terms,
we assume that
the current step pk
behaves very similarly to the Newton direction:
the difference between the two
is of higher order than the Newton direction.
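In symbols, this asymptotic similarity condition is usually written as follows (with $p_k^N$ the Newton step):

```latex
p_k^N = -\nabla^2 f(x_k)^{-1} \nabla f(x_k),
\qquad
\|p_k - p_k^N\| = o\bigl(\|p_k^N\|\bigr)
```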
After that,
under all these settings,
we can present the local convergence result
for trust region methods.
Now with the settings above,
we can show that
the trust region bound Δk eventually becomes inactive.
So what does inactive mean?
It means that
as the trust region iteration goes on,
the trust region radius Δk
eventually stops constraining the step;
in other words,
the subproblem reduces to an unconstrained problem.
In this case,
xk converges superlinearly to x*.
Because locally speaking,
it indeed behaves
similarly to Newton's method.
So this is the local convergence
behavior of the trust region method.
We need more assumptions in this theorem;
however,
these assumptions are indeed satisfied in realistic situations.
Now let's talk about some extensions.
The first extension is the scaling.
Scaling means that in some situations,
we need to rescale the variables
in order to make the problem well conditioned.
Notice that in the trust region method,
we implicitly assume that
the variables behave similarly in each component.
However, if some components
vary over a large range
while other components
vary only a little,
then
we need the scaling technique.
So how do we do the scaling?
Basically,
we introduce a
diagonal matrix D
into the trust region constraint.
Originally we
restrict the step length p to be smaller than
or equal to Δ.
Now we put the diagonal matrix D
before p,
meaning that we require the rescaled length of p
to lie within this region.
With this rescaling,
the problem is better conditioned.
With this diagonal matrix,
if we take a look at the model function
and the subproblem again,
we get the extra diagonal matrix applied to p there.
When f is highly sensitive to the values of some components
and not so sensitive to others,
we can introduce such a diagonal matrix D
to represent the sensitivity of the components.
The diagonal matrix
can be derived from the Hessian matrix of the problem.
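To make the scaling concrete, the following sketch (names and interface are my own, assuming D is diagonal with positive entries) computes a Cauchy-type point under the scaled constraint ‖Dp‖ ≤ Δ via the change of variables q = Dp:

```python
import numpy as np

def scaled_cauchy_point(g, B, D, delta):
    """Cauchy-type point under the scaled constraint ||D p|| <= delta,
    where D is diagonal with positive entries. The change of variables
    q = D p turns the subproblem into an ordinary one with
    g_hat = D^{-1} g and B_hat = D^{-1} B D^{-1}."""
    Dinv = np.diag(1.0 / np.diag(D))
    g_hat = Dinv @ g
    B_hat = Dinv @ B @ Dinv
    gnorm = np.linalg.norm(g_hat)
    gBg = g_hat @ B_hat @ g_hat
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    q = -tau * (delta / gnorm) * g_hat   # ordinary Cauchy step in q
    return Dinv @ q                      # back to the original variables
```

With D equal to the identity this reduces to the ordinary Cauchy point.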
Introducing the diagonal matrix
poses some new problems.
For example,
for the Cauchy point calculation,
with the scaled restriction
we need to derive the Cauchy point again,
and we can actually obtain results similar
to those for the ordinary Cauchy point.
We call this
the generalized Cauchy point calculation.
In other words,
the model function is the same;
however,
the constraint changes a little bit,
so we get different formulae.
We present the resulting formulae here,
but we omit the details of the derivation.
Now,
the last extension is that
we can also define the trust region using other norms.
For example,
we naturally use the Euclidean norm.
However,
in practice,
when needed, you can choose the L1 norm
or the L∞ norm
to measure the length of the vector.
In that sense,
it is called the trust region method in the L1 norm
or the trust region method in the L∞ norm.
We can also use some extensions,
for example,
the scaled L1 norm or the scaled L∞ norm.
All of them are acceptable.
For different practical situations,
we may choose different extensions
based on the application and the algorithm.
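A tiny numerical illustration (the values are chosen by me for illustration only) of how the same candidate step can lie inside one norm's trust region but outside another's:

```python
import numpy as np

# A candidate step and radius (illustrative values only).
p = np.array([0.3, -0.35])
delta = 0.5

in_l2 = np.linalg.norm(p, 2) <= delta         # ||p||_2 ~ 0.46 -> inside
in_l1 = np.linalg.norm(p, 1) <= delta         # ||p||_1 = 0.65 -> outside
in_linf = np.linalg.norm(p, np.inf) <= delta  # ||p||_inf = 0.35 -> inside
```

So the choice of norm changes which steps the subproblem is allowed to return, even for the same radius Δ.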
Now let's make a brief summary about this part.
In this part,
we actually introduce the global convergence
of trust region method,
as well as its convergence rate.
We did not provide detailed proofs of the theorems here;
I leave it to you as homework to derive the results.
Moreover,
we discussed some extensions,
including scaling and other norms for trust region methods.
Now let's make a brief summary about chapter four.
Chapter four mainly discusses the trust region methods,
including the main idea,
the algorithms,
as well as how we solve the subproblem.
Finally,
we also consider the convergence performance of the methods.
We stop here and see you next time.