Current course topic: Modern Optimization Methods (English) > Chapter 2. Fundamentals of Unconstrained Optimization > 2.5 Search Direction Ⅱ
Hello, everyone.
Welcome to modern optimization methods.
In this part,
we will introduce the fifth part of chapter two,
the search directions.
This part covers three kinds of directions:
Newton's direction, quasi-Newton directions,
and conjugate gradient directions.
Now firstly, let's talk about Newton's direction.
So Newton's direction is
given by the following formula.
We write it as pkN,
where the superscript N stands for Newton.
It is obtained by taking the inverse of the Hessian matrix at xk,
multiplying it by the gradient of f at xk,
and then taking the negative of the result:
pkN = −[∇²f(xk)]⁻¹ ∇f(xk).
So this is Newton's direction.
It seems that the calculation is a bit complicated.
So let's first introduce the intuition
why we can have this formula.
The intuition comes from the Taylor expansion of f(xk + p) at xk.
We expand the function value of f to second order,
so we basically approximate the function f
by a quadratic function.
If we take p as our variable and expand,
then the first term is f(xk) as before,
the second term is the gradient term ∇f(xk)ᵀp,
and the third term is the Hessian term ½pᵀ∇²f(xk)p.
So we write the right-hand side
of this expansion as mk(p):
mk(p) = f(xk) + ∇f(xk)ᵀp + ½pᵀ∇²f(xk)p.
It is a function of the direction p that we are looking for.
So if we minimize this mk(p) over p,
then we will get Newton's direction.
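As a small illustration (my own sketch, not from the lecture slides), the code below computes the Newton direction with NumPy. For a quadratic f(x) = ½xᵀQx − bᵀx the model mk(p) is f itself, so a single unit Newton step from any starting point lands exactly on the minimizer Q⁻¹b. The data and function names here are illustrative choices.

```python
import numpy as np

# Illustrative sketch: the Newton direction p_k^N = -[Hess f(x_k)]^{-1} grad f(x_k)
# for the quadratic f(x) = 0.5 x^T Q x - b^T x. For a quadratic, the model
# m_k(p) equals f, so one unit Newton step is exact.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # positive definite Hessian
b = np.array([1.0, 2.0])

def grad(x):
    return Q @ x - b                # gradient of f at x

def hess(x):
    return Q                        # Hessian is constant for a quadratic

xk = np.array([5.0, -7.0])          # arbitrary starting point
pk = -np.linalg.solve(hess(xk), grad(xk))   # Newton direction (solve, don't invert)
x_next = xk + pk                    # unit step length

# x_next coincides with the exact minimizer Q^{-1} b
```

Note the use of `np.linalg.solve` rather than forming the inverse explicitly; solving the linear system is the standard, cheaper, and numerically safer way to apply the inverse Hessian.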
So if I draw a picture geometrically,
it will give you the following result.
So we assume that we got a function here.
This is f.
And we assume we get xk here.
Then basically Newton's direction says,
we start at xk and we approximate this function curve
by a quadratic function.
So this is the quadratic function.
And xk is here.
So instead of minimizing f(x),
we minimize this quadratic function.
Its minimizer gives us
a new point,
x(k+1), here.
So the direction from xk to x(k+1)
will be Newton's direction.
So this gives us the intuition of Newton's direction.
Geometrically, it means that we approximate the function
at the point xk by a quadratic function,
and we solve this quadratic model for its minimizer.
The direction from xk to x(k+1)
is then Newton's direction.
Now we have several remarks about Newton's direction.
As we can see, we approximate this function
by a quadratic one in a neighborhood of xk.
So if this approximation is used within a small region,
then it is very reliable.
This corresponds to the first remark:
Newton's direction is reliable when the curve of f
and the quadratic curve
approximate each other very well.
The second remark basically means that
Newton's direction is a descent direction under some conditions.
So recall that we mentioned
what a descent direction is in the previous class.
It says that a direction p is a descent direction
if its inner product with the gradient
is strictly smaller than zero, that is, pᵀ∇f(xk) < 0.
If we take the inner product of Newton's direction with the gradient,
then under some conditions,
this inner product is smaller than zero.
So what is the condition?
So we have several conditions, actually.
Firstly, the Hessian matrix must be invertible,
so that Newton's direction exists.
Secondly, the Hessian matrix must be positive definite,
which guarantees the descent property.
So these are the two assumptions that
make Newton's direction a descent direction.
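As a quick numerical check (my own example, not the lecture's), with a symmetric positive definite Hessian the Newton direction indeed makes a strictly negative inner product with the gradient:

```python
import numpy as np

# Illustrative check: when the Hessian H is symmetric positive definite,
# the Newton direction p = -H^{-1} g satisfies p^T g = -g^T H^{-1} g < 0,
# so it is a descent direction.
H = np.array([[2.0, 0.5],
              [0.5, 3.0]])    # symmetric positive definite
g = np.array([1.0, -1.0])     # a nonzero gradient

p = -np.linalg.solve(H, g)    # Newton direction
inner = p @ g                 # equals -g^T H^{-1} g, strictly negative
```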
The third remark is that in this case,
we actually take the step length to be unit step length.
And we can also modify it to a nonunit step length,
meaning that we can add a line search strategy.
We will talk about the details later in chapter three.
Now, we can show that
quadratic convergence actually holds.
So what is quadratic convergence?
We will introduce this in the last part of chapter two.
However, a disadvantage of Newton's direction
is that it may have a heavy computational cost.
Why?
Because it involves calculating the Hessian matrix.
Recall that the Hessian is an n-by-n matrix,
meaning that you have to calculate
n² elements of this matrix.
So the computational cost involved can be very heavy.
Now let's take a look at the quasi-Newton directions.
Given the fact that
Newton's direction has quite strict requirements,
for example, it requires that the inverse of the Hessian matrix exists,
and also that
the Hessian matrix is positive definite,
can we modify the method a little bit
to get around these conditions?
This is where quasi-Newton methods come in.
In quasi-Newton methods,
we do not use the Hessian matrix directly.
Instead, we use the matrix Bk
to approximate the Hessian matrix.
This gives the quasi-Newton direction
pk = −Bk⁻¹ ∇f(xk),
where we replace the Hessian matrix
by the approximating matrix Bk.
So the advantages of
quasi-Newton methods are obvious.
Firstly, they do not require the
calculation of the Hessian matrix,
because we use Bk instead.
Meanwhile,
they still keep a superlinear convergence rate,
which is fast.
Finally, you may ask: how can we obtain this Bk?
In what way
can we approximate the Hessian matrix well?
This is the key question
in quasi-Newton methods.
So the main idea has two parts.
The first is that
we approximate the Hessian matrix by Bk.
The second is how
we can get the next approximation,
that is, B(k+1), from the current Bk.
This is very important.
In this page,
we will mainly discuss
how we can derive B(k+1) based on Bk.
So actually B(k+1) has to satisfy some conditions.
The first condition is
the following secant condition:
B(k+1) sk = yk.
So what are sk and yk?
sk is the difference of two successive iterates, sk = x(k+1) − xk,
and yk is the difference of the corresponding gradients, yk = ∇f(x(k+1)) − ∇f(xk).
So the secant equation basically says that
multiplying B(k+1) by the direction sk
must give exactly yk.
This is one condition that
we can use to determine the next B(k+1).
The other condition is that B(k+1) must be symmetric.
And also we must make sure that
when we generate B(k+1) from Bk,
the computational cost is small,
which we achieve by requiring that the difference
between the two matrices is of low rank.
In other words,
we make only a small modification to Bk to get B(k+1).
Based on these rules,
we propose a class of quasi-Newton directions.
Notice that the above three conditions
do not lead to a unique B(k+1);
they lead to a class of B(k+1).
That is why we speak of quasi-Newton
methods, in the plural, rather than a single method.
So there are two most popular updating formulae,
which I list here.
The first one is called
the symmetric rank-one update, SR1 for short.
It modifies Bk by a rank-one term
to reach B(k+1):
B(k+1) = Bk + (yk − Bk sk)(yk − Bk sk)ᵀ / [(yk − Bk sk)ᵀ sk].
If we take a look at the second term here,
it is actually a rank-one matrix.
Why?
Because if we take yk − Bk sk as a vector,
then this vector multiplied by its own transpose
becomes a rank-one matrix,
and the denominator here is just a scalar.
So altogether, this update of Bk is a rank-one update.
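A small sketch of the SR1 update (the data here is my own illustrative choice), checking that the new matrix satisfies the secant equation, stays symmetric, and differs from Bk by a rank-one matrix:

```python
import numpy as np

def sr1_update(B, s, y):
    """Symmetric rank-one (SR1) update: B+ = B + r r^T / (r^T s), r = y - B s."""
    r = y - B @ s
    return B + np.outer(r, r) / (r @ s)   # assumes r^T s != 0

# Illustrative data (my own choice); with B0 = I, r^T s = -1 here, so SR1 is defined.
B0 = np.eye(3)
s = np.array([1.0, 0.0, 2.0])     # s_k = x_{k+1} - x_k
y = np.array([2.0, 1.0, 1.0])     # y_k = grad f(x_{k+1}) - grad f(x_k)

B1 = sr1_update(B0, s, y)
# B1 satisfies the secant equation B1 s = y, is symmetric,
# and B1 - B0 has rank one.
```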
Let's take a look at the
second update formula, the BFGS formula.
It was originally proposed by Broyden,
Fletcher,
Goldfarb and Shanno.
It is actually a rank-two modification:
B(k+1) = Bk − (Bk sk)(Bk sk)ᵀ / (skᵀ Bk sk) + yk ykᵀ / (ykᵀ sk).
So why is this modification rank two?
If we take a look at this formula,
we find there are two terms added after Bk.
Each term is actually a rank-one matrix,
so it is a rank-two modification.
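A matching sketch of the BFGS update (again with illustrative data of my own), verifying the secant equation and that the change to Bk has rank two:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update: B+ = B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Illustrative data (my own choice) with y^T s = 4 > 0,
# which keeps B1 positive definite when B0 is.
B0 = np.eye(3)
s = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 1.0])

B1 = bfgs_update(B0, s, y)
# B1 satisfies B1 s = y, is symmetric, and B1 - B0 has rank two.
```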
Now, let's talk about the inverse of Bk.
Recall that in the quasi-Newton direction,
Bk is the approximation of the Hessian matrix,
so to compute the direction,
we actually need the inverse of Bk.
Therefore, is it possible to
derive an update formula directly for the inverse of Bk?
The answer is yes.
We use Hk to denote the inverse of Bk.
Then we can get the
following corresponding update formulae for Hk.
So Hk is the inverse of Bk, and we can similarly
derive the two updating formulae for it.
For SR1, H(k+1) is again obtained by a rank-one modification.
And for the BFGS formula,
the update for Hk looks a little more complicated;
however, it is still a rank-two update.
So this is the update formula for quasi-Newton methods.
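As a sketch of why updating Hk = Bk⁻¹ directly works, the code below (my own illustrative check) applies the standard inverse-BFGS formula H+ = (I − ρ s yᵀ) H (I − ρ y sᵀ) + ρ s sᵀ with ρ = 1/(yᵀs), and confirms that it matches the inverse of the BFGS-updated B:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def bfgs_inverse_update(H, s, y):
    """Rank-two BFGS update applied directly to H = B^{-1}:
    H+ = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T, rho = 1/(y^T s)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Illustrative data (my own choice) with y^T s > 0
B0 = np.eye(3)
H0 = np.eye(3)                    # H0 = B0^{-1}
s = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 1.0])

H1 = bfgs_inverse_update(H0, s, y)
B1 = bfgs_update(B0, s, y)
# H1 equals B1^{-1}, so no matrix inversion is ever needed in practice.
```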
Now let's take a look at the third direction,
conjugate gradient direction.
From the name of the conjugate gradient direction,
you can see that it is related to the gradient,
and there is also some conjugacy involved.
Conjugacy will be introduced in chapter five,
so we do not introduce the notion here.
I just give the formula for the conjugate gradient direction:
pk = −∇f(xk) + βk p(k−1).
It has the form of a sum of two vectors.
The first vector is the steepest descent direction,
which is the negative gradient.
The second vector is a multiple of p(k−1),
which is the previous search direction.
So this is the update for conjugate gradient direction.
It is a kind of combination
of the previous search direction
and the current gradient direction.
What is the intuition for conjugate gradient direction?
You may ask this question.
Actually, conjugate gradient directions
were originally proposed for solving linear systems
Ax = b, where A is symmetric positive definite.
And actually, we can show that
solving this linear system is equivalent
to solving a quadratic optimization problem,
namely minimizing φ(x) = ½xᵀAx − bᵀx.
So this is the connection between solving linear system
as well as solving optimization problem.
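The following sketch (my own minimal implementation, with notation close to the lecture's) runs the linear conjugate gradient iteration on a small system Ax = b; each new direction is the current negative gradient (the residual) plus a multiple βk of the previous direction:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Linear CG for A x = b with A symmetric positive definite.
    Minimizes phi(x) = 0.5 x^T A x - b^T x; r = b - A x = -grad phi(x)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()                          # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact step length along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)  # coefficient beta_k
        p = r_new + beta * p              # new direction: -gradient + beta * previous
        r = r_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)              # solves A x = b
```

Note that the loop stores only a few vectors and never forms a matrix beyond the products with A, which is exactly the cheap-storage advantage discussed next.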
Now, what is the benefit
of the conjugate gradient directions?
As we can see from the formula,
it only involves the gradient
and the previous direction.
So the computational cost is very cheap.
This is one advantage.
On the other hand,
it needs less storage,
because it only involves vectors, not matrices.
Moreover, it is much more effective
than the steepest descent direction:
compared with the steepest descent method,
it is more efficient.
Actually, it is a very powerful direction
to solve the optimization problems.
We will demonstrate more details in chapter five.
Now let's make a brief summary about this part.
In this part,
we mainly talk about three search directions.
All of them are descent directions under some conditions.
So the first one is Newton's direction.
The second one is quasi-Newton direction.
And finally the conjugate gradient direction.
So together with the steepest descent direction,
these are the search directions
used under the line search strategy.
So in the next part,
we will introduce the trust region strategy,
which is another way to update the next iterate.
We stop here and see you next time.