Current course topic: Modern Optimization Methods (in English) > Chapter 9. Penalty and Augmented Lagrangian Methods > 9.7 Augmented Lagrangian Method for SVM: Ⅱ
Hi. Welcome to modern optimization methods.
Today we will introduce the last part of chapter nine,
applying augmented Lagrangian method
to solve L1-Loss SVM.
Now recall that
we are considering the L1-Loss SVM.
We try to minimize
the squared norm of ω
plus a nonsmooth function p(s).
Meanwhile,
we have some equality constraints here.
To solve this problem,
we apply the augmented Lagrangian method.
For the augmented Lagrangian method,
recall that the main focus
in the method is how to solve the subproblem.
We listed three key points
when solving the subproblem.
In the previous part,
we talked about the key points
in solving this subproblem.
Next, we will address the key points one
by one to show how we tackle them.
Now let's take a look at the first one,
how to calculate the proximal mapping.
Recall that the function p(s)
takes the following form:
it is basically a sum of
max functions.
So the proximal mapping
is the optimal solution
of this minimization problem.
This minimization is
called the Moreau-Yosida regularization:
given a vector z,
we minimize
the squared distance between z and s,
plus the function p(s).
For this problem,
in order to get the optimal solution,
we can decompose this
m-dimensional problem
into m one-dimensional problems.
Each of them is a quadratic
minimization problem in one variable.
Then, by discussing
the different cases,
we can get the different optimal solution values.
This leads to the proximal mapping
formula for this function.
With that, we have tackled the first key point.
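To make the case analysis concrete, here is a minimal sketch of the componentwise proximal mapping, assuming for illustration that each term of p is t ↦ c·max(t, 0) with a single constant c (the function name and the constant are our own illustrative choices, not taken from the slides):

```python
import numpy as np

def prox_max(z, c):
    """Componentwise proximal mapping of p(s) = c * sum(max(s_i, 0)).

    Each coordinate solves the one-dimensional quadratic problem
        min_s  0.5 * (s - z_i)^2 + c * max(s, 0),
    whose solution splits into three cases:
        z_i >  c        : s = z_i - c   (shifted down by c)
        0 <= z_i <= c   : s = 0         (thresholded to zero)
        z_i <  0        : s = z_i       (unchanged)
    """
    z = np.asarray(z, dtype=float)
    return np.where(z > c, z - c, np.where(z < 0.0, z, 0.0))
```

Note how this mirrors the decomposition on the slide: the m-dimensional problem is solved coordinate by coordinate, each in closed form.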
Now let's take a look at the subdifferential
of this proximal mapping.
This is in preparation for deriving the generalized Jacobian.
The subdifferential of
this proximal mapping is some matrix,
namely a diagonal matrix.
The diagonal element μi
is a sort of generalized
gradient of the max function.
You can see that
if zi is greater than some constant,
then μi is one;
if zi is between zero and this constant,
then μi is zero;
and if zi is smaller than zero,
it is again one.
When zi reaches a boundary point,
say zi equals zero or the constant,
then the subdifferential
is an interval, and μi can take any value in it.
We demonstrate the figure
of the proximal mapping on the left-hand side.
Compared with the classical l1-norm
proximal mapping,
it is a kind of shifted version
of the l1 mapping.
So this is the demonstration
of the proximal mapping,
as well as how we derive the subdifferential
of the proximal mapping.
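The diagonal elements μi just described can be computed with a few lines of code; this is a sketch under the same illustrative assumption as before (terms of the form c·max(t, 0)):

```python
import numpy as np

def prox_jacobian_diag(z, c):
    """Diagonal of one generalized Jacobian element of the proximal
    mapping of c * sum(max(s_i, 0)) at the point z.

    mu_i = 1 where the prox is locally the identity or a shift
    (z_i < 0 or z_i > c), and mu_i = 0 on the flat piece (0 < z_i < c).
    At the kink points z_i = 0 and z_i = c the subdifferential is the
    whole interval [0, 1]; here we pick mu_i = 1, which puts as many
    ones on the diagonal as possible (useful later for the cheap V*h
    product).
    """
    z = np.asarray(z, dtype=float)
    return np.where((z <= 0.0) | (z >= c), 1.0, 0.0)
```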
Now let's talk about
how we can get the generalized
Jacobian of the gradient φ.
As we mentioned in chapter 6,
the generalized Jacobian
is difficult to obtain
for very complicated functions.
However, by the chain rule,
we can identify
a larger set containing the generalized Jacobian.
In other words,
when multiplied by a vector h,
the generalized Jacobian
of φ can be characterized through this larger set,
which we denote with a hat.
For this larger set,
the formula is as follows.
We have the identity matrix,
plus a constant times another matrix.
The third part is a set,
because it contains the subdifferential
of the proximal mapping,
which we have discussed before.
In this way,
we can characterize
the generalized Jacobian as well.
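Based on the description above, the larger set can be sketched in the following form; the scalar σ and the matrix B are assumptions read off from the slide, since the transcript does not name them:

```latex
% A sketch of the larger set containing the generalized Jacobian of phi:
\hat{\partial}\varphi(z)
  \;=\; \bigl\{\, I \;+\; \sigma\, B^{\top} (I - M)\, B
        \;:\; M \in \partial \,\mathrm{prox}_{\sigma p}(z) \,\bigr\},
% where each M is one of the diagonal matrices Diag(mu)
% described on the previous slide.
```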
Now, we give the globalized
semismooth Newton method.
We introduced
this globalized version in chapter six.
We simply copy the algorithm here
and apply it to solve our subproblem.
The basic idea is that
we apply the line search strategy,
where the step αk
is provided by the backtracking strategy.
The search direction dk
is obtained by semismooth
Newton's method here.
When we apply semismooth
Newton's method to get the direction,
we have two issues to pay attention to.
The first one is that
the matrix V must be selected
from the larger set of the generalized Jacobian.
Recall that we use the hat
to denote this larger set
for the generalized Jacobian of φ.
So we choose V from this set.
The second issue is that
when we solve for the direction d,
we use the conjugate gradient method.
We do not solve the linear system directly;
instead, we apply the iterative
conjugate gradient solver
in order to save computational cost.
We will come back to this issue shortly.
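The globalized method just described can be sketched as follows. The function names, the backtracking test, and the tolerances are illustrative choices of ours, not the exact algorithm from chapter six:

```python
import numpy as np

def conjugate_gradient(Amul, b, tol=1e-10, max_iter=200):
    """Plain conjugate gradient for A x = b, where Amul(x) computes A @ x
    and A is assumed symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - Amul(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = Amul(p)
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def semismooth_newton(x0, grad, pick_V, tol=1e-8, max_iter=100,
                      rho=0.5, delta=1e-4):
    """Sketch of a globalized semismooth Newton method.

    grad(x)   : the (possibly nonsmooth) gradient phi whose root we seek.
    pick_V(x) : returns a function h -> V @ h, with V chosen from the
                larger generalized Jacobian set at x.
    The direction d_k solves V d = -grad(x) inexactly by CG; the step
    alpha_k comes from a backtracking line search.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        Vmul = pick_V(x)
        d = conjugate_gradient(Vmul, -g)
        # backtracking: shrink alpha until the residual norm decreases enough
        alpha = 1.0
        while (np.linalg.norm(grad(x + alpha * d)) > (1 - delta * alpha) * gnorm
               and alpha > 1e-12):
            alpha *= rho
        x = x + alpha * d
    return x
```

On a smooth quadratic toy problem this reduces to ordinary Newton's method and converges in one step; the semismooth machinery only matters because V may come from a set rather than being a classical Jacobian.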
Now, we take a look
at how we apply the conjugate
gradient method to solve this linear system.
The aim is to reduce the computational cost.
Notice that this V takes the following form.
It is one element
from the generalized Jacobian set.
If we stored this V explicitly,
we would need a lot of memory
as well as computational cost,
because there is a matrix multiplication here.
Therefore, the practical way is
to calculate the product Vh directly,
instead of storing V and then doing the multiplication.
So if we calculate Vh by this formula,
then we can see that the main
part we need to worry about is this one:
an identity matrix
minus the diagonal matrix.
Notice that each μi
is either zero or one.
So in order to save computational cost,
we need to find a way
to make sure that this part
contains as many zeros as possible.
So how can we choose this?
For each μi,
if it is one,
then the corresponding entry of the difference will be zero.
Therefore, we choose μi to be one at as many indices as possible.
This is where the index set I(z) comes in:
I(z) denotes the indices where zi lies
between zero and this constant.
In other words, if zi lives in this interval,
then μi will be zero.
Consequently, taking the difference leaves
only the rows of B indexed by I(z),
multiplied together with the transpose of those same rows.
So we can measure how small I(z) is:
if it contains fewer elements,
then we can save a lot of computational cost.
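A sketch of the matrix-free product, assuming (as the slide suggests, though the transcript does not state it exactly) that V = I + σ Bᵀ(I − Diag(μ))B; the function name and argument layout are our own:

```python
import numpy as np

def apply_V(h, B, mu, sigma):
    """Matrix-free product V @ h for (assumed form)
        V = I + sigma * B.T @ (I - diag(mu)) @ B.

    Since each mu_i is 0 or 1, the factor (I - diag(mu)) keeps only
    the rows of B with mu_i = 0, i.e. the index set I(z).  Restricting
    B to those rows before multiplying avoids forming V explicitly,
    and the cost scales with |I(z)| instead of the full row count.
    """
    idx = (np.asarray(mu) == 0.0)   # the index set I(z)
    B_I = B[idx, :]                 # only the rows that contribute
    return h + sigma * (B_I.T @ (B_I @ h))
```

This is exactly the "third way" in the cost table that follows: the smaller I(z) is, the fewer rows of B enter the two small matrix-vector products.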
Now, in this table,
we summarize the computational cost
of the different approaches.
The third one is what
we use in our implementation.
We can see that the computational
cost is significantly
reduced from the previous ones.
So if I(z) is smaller,
then we save more computational cost.
Now finally,
let's discuss the convergence result.
For the global convergence result,
we use the theorem in this paper.
And we can see that,
under the settings here,
the sequence
of (ωk, sk) is bounded
and converges to the unique optimal
solution of the original problem.
This is the global convergence result.
Next, to present the local convergence result,
we need some stopping criteria.
Here I present different types of stopping criteria.
However, I will not go into the details
of where they come from and how we implement them.
We use this stopping criterion
to get the following convergence
result for the augmented Lagrangian method.
So the first result is that
the sequence λk converges to λ*,
which is one of the augmented Lagrange multipliers.
Moreover, along the iterations,
the distance of λ(k+1) to the optimal set
is bounded by βk times
the distance of the previous iterate to that set.
And this constant βk
tends to a limit which is smaller than one.
In other words,
the augmented Lagrange multipliers
enjoy a linear convergence rate.
If we impose more conditions on the stopping criteria,
then the updated iterate ωk
will also enjoy a linear convergence rate.
Here βk is a constant
which depends on k, and it tends to a limit
which is hopefully smaller than one.
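As a sketch, the two rate statements above can be written as follows; here Ω denotes the set of augmented Lagrange multipliers, and the exact constants are in the cited paper:

```latex
% Linear rate of the multiplier sequence:
\operatorname{dist}\bigl(\lambda^{k+1}, \Omega\bigr)
  \;\le\; \beta_k \, \operatorname{dist}\bigl(\lambda^{k}, \Omega\bigr),
\qquad \beta_k \to \beta_\infty < 1 .
% Under the extra conditions on the stopping criteria,
% the primal iterates inherit the same type of rate:
\bigl\| \omega^{k+1} - \omega^{*} \bigr\|
  \;\le\; \beta_k' \, \bigl\| \omega^{k} - \omega^{*} \bigr\| .
```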
Now finally, let me present some numerical results.
Here we also compare our method
with the state-of-the-art solver LIBLINEAR.
And we compare with the DCD method,
dual coordinate descent method.
So these are the CPU times.
These are the accuracy.
As you can see,
our algorithm is a bit
faster than the DCD method.
In many cases,
it also returns a bit higher accuracy than the DCD method.
In particular, there are several examples
showing that the augmented Lagrangian method
can significantly improve on
the DCD method, since it reaches much higher accuracy
than DCD does.
Now finally, let me conclude this part.
We introduced the augmented Lagrangian method
to solve the L1-loss SVC problem.
If you're interested in this part,
you can download the reference here to see more details.
Finally, we conclude this chapter as follows.
We introduced the quadratic penalty method,
the exact penalty method,
as well as the augmented Lagrangian method
to solve constrained optimization problems.
We also presented two examples to show how
we can apply efficient methods to solve modern
optimization problems.
So this is the end of this course.
Thank you. Goodbye.