Current course topic: Modern Optimization Methods > Chapter 9. Penalty and Augmented Lagrangian Methods > 9.6 Augmented Lagrangian Method for SVM
Hi. Welcome to Modern Optimization Methods.
Today we will introduce the sixth part
of Chapter 9:
the augmented Lagrangian method
applied to support vector machines.
This is the outline of this part.
We will first introduce the support vector machine problem.
Then we apply the augmented Lagrangian method
to solve the L1-loss SVC.
Finally, we will discuss some theoretical properties
and present some numerical results.
Now let's first recall the support vector machine.
Support vector machines are divided
into support vector classification (SVC)
and support vector regression (SVR).
In this part,
we mainly focus on the support vector classification problem,
but our method can also be applied
to support vector regression problems.
Recall that in support vector classification,
we search for a hyperplane
such that the two types of data are separated
by this hyperplane as well as possible.
When the data cannot be separated strictly,
we use a loss function
to measure the misclassification.
This leads to the L1-loss SVC.
In the L1-loss SVC model,
we measure the misclassification
by the L1 function,
which leads to the following problem,
and this is the problem we mainly focus on in this part.
So firstly,
to study this problem,
let's reformulate it
by augmenting each training point x
with one extra component,
and ω with one extra component as well.
The problem then reduces to the following L1-loss SVC
without the bias term.
The difference is that the bias term b
no longer appears explicitly:
it is absorbed into ω and the x_i.
So the L1-loss SVC model without the bias term is as follows.
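For reference, the unbiased L1-loss SVC model can be written as follows. This is the standard formulation; C > 0 is the penalty parameter, and the exact scaling may differ from the slides:

```latex
\min_{\omega \in \mathbb{R}^n} \;
\frac{1}{2}\|\omega\|^2
\;+\;
C \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i\,\omega^{\top} x_i\bigr)
```

Here each x_i already carries the extra constant component, so no separate bias b appears.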
Now we try to solve this problem
using the augmented Lagrangian method.
How can we achieve this goal?
We first write this problem
in matrix-vector form.
The sum of max terms
can be reformulated as the L1 norm
of a max function,
whose inner part is Bω+d.
The matrix B is made up of the training data
x_i together with the labels,
so you can easily write down the form of B.
And d is a vector of all ones.
After that,
we introduce an extra variable s,
replace the second term by a function p(s),
and require s to equal Bω+d.
By doing this,
the problem is reformulated as a nonsmooth part p(s)
plus a smooth part,
and s carries an equality constraint.
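As a quick sanity check, here is a minimal sketch assuming B has rows -y_i x_i^T and d is the all-ones vector (the lecture's exact convention may differ); the hinge-loss sum and the matrix-vector form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))          # training points x_i (rows)
y = rng.choice([-1.0, 1.0], size=m)      # binary labels
w = rng.standard_normal(n)

# Direct sum of hinge losses: sum_i max(0, 1 - y_i <x_i, w>)
hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()

# Matrix-vector form: ||max(B w + d, 0)||_1 with B = -diag(y) X, d = 1
B = -y[:, None] * X
d = np.ones(m)
hinge_mat = np.abs(np.maximum(B @ w + d, 0.0)).sum()

assert np.isclose(hinge, hinge_mat)
```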
Now for this problem,
the variables are ω and s,
and we have equality constraints.
For equality-constrained problems,
we can then apply the augmented Lagrangian method.
The augmented Lagrangian function
is written as follows.
We take the original objective function f,
add the Lagrangian term for the equality constraint,
and then add the quadratic penalty term.
This gives the augmented Lagrangian function.
Here λ is the multiplier
corresponding to the equality constraint,
and σ is the penalty parameter.
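As an illustration, the augmented Lagrangian function can be evaluated as below. This is a sketch under the assumptions p(s) = C·Σ max(s_i, 0) and a "+λᵀ(s − Bω − d)" sign convention, which may differ from the slides:

```python
import numpy as np

def aug_lagrangian(w, s, lam, B, d, C=1.0, sigma=1.0):
    """Augmented Lagrangian for  min 0.5*||w||^2 + p(s)  s.t.  s = B w + d,
    with p(s) = C * sum(max(s_i, 0)) as the nonsmooth L1-hinge part."""
    r = s - B @ w - d                       # equality-constraint residual
    p = C * np.maximum(s, 0.0).sum()        # nonsmooth part p(s)
    return 0.5 * (w @ w) + p + lam @ r + 0.5 * sigma * (r @ r)

# At a feasible point (s = B w + d) the value reduces to the objective itself.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3)); d = np.ones(4)
w = rng.standard_normal(3); lam = rng.standard_normal(4)
s = B @ w + d
val = aug_lagrangian(w, s, lam, B, d, C=2.0, sigma=5.0)
obj = 0.5 * (w @ w) + 2.0 * np.maximum(s, 0.0).sum()
assert np.isclose(val, obj)
```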
The augmented Lagrangian method
then takes the following update.
Recall that the primal variables are ω and s,
so there are two blocks of variables,
and we need to minimize over ω and s
jointly to get the minimizer.
After that,
we update λ by the following formula,
which is the traditional multiplier update.
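To see this update pattern concretely, here is the same outer loop on a toy equality-constrained problem, not the SVC problem itself, chosen so the subproblem has a closed form; the iterates converge to the minimum-norm solution:

```python
import numpy as np

# Toy problem: min 0.5*||w||^2  s.t.  A w = b.  The ALM subproblem
#   argmin_w 0.5*||w||^2 + lam @ (A w - b) + 0.5*sigma*||A w - b||^2
# has the closed form  (I + sigma*A.T @ A) w = A.T @ (sigma*b - lam).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([2.0, 4.0])

lam = np.zeros(2)        # multiplier for the equality constraint
sigma = 10.0             # penalty parameter
for _ in range(100):
    # primal step: minimize the augmented Lagrangian over w
    w = np.linalg.solve(np.eye(4) + sigma * A.T @ A,
                        A.T @ (sigma * b - lam))
    # dual step: classical multiplier update
    lam = lam + sigma * (A @ w - b)

w_star = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm solution
assert np.allclose(w, w_star, atol=1e-8)
```

The subproblem solver is the part that changes from problem to problem; the multiplier update stays the same.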
So the main focus is how
we solve this subproblem with respect to ω and s.
The key points,
as we mentioned just now,
are these:
the first is how to solve the subproblem;
the second is,
if we can manage to solve the subproblem,
what is the convergence result
for the augmented Lagrangian method?
These are the two questions
we are going to consider.
Now let's take a look at some preliminaries.
To answer question one,
we need some important tools.
The first tool is the Moreau-Yosida regularization,
a basic tool in designing
nonsmooth optimization methods.
To introduce the Moreau-Yosida regularization,
we need some basic settings.
Let the capital X
be a real finite-dimensional space
equipped with the Euclidean inner product,
and denote its induced norm
in the usual way.
We then assume that q
is a closed convex function
on the space X, taking extended real values.
With these settings,
the Moreau-Yosida regularization of q
at a point x is defined in the following way.
If we take a detailed look at this definition,
θq(x) is the minimal value
of the right-hand side:
a least-squares distance between y and x,
plus the function value q(y).
The minimization looks for the y
that makes this objective smallest.
In other words,
given x,
we solve this minimization problem
to get the optimal solution y,
and the optimal value is θq(x).
If x changes,
the optimal y changes,
and the resulting optimal value θq(x)
changes as well.
This is called the
Moreau-Yosida regularization.
For the optimal solution
of this minimization problem,
we introduce the notation Proxq(x).
This is the optimal solution y*,
and we call it the proximal point of x
associated with the function q.
This page is very important
in the following analysis.
Now we have a very important proposition
about the Moreau-Yosida regularization.
It says the following:
let q be a closed convex function,
let θq be the Moreau-Yosida regularization of q,
and denote the proximal point as above.
Then, first,
θq is continuously differentiable.
Although an explicit form of θq may not be available,
we can still prove that θq
is continuously differentiable.
Second, we can write down its gradient explicitly
as the difference between x
and its proximal point.
In other words,
for the function θq,
we can calculate the gradient from x itself
together with the proximal point.
This will be used in solving the subproblems in our case.
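To make the proposition concrete, take the scalar example q(y) = |y|, whose proximal point is the well-known soft-thresholding operator; a finite-difference check confirms that the gradient of θq at x is x minus its proximal point:

```python
import numpy as np

# Moreau-Yosida regularization of q(y) = |y| (scalar case):
#   theta(x) = min_y 0.5*(y - x)^2 + |y|,  minimizer = soft-thresholding.
def prox_abs(x):
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def theta(x):
    y = prox_abs(x)
    return 0.5 * (y - x) ** 2 + np.abs(y)

# Proposition: theta is C^1 with gradient x - prox(x).
# Check it against a central finite difference.
x = 1.7
h = 1e-6
fd = (theta(x + h) - theta(x - h)) / (2.0 * h)
assert np.isclose(fd, x - prox_abs(x), atol=1e-5)
```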
Now getting back to the subproblem in our case:
we need to minimize
the augmented Lagrangian function with respect to ω and s
at the same time.
The idea for solving this subproblem is as follows.
We first minimize the
function with respect to s
to express s as a formula in ω,
then we solve the minimization problem
in terms of ω.
Now let's explain this in more detail.
Firstly,
we minimize the augmented Lagrangian function
with respect to s.
Given ω and λ,
we rewrite the augmented Lagrangian
as a term that is constant with respect to s
plus a part that depends on s.
Minimizing that part is equivalent to minimizing
a least-squares term
plus the function p.
Therefore,
if you look at this part,
it is exactly of the form
of a Moreau-Yosida regularization.
So we denote the optimal value
of this part by σθp(z(ω)),
where z(ω) is the term shown here,
which you can easily calculate.
Assuming the proximal point has an explicit formula,
writing s in this form
eliminates the variable s.
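As an illustration of this elimination step, here is a sketch assuming p(s) = C·Σ max(s_i, 0) and z(ω) = Bω + d − λ/σ, one common sign convention that may differ from the slides; minimizing the augmented Lagrangian over a single component of s matches the closed-form proximal point:

```python
import numpy as np

# For fixed w, minimizing the augmented Lagrangian over one component of s,
#   min_s C*max(s, 0) + lam*(s - t) + 0.5*sigma*(s - t)^2   with t = (B w + d)_i,
# reduces (by completing the square) to  s* = Prox_{p/sigma}(t - lam/sigma).
def prox_hinge(z, tau):
    """Componentwise proximal point of tau * max(., 0) at z."""
    return np.where(z > tau, z - tau, np.where(z < 0.0, z, 0.0))

C, sigma, lam, t = 2.0, 3.0, 0.7, 1.4
s_formula = prox_hinge(t - lam / sigma, C / sigma)

# Brute-force check: minimize the same one-dimensional function on a fine grid.
grid = np.linspace(-5.0, 5.0, 200001)
vals = C * np.maximum(grid, 0.0) + lam * (grid - t) + 0.5 * sigma * (grid - t) ** 2
s_grid = grid[np.argmin(vals)]
assert abs(s_formula - s_grid) < 1e-3
```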
The next step is
to solve the minimization of Φ
in terms of ω.
Moreover,
if we take the gradient of Φ with respect to ω,
then by the Moreau-Yosida properties,
the gradient of Φ
equals ω plus the gradient of the θp term,
and by the Moreau-Yosida decomposition,
that gradient of θp is available through the proximal mapping.
In other words,
we have the gradient of Φ in explicit form.
This provides us
a practical way to solve the subproblem.
Now, based on the analysis above,
we solve the minimization problem in the following way.
Firstly,
we solve the minimization of Φ in terms of ω
and find the optimal solution ω*.
After that,
s* is just the proximal point of z(ω*),
so s* is obtained
by a direct calculation.
The main focus is therefore
how to find ω*.
Solving for ω* is equivalent
to finding a stationary point,
i.e., solving the first-order optimality condition.
Now if we take a look at the formula
for the gradient of Φ,
it takes the following form.
In this formula,
we have s*(ω),
which is the proximal point of z(ω).
Therefore,
to solve for a zero of the gradient,
we need to settle three questions.
The first is
to derive the formula of the proximal mapping,
because s* appears in the gradient.
The second is:
what is the generalized Jacobian of the gradient of Φ?
We need it because we will use the semismoothness property.
The third is:
how can we reduce the computational cost?
We need to make sure
the algorithm is fast.
Let's make a brief summary of this part.
In this part,
we investigated the augmented Lagrangian method
for solving the L1-loss SVC,
and we derived the framework
of the augmented Lagrangian method for it.
We also used an important tool
called the Moreau-Yosida regularization.
In the next part,
we will talk about
how to settle the three key points
and design an efficient algorithm.
We stop here, and see you next time.