Current course topic: Modern Optimization Methods (in English) > Chapter 6. Semismooth Newton's Method > 6.5 Exploring Sparsity in SVM
Hi. Welcome to modern optimization methods.
Today we will introduce the last part of chapter six,
exploring sparsity in SVM.
Now recall that
when we apply the semismooth Newton's method to solve the
support vector classification and support vector regression,
we need to solve a linear system
in order to get the semismooth Newton's direction.
Now we take a detailed look at how we solve the linear system.
And we will show how we
explore the sparsity of the optimal solution
to significantly reduce the computational cost.
Now first let's take a look at the conjugate gradient method.
We are trying to solve the linear equations in the following form.
I use this V.
In the semismooth Newton's method,
at each iteration,
we solve this sort of linear equation
in order to get the direction d.
Here V is selected from the generalized Jacobian set,
and h is a given right-hand side.
Now let's recall the dimension of this V.
Recall that ω is in n-dimensional space.
Therefore, the matrix V is an n by n matrix.
A question is how do we solve this linear equation
to get the d?
So we apply the conjugate gradient method.
As you may guess, when we use the conjugate gradient method,
the most time-consuming part is
calculating the matrix multiplied by a vector.
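As an illustration, here is a minimal textbook conjugate gradient sketch (not the course's exact code). Note that it touches V only through a matrix-vector product `matvec`, which is why that product dominates the cost:

```python
import numpy as np

def conjugate_gradient(matvec, h, tol=1e-8, max_iter=1000):
    """Solve V d = h for symmetric positive definite V,
    accessing V only through matvec(v) = V @ v."""
    d = np.zeros_like(h)
    r = h.copy()              # residual h - V d (d = 0 initially)
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Vp = matvec(p)        # the only place V appears
        alpha = rs / (p @ Vp)
        d += alpha * p
        r -= alpha * Vp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

Each iteration costs one call to `matvec` plus O(n) vector work, so any saving in the matrix-vector product translates directly into a saving for the whole solve.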
So how do we calculate this matrix multiplied by a vector?
Here we do some further analysis.
First, we take a look at the formula of V.
Recall that V takes the following form.
It is composed of two parts.
The first part is the identity matrix.
The second part is a sum of matrices,
where each term is h_i multiplied by x_i x_i^T.
In other words, each term here
is a rank-one matrix; due to the different values of h_i,
the terms differ slightly.
Now let's take a look at how we compute the
matrix multiplied by the vector Δω.
We take a look at this term.
This is a vector in n-dimensional space,
and this is an n by n matrix.
Let's see how we do this calculation.
The naive calculation, multiplying the matrix by the vector directly,
leads to O(n^2) computational cost.
Can we reduce this further to save some computational cost?
This is a question we ask.
And below we will do some analysis.
Now let's take a look at how we can calculate this multiplication.
If we substitute the form of V into the multiplication,
then the result is the vector Δω itself
plus the sum term.
For the sum term,
if we multiply each term by Δω,
we first get
x_i^T Δω, which is a scalar.
Then this scalar multiplies the vector x_i.
In other words, if we do not form the matrix directly,
but instead accumulate the scalar-times-vector products on the right-hand side,
then we save some computational cost.
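To make this concrete, here is a minimal sketch comparing the two ways of computing V Δω. The names `matvec_naive` and `matvec_rank_one`, and the data matrix `X` whose rows are the samples x_i, are illustrative assumptions, not the course's code:

```python
import numpy as np

def matvec_naive(X, hvec, v):
    """Form V = I + sum_i h_i x_i x_i^T explicitly: O(n^2) memory and time."""
    n = X.shape[1]
    V = np.eye(n)
    for hi, xi in zip(hvec, X):
        V += hi * np.outer(xi, xi)
    return V @ v

def matvec_rank_one(X, hvec, v):
    """Exploit the rank-one structure: each term is h_i (x_i^T v) x_i,
    which costs only O(n) per term, never forming the n-by-n matrix."""
    result = v.copy()
    for hi, xi in zip(hvec, X):
        result += hi * (xi @ v) * xi   # scalar x_i^T v times vector x_i
    return result
```

Both functions return the same vector; the second one simply reorders the computation so that only inner products and scaled vectors appear.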
So how can we save even more?
Now let's take a look at this h.
Recall that h is an element of the generalized Jacobian
of the max function.
Then we can see that each h_i is either zero or one,
or a value between zero and one.
In other words, we do not need to multiply by h_i all the time;
we just pick out the nonzero h_i.
If z_i(ω) is positive, then h_i is one,
and we keep that term.
If z_i(ω) is negative, then h_i is zero,
so we drop it.
We do not consider that term.
And if z_i(ω) is exactly zero, then h_i can be any value between zero and one.
In this situation,
what value can we choose to reduce the computational cost?
You can see that the answer is very simple.
We can choose this h_i to be zero.
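As a small sketch of this choice, here is how one element of the generalized Jacobian of t ↦ max(t, 0) can be evaluated. The function name `h_element` and the `boundary_value` parameter are illustrative, not from the course:

```python
def h_element(z, boundary_value=0.0):
    """One element of the generalized Jacobian of t -> max(t, 0) at z.

    The Jacobian element is 1 if z > 0, 0 if z < 0, and any value
    in [0, 1] at z = 0; boundary_value picks that free value, and
    choosing 0 there is the cheap option discussed above."""
    if z > 0:
        return 1.0
    if z < 0:
        return 0.0
    return boundary_value
```

With `boundary_value=0.0`, every index with z_i(ω) ≤ 0 contributes nothing to the sum.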
In this case,
by setting those h_i to zero,
we no longer need to consider two of the three cases.
What is left are the indices with z_i(ω) > 0.
We define an index set
I_k collecting the indices i with z_i(ω^k) positive,
and for those indices
we do the summation.
For all other indices, h_i is zero and we do not have to
do any calculation.
By doing this,
we can see that the computational cost can be reduced.
For the dropped terms, there is no computational cost.
And for the remaining sum,
we only need vector multiplications over part of the indices.
We summarize
the computational cost of the original one
and the improved one in this table.
Then you can see that the original one
will take the complexity of I times n order,
while the reduced one,
actually we only need to take the order of
2|Ik| elements multiplied by n.
The absolute value of Ik actually indicates
the number of elements in Ik.
You may wonder that
what is the size of the elements in Ik.
If there are many elements in Ik,
then we cannot get so many benefits.
Below I will show by a figure
that how big or how small the element in Ik is.
This is an example that for all the indices.
The x axis is the iteration number.
For example, if k equals two,
this is the number of Ik when k is two.
And the right hand figure actually
plot out the ratio of the number in Ik overall
the number of samples.
So the second figure actually indicates that
when this k goes on,
then the ratio will go up or go bigger firstly,
then it will decrease.
Finally, the ratio will be around 10%,
which means that the computational cost we save is about
at least 10% of the original one,
which is significantly reduced.
Now in the next figure,
we demonstrate the quadratic convergence rate
for some examples.
Here I listed two examples to show the convergence rate.
The x axis is the number of iterations,
and the y axis is the log of the norm of gradient.
If we observe the decreasing line like this,
then it indicates that the convergence rate is quadratic.
As we can see that for different parameters here,
all the algorithm, they will have quadratic convergence rate.
It is the same for another example here.
We also observe that
for the two examples here, it takes less than 10 iterations
to terminate the algorithm.
In other words,
this algorithm takes a very small number of iterations.
It actually converges very fast.
This is the demonstration of convergence rate.
Now, let's see some numerical results.
This is the numerical results that we compare with LIBLINEAR.
Remember that LIBLINEAR is the state-of-art solver
for SVM problem in linear kernel case.
We compare our semismooth Newton's method with
DCD and TRON.
So A1 is DCD, A2 is TRON,
and A3 is our semismooth Newton's method.
We report the CPU time t in seconds,
as well as the accuracy in percentage.
So here are the different names of the data sets.
As you can see that computational time of the three methods are
similar to each other.
And in terms of accuracy,
I marked the winner in bold.
You can see that the three methods are actually comparable
with each other.
If we take a look at the
support vector regression problem,
and the result is actually very different.
As you can see, B1 is DCD,
B2 is TRON and B3 is our semismooth Newton's method.
As you can see that the CPU time is also comparable
with each other.
However, in terms of MSE which is a
measure of errors.
The lower MSE is,
the better the performance is.
So you can see that compared with TRON and DCD,
the semismooth Newton's method actually
performs significantly better for the data here.
Sometimes for some data like the housing data,
and eunite 2001 data,
they actually perform very much better than TRON and DCD.
In other words,
when applying the semismooth Newton's method to the
support vector regression problem,
the performance of the semismooth Newton's method is much
better than the existing state-of-art method.
Now, if you are interested in semismooth Newton's method for SVM,
you can find more details in the publication here.
Let's make a summary about this part.
We basically apply semismooth Newton's method
to solve the support vector machine problem,
the L2-loss related models.
And we can get global convergence rate as well as
quadratic local convergence rate,
especially when applying semismooth Newton's method to
support vector regression model,
the performance is much better than the existing
state-of-art solvers.
Now here are some related reference about this chapter
and if you're interested, you can download them from website.
Now let's make a brief summary about this chapter.
In this chapter, we introduced semismooth Newton's method,
including the algorithm,
the globalized version of algorithms,
as well as local and global convergence result.
Moreover, we connect the semismooth Newton's method with the
classification methods and we apply to solve the
SVC and SVR model.
The resulting performance is very good.
This is the end of chapter six.
See you next time.
-1.1 About optimization
-1.2 Classification of optimization
--1.2 Classification of optimization
-1.3 Preliminaries in convex analysis
--1.3 Preliminaries in convex analysis
-课件
-【作业须知】
-练习题
-2.1 What is a Solution?
-2.2 Optimality Conditions Ⅰ
-2.3 Optimality Conditions Ⅱ
-2.4 Line search strategy
-2.5 Search direction Ⅱ
-2.6 Convergence
-课件
-【作业须知】
-作业
-3.1 Exact Step Length
-3.2 The Wolfe conditions
-3.3 Inexact Line Search II
-3.4 Convergence of Line Search Methods
--3.4 Convergence of Line Search Methods
-3.5 Convergence Rate
-课件
-【作业须知】
-作业
-4.1 Main Idea of Trust Region Methods
--4.1 Main Idea of Trust Region Methods
-4.2 Trust-Region Algorithm
-4.3 Solving Subproblem
-4.4 Solving Subproblem II
-4.5 Convergence
-课件
-【作业须知】
-作业
-5.1 Conjugate direction method
--5.1 Conjugate direction method
-5.2 Property of conjugate direction method
--5.2 Property of conjugate direction method
-5.3 Conjugate gradient method
--5.3 Conjugate gradient method
-5.4 Rate of convergence
-5.5 Nonlinear conjugate gradient method
--5.5 Nonlinear conjugate gradient method
-5.6 Convergence of nonlinear conjugate gradient method
--5.6 Convergence of nonlinear conjugate gradient method
-课件
-【作业须知】
-作业
-6.1 Semismoothness
-6.2 Semismooth Newton's Method
-6.3 Support Vector Machine
-6.4 Semismooth Newtons' Method for SVM
--6.4 Semismooth Newtons' Method for SVM
-6.5 Exploring Sparsity in SVM
--6.5 Exploring Sparsity in SVM
-课件
-【作业须知】
-作业
-7.1 Local and Global Solutions
--7.1 Local and Global Solutions
-7.2 Examples One
-7.3 Examples Two and Three
-7.4 Constraint Qualifications
--7.4 Constraint Qualifications
-7.5 First-Order Optimality Conditions
--7.5 First-Order Optimality Conditions
-7.6 Second Order Necessary Condition
--7.6 Second Order Necessary Condition
-7.7 Second Order Sufficient Condition
--7.7 Second Order Sufficient Condition
-7.8 Duality
-课件
-【作业须知】
-作业
-8.1 KKT conditions
-8.2 An Example
-8.3 Dual Problem
-课件
-【作业须知】
-测试题
-9.1 Quadratic Penalty Method
--9.1 Quadratic Penalty Method
-9.2 Exact Penalty Function
-9.3 Augmented Lagrangian Method
--9.3 Augmented Lagrangian Method
-9.4 Quadratic Penalty Method for Hypergraph Matching
--9.4 Quadratic Penalty Method for Hypergraph Matching
-9.5 Quadratic Penalty Method for Hypergraph Matching: Ⅱ
--9.5 Quadratic Penalty Method for Hypergraph Matching: Ⅱ
-9.6 Augmented Lagrangian Method for SVM
--9.6 Augmented Lagrangian Method for SVM
-9.7 Augmented Lagrangian Method for SVM: Ⅱ
--9.7 Augmented Lagrangian Method for SVM: Ⅱ
-课件
-【作业须知】
-作业