Current course topic: Microsoft Research Asia Big Data Lecture Series > Lecture 5: Big Data Research in Software Analytics (Hongyu Zhang) > The concepts of software analytics
Hi, welcome to our lecture on software analytics using big data.
My name is Hongyu Zhang. I'm a researcher at Microsoft Research.
This lecture is a part of the big data course
offered by Microsoft Research Asia.
Over the years, a large
amount of data has been produced by software engineering practice.
The aim of software analytics is to
utilize this data to improve software practice.
In this lecture, I will show you the concepts of software analytics.
I will also give some examples of software analytics tasks.
Thank you for watching!
We are now in an era of big data. Data is surrounding us and is everywhere.
We also generate a large amount of data every minute.
For example, people generate about two hundred million emails every minute
and about four billion Google queries.
We can see that
we generate a large amount of data, so we could
utilize this data. We could mine this data
to help us better understand ourselves and
our customers. We could then
build better tools to improve the quality of our life and
to improve the productivity of our work.
In this era, we also see a dramatic change in software.
Software systems are no longer shipped on a CD-ROM in a box.
Instead, more and more software systems
are becoming large-scale online systems.
For example, Microsoft Office is now online, which is called Office 365. Subscribed
users can access Office features any time, anywhere.
Nowadays the software systems are also
getting larger and more and more complex.
The way software is built and operated
is also changing. Traditionally, software was built following a
waterfall model by a small group of developers.
Now, the development of software systems
is increasingly distributed and
agile. For some open source systems,
developers all around the world
can contribute to the code and fix bugs.
Over the years of software development,
a vast amount of software engineering data has been produced.
This data could come from long-term evolution of
industrial software systems.
It could also come from open-source projects such as those from GitHub.
There are many kinds of software engineering data,
for example, source code, crash reports,
bugs, metrics, logs,
customer information, and so on. The amount of data keeps increasing.
As an example, we show the growth of source code size for Eclipse.
Eclipse is a widely used open-source IDE.
It has gone through a long period of evolution. In 2001,
Eclipse 1.0 was released. After that,
there has been a major release almost every year.
The figure in this slide shows the growth of Eclipse source code size,
measured in terms of the number of Java files,
the number of compiled classes, and the lines of code.
We can see that the size of Eclipse goes up quickly.
Eclipse 1.0 has about half a million lines of code,
while Eclipse 3.3 has nearly 2,000,000 lines of code.
The average growth has been 260,000 lines of code
for each major release.
Over the years, a large amount of source code data has been generated.
Another example of software engineering data is error reports.
Although a lot of software quality assurance measures have been
taken, released products still contain bugs.
Many bugs could lead to software crashes.
Once a crash happens, users could choose to send the crash report
to the product developers. Microsoft has deployed a system
which can collect and analyze
crash reports sent from Windows users.
This system is called the Windows
Error Reporting system, or WER for short.
It was reported that about
1,000,000 error reports were collected within
8 months of WER deployment. The number of error reports
processed by WER increased by a factor of 30
from 2003 to 2009. We can see that
over the years a large amount of
error report data has been generated.
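To make the idea of analyzing collected crash reports a little more concrete, here is a minimal sketch of grouping reports that share the same top stack frames. It is only an illustration under stated assumptions and not a description of how WER actually groups reports: the function names `signature` and `bucket_reports`, the choice of the top two frames as the grouping key, and the sample reports are all hypothetical.

```python
from collections import defaultdict


def signature(stack_frames, depth=2):
    """Use the top `depth` frames as a simple grouping key (an assumption)."""
    return tuple(stack_frames[:depth])


def bucket_reports(reports, depth=2):
    """Map each signature to the list of report ids that share it."""
    buckets = defaultdict(list)
    for report_id, frames in reports:
        buckets[signature(frames, depth)].append(report_id)
    return dict(buckets)


# Hypothetical crash reports: (id, stack frames from innermost to outermost).
REPORTS = [
    ("r1", ["memcpy", "load_image", "main"]),
    ("r2", ["memcpy", "load_image", "open_document", "main"]),
    ("r3", ["parse_header", "load_config", "main"]),
]

if __name__ == "__main__":
    # Print the largest groups first, since they affect the most users.
    grouped = bucket_reports(REPORTS)
    for sig, ids in sorted(grouped.items(), key=lambda kv: -len(kv[1])):
        print(len(ids), " > ".join(sig), ids)
```

Grouping like this lets a team look at one representative per group instead of every individual report, which matters once the number of reports grows large.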
Another example of software engineering data is
project data. GitHub is a popular open-source project hosting website.
It was founded in 2008.
As of January 2013, GitHub had already
attracted 2,000,000 users and hosted
4,900,000 projects. By December 2013,
GitHub contained about ten million projects in total.
Each project has its own data,
such as project summary, creation date, contributors, modification history,
programming language used and so on. We can see that over the years,
a large amount of project data has been generated.
This software engineering data can be largely classified into three types:
the data related to the operation of software,
the data related to the users of software, and the data related to
the product itself. The software operation data includes
runtime traces, program logs, events,
performance counters and so on. These data are
generated from the daily operation of the software.
The software user data includes the data from the users.
For example, the usage log, the user surveys,
online posts, blogs and so on.
The software product data includes source code,
bug reports, check-in history,
test cases and so on. These data
are generated from the product and process of the software.
There are many data sources from which we could collect our data.
For example, the source code repositories
store changes to the code from which we could
collect the source code changes.
The defect tracking systems store the bug-related information and
track the resolution of bugs. From the defect tracking systems,
we could collect the defect-related data.
The emails record the project communications among
the developers from which we could better understand
the rationales for decisions throughout the life of a project.
The error reporting system collects the
crash-related data. The metric tools can collect
all the process- and product-related data.
The project hosting website contains the project metadata.
All these data can be collected automatically and can be used
for improving software process and products.
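As one illustration of collecting source code changes automatically, the sketch below sums added and deleted lines per file from text in the shape that `git log --numstat --format=%H` prints (a commit hash line, followed by tab-separated "added, deleted, path" lines). The function name `churn_per_file` and the sample log are hypothetical, and the sketch parses a string rather than calling Git, so it shows only one possible way to do this.

```python
from collections import Counter

# Hypothetical sample in the shape printed by `git log --numstat --format=%H`.
SAMPLE_LOG = """\
a1b2c3
10\t2\tsrc/parser.py
3\t0\tREADME.md

d4e5f6
5\t5\tsrc/parser.py
-\t-\tassets/logo.png
"""


def churn_per_file(numstat_text):
    """Sum added plus deleted lines for each file path."""
    churn = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit hash or blank line
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        churn[path] += int(added) + int(deleted)
    return churn


if __name__ == "__main__":
    # Files that change the most are listed first.
    for path, lines in churn_per_file(SAMPLE_LOG).most_common():
        print(path, lines)
```

The same pattern, reading an export from a data source and reducing it to a per-item count, would apply to the other sources mentioned above, such as defect tracking systems.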
The goal of software analytics is to enable software
practitioners to perform data
exploration and analysis in order to obtain insightful and actionable
knowledge for real world tasks.
We have a large amount of data generated over the years.
We could mine these data to obtain insightful and
actionable knowledge. Then, we could build better tools,
and make better decisions in order to improve the quality of the
software and the productivity of the software development process.
Software analytics covers different aspects of software development.
Throughout the entire software life cycle, we can collect
data related to software systems such as the source code,
bug reports, execution logs, code complexity metrics,
performance counters, and so on.
We can then analyze these data to obtain insights about
the quality status of the software system. We can also
collect data from the software development process
and from the evolution process,
for example, the data about software process,
the data about cost and effort, developer data,
source code change data, and so on. We can then analyze these data to
obtain insights into the productivity of the developers.
We can also collect data from software
users such as click-through data, usage data, location data,
user profile, and so on. We can then analyze these data to obtain insights
about the actual use of the software.
All the obtained insights can be used to help software
practitioners produce better software.
Software analytics utilizes many technologies.
We call them technology pillars. One technology pillar
is big data computing, which leverages the power of machines.
We could use Microsoft Azure, Apache Hadoop,
Spark, and Elasticsearch
to process a large amount of structured and
unstructured data in an efficient way.
The other technology pillar is about information visualization,
which leverages the power of humans. We humans are good at visualization
and can quickly identify patterns from images or charts.
Another technology pillar is related to data mining,
machine learning and information retrieval, which connects humans
and machines. We can apply classification, clustering, ranking, regression,
and other such techniques
to mine knowledge from the software engineering data.
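As a small illustration of ranking applied to software engineering data, the sketch below orders bug report titles by their word overlap with a query, using Jaccard similarity. This is one simple technique chosen for illustration, not a method named in the lecture. The function names and the report titles are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase words in a report title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(query, reports):
    """Return (score, report) pairs sorted by descending similarity."""
    q = tokens(query)
    scored = [(jaccard(q, tokens(r)), r) for r in reports]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical bug report titles, invented for illustration.
REPORTS = [
    "Editor crashes when opening large file",
    "Typo in preferences dialog",
    "Crash on opening a large project file",
]

if __name__ == "__main__":
    for score, title in rank_similar("crash when opening large file", REPORTS):
        print(f"{score:.2f}  {title}")
```

A ranking like this could help a developer find likely duplicates of a new bug report. A real system would usually add steps such as stemming, so that "crash" and "crashes" count as the same word.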
Software analytics can help software developers
and testers to improve their productivity.
It can also help other software practitioners such as
program managers, support engineers and operation engineers to better
understand the software engineering data they collect and to help them
obtain more insights and make better decisions. They can use the results of
software analytics to verify or reject some
hypotheses, discover some new insights,
build classification or prediction models,
or construct a knowledge base and query against it.
In the rest of this talk, I will
show some examples of software analytics tasks.