The Concepts of Software Analytics (Microsoft Research Asia Big Data Lecture Series)

Hi, welcome to our lecture on software analytics using big data.

My name is Hongyu Zhang. I'm a researcher at Microsoft Research.

This lecture is a part of the big data course

offered by Microsoft Research Asia.

Over the years, a large

amount of data has been produced by software engineering practice.

The aim of software analytics is to

utilize this data to improve software practice.

In this lecture, I will show you the concepts of software analytics.

I will also give some examples of software analytics tasks.

Thank you for watching!

We are now in an era of big data. Data surrounds us and is everywhere.

We also generate a large amount of data every minute.

For example, people generate about two hundred million emails every minute

and about four million Google queries.

We can see that

we generate a large amount of data, so we could

utilize this data. We could mine this data

to help us better understand ourselves and

our customers. And we could then

build better tools to improve the quality of our life and

to improve the productivity of our work.

In this era, we also see a dramatic change in software.

Software systems are no longer shipped on a CD-ROM in a box.

Instead, more and more software systems

are becoming large-scale online systems.

For example, Microsoft Office is now online, which is called Office 365. Subscribed

users can access Office features any time, anywhere.

Nowadays the software systems are also

getting larger and more and more complex.

The way software is built and operated

is also changing. Traditionally, software was built following a

waterfall model by a small group of developers.

Now, the development of software systems

is increasingly distributed and

agile. For some open source systems,

developers all around the world

can contribute to the code and fix bugs.

Over the years of software development,

a vast amount of software engineering data has been produced.

This data could come from long-term evolution of

industrial software systems.

It could also come from open-source projects such as those from GitHub.

There are many kinds of software engineering data,

for example, source code, crash reports,

bugs, metrics, logs,

customer information, and so on. The amount of data keeps increasing.

As an example, we show the growth of source code size for Eclipse.

Eclipse is a widely used open-source IDE.

It has gone through a long period of evolution. In 2001,

Eclipse 1.0 was released. After that,

there has been a major release almost every year.

The figure in this slide shows the growth of Eclipse source code size,

measured in terms of the number of Java files,

the number of compiled classes, and the lines of code.

We can see that the size of Eclipse goes up quickly.

Eclipse 1.0 has about half a million lines of code,

while Eclipse 3.3 has nearly 2,000,000 lines of code.

The average growth has been 260,000 lines of code

for each major release.

Over the years, a large amount of source code data has been generated.
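As a rough check on the growth figures quoted for Eclipse, the average can be recomputed from the two endpoint sizes. This is only a sketch: it uses the rounded sizes from the talk (about 0.5 million and about 2 million lines of code) and assumes six release steps after Eclipse 1.0 (2.0, 2.1, 3.0, 3.1, 3.2, 3.3). The release list is an assumption, not something taken from the slide, so the result only approximates the 260,000 figure.

```python
def average_growth_per_release(first_loc: int, last_loc: int, release_steps: int) -> float:
    """Average lines of code added per release step between two releases."""
    if release_steps <= 0:
        raise ValueError("release_steps must be positive")
    return (last_loc - first_loc) / release_steps


# Rounded endpoint sizes quoted in the talk (approximate values).
ECLIPSE_1_0_LOC = 500_000
ECLIPSE_3_3_LOC = 2_000_000
# Assumption: six release steps (2.0, 2.1, 3.0, 3.1, 3.2, 3.3) after 1.0.
ASSUMED_RELEASE_STEPS = 6

growth = average_growth_per_release(ECLIPSE_1_0_LOC, ECLIPSE_3_3_LOC, ASSUMED_RELEASE_STEPS)
print(f"average growth: {growth:,.0f} lines of code per release")
```

With these rounded inputs the estimate comes out near a quarter of a million lines per release, which is in the same range as the number on the slide.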

Another example of software engineering data is error reports.

Although a lot of software quality assurance measures have been

taken, released products still contain bugs.

Many bugs could lead to software crashes.

Once a crash happens, users could choose to send the crash report

to the product developers. Microsoft has deployed a system

which can collect and analyze

crash reports sent from Windows users.

This system is called the Windows

Error Reporting system, or WER for short.

It was reported that about

1,000,000 error reports were collected within

8 months of WER deployment. The number of error reports

processed by WER increased by a factor of 30

from 2003 to 2009. We can see that

over the years a large amount of

error report data has been generated.
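To get a feel for what a 30-fold increase between 2003 and 2009 means per year, the factor can be converted into an equivalent yearly growth rate. The sketch below assumes steady compound growth over six yearly steps, which is a simplification introduced here and not a claim made in the talk.

```python
def equivalent_yearly_factor(total_factor: float, years: int) -> float:
    """Yearly multiplier that compounds to total_factor over the given number of years."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


# Figures from the talk: a factor of 30 between 2003 and 2009.
TOTAL_FACTOR = 30.0
YEARS = 2009 - 2003  # six yearly steps (assumes steady growth)

yearly = equivalent_yearly_factor(TOTAL_FACTOR, YEARS)
print(f"equivalent growth: about {100 * (yearly - 1):.0f}% per year")
```

Under that assumption, the volume of processed reports grows by roughly three quarters every year.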

Another example of software engineering data is

project data. GitHub is a popular open-source project hosting website.

It was founded in 2008.

As of January 2013, GitHub had already

attracted 2,000,000 users and hosted

4,900,000 projects. By December 2013,

GitHub contained about ten million projects in total.

Each project has its own project data,

such as project summary, creation date, contributors, modification history,

programming language used and so on. We can see that over the years,

a large amount of project data has been generated.

This software engineering data can be broadly classified into three types:

the data related to the operation of software,

the data related to the users of software, and the data related to

the product itself. The software operation data includes

runtime traces, program logs, events,

performance counters and so on. These data are

generated from the daily operation of the software.

The software user data includes the data from the users.

For example, usage logs, user surveys,

online posts, blogs and so on.

The software product data includes source code,

bug reports, check-in history,

test cases and so on. These data

are generated from the product and process of the software.
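The three-way classification described here can be written down as a small lookup structure. This is a minimal sketch: the names `DataCategory`, `DATA_KINDS`, and `categorize` are made up for illustration, and the lists contain only the data kinds named in the talk.

```python
from enum import Enum


class DataCategory(Enum):
    OPERATION = "operation"
    USER = "user"
    PRODUCT = "product"


# Data kinds named in the lecture, grouped by category.
DATA_KINDS = {
    DataCategory.OPERATION: ["runtime traces", "program logs", "events", "performance counters"],
    DataCategory.USER: ["usage logs", "user surveys", "online posts", "blogs"],
    DataCategory.PRODUCT: ["source code", "bug reports", "check-in history", "test cases"],
}


def categorize(kind: str) -> DataCategory:
    """Return the category of a data kind; raise KeyError if it is not listed."""
    normalized = kind.strip().lower()
    for category, kinds in DATA_KINDS.items():
        if normalized in kinds:
            return category
    raise KeyError(f"unknown data kind: {kind!r}")


print(categorize("Program logs").value)
```

A real project would likely have data kinds that fit more than one category, so a flat lookup like this is only a starting point.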

There are many data sources from which we could collect our data.

For example, the source code repositories

store changes to the code, from which we could

collect the source code changes.

The defect tracking systems store the bug related information and

track the resolution of bugs. From the defect tracking systems,

we could collect the defect related data.

The emails record the project communications among

the developers, from which we could better understand

the rationales for decisions throughout the life of a project.

The error reporting system collects the crash

related data. The metric tools can collect

all the process and product related data.

The project hosting website contains the project metadata.

All these data can be collected automatically and can be used

for improving software process and products.
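As one illustration of collecting data automatically from a source code repository, the sketch below sums the changed lines per file from text shaped like the output of `git log --numstat --pretty=format:"commit %h"`. The sample log and file names are made up, the function name `count_changed_lines` is hypothetical, and the exact blank-line layout of real Git output may differ, so the parser simply ignores any line that is not a three-column numstat row.

```python
from collections import Counter


def count_changed_lines(numstat_log: str) -> Counter:
    """Sum added plus deleted lines per file from `git log --numstat` style text."""
    churn = Counter()
    for line in numstat_log.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit header or blank line
        added, deleted, path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported as "-" and are skipped here
        churn[path] += int(added) + int(deleted)
    return churn


# Made-up sample in the numstat layout: added<TAB>deleted<TAB>path.
SAMPLE_LOG = (
    "commit 1111111\n"
    "3\t1\tsrc/Main.java\n"
    "10\t0\tREADME.md\n"
    "\n"
    "commit 2222222\n"
    "5\t2\tsrc/Main.java\n"
    "-\t-\tlogo.png\n"
)

churn = count_changed_lines(SAMPLE_LOG)
for path, lines in churn.most_common():
    print(path, lines)
```

Files with the most changed lines are a common starting point when looking for the parts of a system that change most often.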

The goal of software analytics is to enable software

practitioners to perform data

exploration and analysis in order to obtain insightful and actionable

knowledge for real world tasks.

We have a large amount of data generated over the years.

We could mine these data to obtain insightful and

actionable knowledge. Then, we could build better tools,

and make better decisions in order to improve the quality of the

software and the productivity of the software development process.

Software analytics covers different aspects of software development.

Throughout the entire software life cycle, we can collect

data related to software systems such as the source code,

bug reports, execution logs, code complexity metrics,

performance counters, and so on.

We can then analyze these data to obtain insights about

the quality status of the software system. We can also

collect data from the software development process

and from the evolution process,

for example, the data about software process,

the data about cost and effort, developer data,

source code change data, and so on. We can then analyze these data to

obtain insights into the productivity of the developers.

We can also collect data from software

users such as click-through data, usage data, location data,

user profiles, and so on. We can then analyze these data to obtain insights

about the actual use of the software.

All the obtained insights can be used to help software

practitioners produce better software.

Software analytics utilizes many technologies.

We call them technology pillars. One technology pillar

is big data computing, which leverages the power of machines.

We could use Microsoft Azure, Apache Hadoop,

Spark, and Elasticsearch

to process a large amount of structured and

unstructured data in an efficient way.

A second technology pillar is information visualization,

which leverages the power of humans. We humans are good at visualization

and can quickly identify patterns from images or charts.

Another technology pillar is related to data mining,

machine learning and information retrieval, which connect humans

and machines. We can apply classification, clustering, ranking, regression,

and other such techniques

to mine knowledge from the software engineering data.
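To make the ranking technique concrete, here is a minimal sketch that ranks bug report titles by their similarity to a query using Jaccard similarity over word sets. The report titles are made up for illustration, the function names are hypothetical, and Jaccard similarity is just one simple choice of measure, not a method named in the talk.

```python
def tokens(text: str) -> set:
    """Lowercased whitespace-separated words of a text, as a set."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_reports(query: str, reports: list) -> list:
    """Return (score, report) pairs sorted from most to least similar to the query."""
    query_tokens = tokens(query)
    scored = [(jaccard(query_tokens, tokens(report)), report) for report in reports]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up bug report titles, for illustration only.
REPORTS = [
    "crash when opening large file",
    "toolbar icons are blurry",
    "editor crash on file save",
]

ranking = rank_reports("crash opening file", REPORTS)
for score, report in ranking:
    print(f"{score:.2f}  {report}")
```

The same scoring idea could be used to suggest possible duplicates when a new report arrives, although practical systems typically use richer text features.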

Software analytics can help software developers

and testers to improve their productivity.

It can also help other software practitioners such as

program managers, support engineers and operation engineers to better

understand the software engineering data they collect, and to help them

obtain more insights and make better decisions. They can use the results of

software analytics to verify or reject

hypotheses, discover new insights,

build classification or prediction models,

or construct a knowledge base and query against it.

In the rest of this talk, I will

show some examples of software analytics tasks.
