Our research group has been helping develop the research area of
mining software engineering data. Tao Xie has maintained a comprehensive
bibliography on mining software engineering data. He presented tutorials
on "Mining Software Engineering Data" at KDD 2006, ICDM 2007, ICSE 2007,
and ICSE 2008. He co-organized the 2007 Dagstuhl Seminar on Mining
Programs and Processes.
We would like to hear from you if you are interested in collaborating
with us on any idea in this project. You can contact us by sending
email to
.
How is our research work related to the software industry?
Improving Software Productivity and Quality via Mining Program Source Code,
funded by NSF CSR, ARO STIR
The Yangtse Project on Automated Software Testing in the Absence of
Specifications
Project
Descriptions:
Mining Open Bug Repositories
Open source development projects typically support an open bug
repository such as Bugzilla so that bug reports from all over the world
can be gathered. A new report submitted to this repository must be
triaged: a triager checks whether it duplicates an existing report,
reproduces the problem, and assigns the report to a developer. Triagers
currently detect duplicates by searching the repository for an existing
report that matches the new one. Unfortunately, the results are often
unsatisfactory because of the limitations of natural language
processing. We propose an information-retrieval-based approach that
combines execution information and natural language information to
detect duplicate bugs more precisely. We also propose approaches for
categorizing or clustering new bug reports based on historical bug
report categorization to reduce developers' effort in categorizing new
bug reports.
- [ICSE 2008] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, and Jiasu Sun.
An Approach to Detecting Duplicate Bug Reports using Natural Language and
Execution Information. To appear in Proceedings of the 30th International
Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May
2008. [PDF][BibTeX]
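The idea of combining natural language information with execution
information can be illustrated with a minimal sketch. It is not the
approach from the ICSE 2008 paper. It assumes each report carries a
free-text description and a list of methods executed by the failing run,
and it blends two cosine similarities with a hypothetical weight. The
field names ("id", "text", "trace"), the nl_weight parameter, and all
function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(new, old, nl_weight=0.5):
    """Blend text similarity with execution-trace similarity.

    nl_weight is a hypothetical tuning knob, not a value from the paper.
    """
    nl_sim = cosine(Counter(tokenize(new["text"])), Counter(tokenize(old["text"])))
    # Execution information is modeled as the methods hit by the failing run.
    ex_sim = cosine(Counter(new["trace"]), Counter(old["trace"]))
    return nl_weight * nl_sim + (1 - nl_weight) * ex_sim


def rank_duplicates(new, repository, top_k=3):
    """Return the ids of the top_k existing reports most similar to the new one."""
    scored = [(combined_similarity(new, old), old["id"]) for old in repository]
    return [bug_id for _, bug_id in sorted(scored, reverse=True)[:top_k]]
```

A triager would then inspect only the few highest-ranked candidates
instead of searching the whole repository by hand.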
Mining Open Source Code Repositories
- Mining API Properties for Bug Finding:
A software system interacts with its environment through interfaces.
Improper handling of exceptional returns from system interfaces can cause
robustness problems. The robustness of software systems is governed by
various temporal properties related to interfaces. Static verification has
been shown to be effective in checking these temporal properties, but
manually specifying these properties is cumbersome and requires knowledge
of interface specifications, which are often either unavailable or
undocumented. We propose a framework to automatically infer
system-specific interface specifications from program source code written
in C. We use a model checker to generate traces related to the interfaces.
From these model-checking traces, we infer interface specification details
such as the return value on success or failure. Based on these inferred
specifications, we translate generically specified interface robustness
rules into concrete robustness properties verifiable by static checking.
We implemented our framework on top of an existing static analyzer called
MOPS, which employs push-down model checking, and applied the analyzer to
the well-known POSIX API system interfaces.
We have recently developed several Eclipse Plugins based on Google Code
Search Engine for inferring API properties from Java code returned by code
search engines. These properties are used to find bugs related to
neglected conditions and exception handling. More Eclipse Plugins for C,
C++, and C# are under development.
- [NSFNGS 2008] Tao Xie, Mithun Acharya, Suresh Thummalapenta, and Kunal
Taneja. Improving Software Reliability and Productivity via Mining Program
Source Code. To appear in Proceedings of the NSF Next Generation Software
Program Workshop at IPDPS 2008 (NSFNGS 2008), Miami, Florida, April 2008.
[PDF][BibTeX]
- [NCSU CSC 07] Mithun Acharya and Tao Xie. Static Detection of API
Error-Handling Bugs via Mining Source Code. North Carolina State University
Department of Computer Science Technical Report TR-2007-35, October 15,
2007. [PDF][BibTeX]
- [NCSU CSC 07] Suresh Thummalapenta and Tao Xie. NEGWeb: Static Defect
Detection via Searching Billions of Lines of Open Source Code. North
Carolina State University Department of Computer Science Technical Report
TR-2007-24, September 16, 2007. [PDF][BibTeX]
- [ESEC/FSE 2007] Mithun Acharya, Tao Xie, Jian Pei, and Jun Xu. Mining API
Patterns as Partial Orders from Source Code: From Usage Scenarios to
Specifications. In Proceedings of the 6th joint meeting of the European
Software Engineering Conference and the ACM SIGSOFT Symposium on the
Foundations of Software Engineering (ESEC/FSE 2007), Dubrovnik, Croatia,
pp. 25-34, September 2007. [PDF][BibTeX]
- [ISSRE 2006] Mithun Acharya, Tao Xie, and Jun Xu. Mining Interface
Specifications for Generating Checkable Robustness Properties. In
Proceedings of the 17th IEEE International Conference on Software
Reliability Engineering (ISSRE 2006), Raleigh, NC, pp. 311-320, November
2006. [PDF][BibTeX]
- [ASE 2006] Mithun Acharya, Tanu Sharma, Jun Xu, and Tao Xie. Effective
Generation of Interface Robustness Properties for Static Analysis. In
Proceedings of the 21st IEEE/ACM International Conference on Automated
Software Engineering (ASE 2006), Short Paper, Tokyo, Japan, pp. 293-296,
September 2006. [PDF][BibTeX]
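The inference of return values on success or failure can be pictured with
a small sketch. It does not reproduce the framework or the MOPS trace
format. It assumes each model-checking trace has already been reduced to a
hypothetical (api, return_check, path_kind) triple, where path_kind is
"error" when the path after the check reaches an error exit and "normal"
otherwise, and it picks the most frequent check of each kind by simple
voting. The triple format and both function names are assumptions.

```python
from collections import Counter, defaultdict


def _most_common(counter):
    """Return the most frequent check, or None when none was observed."""
    return counter.most_common(1)[0][0] if counter else None


def infer_return_specs(traces):
    """Infer, per API, the return-value check seen most often on
    error paths ("failure") and on normal paths ("success")."""
    votes = defaultdict(lambda: {"error": Counter(), "normal": Counter()})
    for api, return_check, path_kind in traces:
        votes[api][path_kind][return_check] += 1
    return {
        api: {
            "failure": _most_common(by_kind["error"]),
            "success": _most_common(by_kind["normal"]),
        }
        for api, by_kind in votes.items()
    }


def concretize(api, specs):
    """Instantiate a generic 'handle the failure return' rule for one API.

    The wording of the rule is illustrative only.
    """
    failure = specs[api]["failure"]
    return f"after calling {api}, handle the case where the return value {failure}"
```

For example, traces that mostly pair a "== NULL" check on malloc with an
error exit would yield a concrete property stating that the NULL return of
malloc must be handled.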
- Mining API Usage Patterns for Code Reuse: The modern software industry
increasingly relies on third-party libraries and frameworks provided by
open source projects. Every day, programmers work with these APIs to
accomplish their tasks. Unfortunately, most of these APIs are complex and
difficult to use. An API library often provides a large number of methods
and classes. For example, the Eclipse 3.1 platform SDK provides more than
11,000 classes, not to mention its large community of plug-in projects. In
addition, the APIs provided by different projects follow different styles.
Even experienced programmers may encounter problems when they use
unfamiliar APIs. Because of these difficulties, programmers often struggle
with how to choose proper APIs and how to organize them when several APIs
must be used together to implement a certain feature. We propose to
develop techniques and tools for mining API usage patterns from open
source code repositories to help programmers write API client code. We
have developed a prototype called MAPO based on code search engines such
as Google Code Search Engine. More details can be found in our MSR 2006
paper.
We recently developed an Eclipse Plugin version of MAPO, focusing on a
fixed set of open source repositories.
We also developed another Eclipse Plugin for mining API sequences based on
Google Code Search Engine (improving on an Eclipse Plugin called
Prospector, developed at UC Berkeley, by mining a large number of open
source projects through code search engines).
Related Publications: [ESEC/FSE 2007][MSR 2006][ICSE 06 ER]
- [NSFNGS 2008] Tao Xie, Mithun Acharya, Suresh Thummalapenta, and Kunal
Taneja. Improving Software Reliability and Productivity via Mining Program
Source Code. To appear in Proceedings of the NSF Next Generation Software
Program Workshop at IPDPS 2008 (NSFNGS 2008), Miami, Florida, April 2008.
[PDF][BibTeX]
- [MSR 2008] Suresh Thummalapenta and Tao Xie. SpotWeb:
Detecting Framework Hotspots via Mining Open Source Repositories on the
Web. To appear in Proceedings of the
5th Working Conference on Mining Software Repositories (MSR
2008), Position Paper, Leipzig, Germany, May 2008. [PDF][BibTeX]
- [ASE
2007] Suresh Thummalapenta and Tao Xie. PARSEWeb:
A Programmer Assistant for Reusing Open Source Code on the Web. In Proceedings
of the 22nd IEEE/ACM International Conference on Automated Software
Engineering (ASE 2007),
Atlanta, Georgia, pp. 204-213, November 2007. [PDF][BibTeX]
- [ESEC/FSE 2007] Mithun Acharya, Tao Xie, Jian Pei,
and Jun Xu. Mining API Patterns as
Partial Orders from Source Code: From Usage Scenarios to Specifications. In
Proceedings of the 6th joint meeting of
the European Software Engineering Conference and the ACM SIGSOFT Symposium
on the Foundations of Software Engineering (ESEC/FSE
2007), Dubrovnik, Croatia, pp. 25-34, September, 2007. [PDF][BibTeX]
- [MSR 2006] Tao Xie and Jian Pei. MAPO: Mining
API Usages from Open Source Repositories. In Proceedings of the 3rd
International Workshop on Mining Software Repositories (MSR
2006), Shanghai, China, pp. 54-57, May 2006. [PDF][BibTeX][Slides]
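To make "API usage pattern" concrete, here is a minimal sketch of
support-based mining over call sequences. It is not MAPO's algorithm. It
assumes each code sample returned by a code search engine has already been
reduced to an ordered list of API call names, and it keeps the ordered
subsequences of a fixed length that occur in enough samples. The function
name and the length and min_support parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_call_patterns(call_sequences, length=2, min_support=2):
    """Return ordered API-call subsequences found in at least
    min_support of the given call sequences, with their support."""
    support = Counter()
    for calls in call_sequences:
        # combinations() preserves the original order, so each tuple is a
        # (possibly non-contiguous) subsequence of the call list. The set
        # ensures a pattern is counted at most once per sequence.
        support.update(set(combinations(calls, length)))
    return {pattern: n for pattern, n in support.items() if n >= min_support}
```

With samples such as [File.open, File.read, File.close] and [File.open,
Log.info, File.close], the pair (File.open, File.close) reaches support 2
and would be suggested as a usage pattern, while one-off pairs are
dropped.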
- Mining Test Code for Assisting Developer
Testing: When developers build a
software system based on third-party open source libraries and
frameworks, some class interfaces in the system may have method
arguments of non-primitive types from the third-party libraries and
frameworks. During automated unit test generation, a test generation
tool usually has difficulty generating meaningful values for
these non-primitive-type arguments. We propose to mine method
sequences that produce objects of the non-primitive types, and to use
these method sequences to produce meaningful arguments as test data.
The high-level idea is to exploit the code written by the open
source communities to help developers test the code at hand.
- Yoonki Song, Suresh
Thummalapenta, and Tao Xie. UnitPlus:
Assisting Developer Testing in Eclipse. In Proceedings
of the Eclipse Technology
eXchange Workshop at OOPSLA 2007 (ETX
2007), Montreal, Canada, October 2007. (Best
Student Paper Award) [PDF][BibTeX]
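One simple way to picture how a method sequence can produce an object of a
non-primitive type is a backward search over mined method signatures. The
sketch below is an illustration only, not the mining technique itself. It
assumes the mined code has been summarized as a hypothetical "producers"
map from a type name to the methods that return it, each with the types it
needs (a receiver object is treated as one more input). The function name
and data layout are assumptions.

```python
def build_sequence(target, producers, available, _visiting=frozenset()):
    """Return an ordered list of method names whose execution yields an
    object of type `target`, or None if no such sequence is found.

    producers: dict mapping a type name to a list of
               (method_name, input_types) pairs.
    available: set of types that need no construction (e.g. primitives).
    """
    if target in available:
        return []
    if target in _visiting:
        return None  # avoid cycles such as A needing B and B needing A
    for method_name, input_types in producers.get(target, []):
        sequence = []
        for needed in input_types:
            sub = build_sequence(needed, producers, available,
                                 _visiting | {target})
            if sub is None:
                break  # this producer cannot be satisfied; try the next one
            sequence.extend(call for call in sub if call not in sequence)
        else:
            return sequence + [method_name]
    return None
```

A test generation tool could replay the returned sequence to obtain a
meaningful argument instead of passing null or a default-constructed
object.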
- Mining Open
Source Version Histories: Version control systems such as CVS or SVN
track the evolution of source code in a software project. We propose to
investigate the evolution of test code and co-evolution of test code
and production code.
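As one possible starting point for studying co-evolution, the sketch below
computes how often a commit that changes production code also changes test
code. It assumes the version history has already been exported as a list
of commits, each a list of changed file paths, and it uses a hypothetical
path-based rule (the path contains "test") to tell test code from
production code. Both the input format and the rule are assumptions.

```python
def co_change_ratio(commits, is_test=lambda path: "test" in path.lower()):
    """Fraction of production-changing commits that also touch test code.

    commits: iterable of lists of changed file paths, one list per commit.
    is_test: hypothetical predicate that classifies a path as test code.
    """
    production_commits = 0
    co_changed = 0
    for changed_files in commits:
        touches_test = any(is_test(path) for path in changed_files)
        touches_production = any(not is_test(path) for path in changed_files)
        if touches_production:
            production_commits += 1
            if touches_test:
                co_changed += 1
    return co_changed / production_commits if production_commits else 0.0
```

A low ratio would suggest that test code lags behind production changes,
which is the kind of observation such a study could examine in more depth.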
Mining Open Source Community Data
In open source project repositories, archived project
communications record rationale for decisions throughout the life of a
project. We propose to investigate social dynamics of open source
developer or user communities by mining project communications and
other types of data for the open source communities.
Tutorials/Course
Modules:
- Tao Xie. Data Mining III - Text Mining course module. The Master
of Science in Analytics (MSA) program, the
Institute for Advanced Analytics, North Carolina State University,
January-February 2008.
- Ahmed E. Hassan and Tao Xie. Mining
Software Engineering Data. To appear in Proceedings of the 30th
International Conference on Software Engineering (ICSE
2008), Companion Volume, Tutorials, Leipzig,
Germany, May 2008. [Tutorial Web][BibTeX]
- Chao Liu, Tao Xie, and Jiawei Han. Mining
for Software Reliability. In Proceedings of the 2007 IEEE
International Conference on Data Mining (ICDM
2007), Omaha, NE, October 2007. [Tutorial
Web][BibTeX]
- Tao Xie, Jian Pei, and Ahmed E. Hassan. Mining
Software Engineering Data. In Proceedings of the 29th
International Conference on Software Engineering (ICSE
2007), Companion Volume, Tutorials, Minneapolis,
MN, pp. 172-173, May 2007. [Tutorial
Web][PDF][BibTeX]
- Tao Xie and Jian Pei. Data Mining for Software Engineering. In Proceedings
of the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD 2006), Tutorial,
Philadelphia, Pennsylvania, August 2006. [Tutorial
Web][Slides][BibTeX]
Presentations:
- Tao Xie. Recommendation Systems for Code Reuse. Workshop talk, Bellairs
Workshop On Software Analysis for Recommendation Systems (SARS
2008), Barbados, February, 2008. [Slides]
- Suresh Thummalapenta. PARSEWeb: A Programmer Assistant for Reusing Open
Source Code on the Web. Conference presentation, the
22nd IEEE/ACM International Conference on Automated Software Engineering
(ASE 2007), Atlanta, Georgia,
November 2007.
- Tao Xie. Improving Software Productivity and Quality via Mining Program
Source Code. Invited talk, Accenture Labs, Chicago, IL, October 2007.
- Tao Xie. Improving Software Productivity and Quality via Mining Program
Source Code. Invited talk, Motorola Labs, Schaumburg, IL, October 2007.
- Tao Xie. Improving Automation in Developer Testing: Achievements and
Challenges. Conference talk, International Verify Conference (Verify
2007), Arlington, VA, October 2007.
- Suresh Thummalapenta. Exploiting code search engines to improve programmer
productivity. Conference ACM SIGPLAN SRC presentation, the
22nd Annual ACM SIGPLAN International Conference on Object-Oriented
Programming, Systems, Languages, and Applications (Companion) (OOPSLA
2007), ACM
SIGPLAN Student Research Competition, Montreal, Canada,
October 2007.
- Tao Xie. Improving Automation in Developer Testing: Achievements and
Challenges. Conference talk, Triangle Information Systems Quality
Association Conference (TISQA
2007), Chapel Hill, NC, September 2007.
- Mithun Acharya. Mining API Patterns as Partial Orders from Source Code:
From Usage Scenarios to Specifications. Conference presentation, the
6th joint meeting of the European Software Engineering Conference and the
ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE
2007), Dubrovnik, Croatia, September, 2007.
- Tao Xie. Improving Software Productivity and Quality via Mining Program
Source Code. Invited talk, Lane Department of Computer Science and
Electrical Engineering, West Virginia University, Morgantown, WV,
September 2007.
- Tao Xie. Improving Programmer Productivity via Mining Program Source Code.
Invited talk, Department of Computer Science and Engineering, Hong Kong
University of Science and Technology, China, August 2007.
- Tao Xie. Improving Programmer Productivity via Mining Program Source Code.
Invited talk, Department of Computer Science and Engineering, The Chinese
University of Hong Kong, Hong Kong, China, August 2007.
- Tao Xie. Mining Software Engineering Data. Invited talk, Software
Engineering Institute, Peking University, Beijing, China, July 2007.
- Tao Xie. Improving Programmer Productivity via Mining Program Source Code.
Invited talk, Department of Computer Science, University of Calgary, Canada,
May 2007.
- Mithun Acharya. Mining Interface Specifications for Generating Checkable
Robustness Properties. Conference presentation, the 17th IEEE
International Conference on Software Reliability Engineering (ISSRE
2006), Raleigh, NC, November 2006.
- Mithun Acharya. Automatic Inference of Interface Properties from Program
Source Code. Conference doctoral symposium presentation, the
14th ACM SIGSOFT Symposium on Foundations of Software Engineering (FSE
2006), Doctoral
Symposium, Portland, Oregon, USA, November 2006.
- Mithun Acharya. Automatic Generation and Inference of Interface Properties
from Program Source Code. Conference ACM SIGPLAN SRC presentation, the
21st Annual ACM SIGPLAN International Conference on Object-Oriented
Programming, Systems, Languages, and Applications (Companion) (OOPSLA
2006), ACM
SIGPLAN Student Research Competition, Portland, Oregon, USA, October
2006.
- Mithun Acharya. Effective Generation of Interface Robustness Properties
for Static Analysis. Conference poster presentation, the 21st
IEEE/ACM International Conference on Automated Software Engineering (ASE
2006), Tokyo, Japan, September 2006.
- Mithun Acharya. Automatic Generation of Robustness and Security Properties
from Program Source Code. Conference
student forum presentation, the IEEE
International Conference on Dependable Systems and Networks (DSN
2006),
Student Forum, Philadelphia, PA, USA, June 2006.
- Tao Xie. Data Mining for Software Engineering. Visit talk, Fudan
University, China, May 2006.
- Tao Xie. MAPO: Mining API Usages from Open Source Repositories. Workshop
presentation, the 3rd International
Workshop on Mining Software Repositories (MSR
2006), Shanghai, China, May
2006. [Slides]
Software:
- NEGWeb: Static Defect
Detection via Searching Billions of Lines of Open Source Code
- PARSEWeb: A
Programmer Assistant for Reusing Open Source Code on the Web
- MAPO: Mining
API Usages from Open Source Repositories
- UnitPlus: Assisting Developer Testing in Eclipse
Links:
Bibliography on Mining Software
Engineering Data
Related Research Projects on
Open Source Software Development
Related Courses on Open Source Software
Development
Related Research Events on
Open Source Software Development
Related Publications:
(Software
Engineering Conferences) (Software
Testing Researchers) Also see
Tao Xie's publications.
SPONSORS
National Science Foundation Award CNS-0720641,
Computer Systems Research (CSR) Program (08/01/2007-07/31/2008)
Army
Research Office Award W911NF-07-1-0431,
Short Term Innovative Research (STIR) Program (06/18/2007-03/17/2008)
