Mining Open Source Software Engineering Data


[Project Descriptions]   [Tutorials/Presentations]   [Software]   [Related Projects]  [Related Events]  [Related Publications]



The Automated Software Engineering Research Group at North Carolina State University conducts research on mining software engineering data, including open source software engineering data. Since the late 1990s, various data mining techniques have been applied to analyze software engineering data and have achieved many notable successes. The accumulated experience and lessons of data mining for software engineering pose interesting challenges and opportunities for new research and development. Open source software development makes abundant software engineering data available for mining (e.g., open source code repositories, version histories, bug repositories, and community data). On the other hand, the diversity of this data also poses new challenges. For example, usage data for the same API drawn from a large number of open source projects may be quite diverse, because different projects may use the same API in very different ways. Bug reports submitted directly by open source communities can also be quite diverse, because different users may describe the same bug in very different ways. In addition, open source communities create social dynamics that may be quite different from those of the traditional software development process.

Our research group has been helping to develop the research area of mining software engineering data. Tao Xie has maintained a comprehensive bibliography on mining software engineering data. He presented tutorials on "Mining Software Engineering Data" at KDD 2006, ICDM 2007, ICSE 2007, and ICSE 2008. He co-organized the 2007 Dagstuhl Seminar on Mining Programs and Processes.

We would like to hear from you if you are interested in collaborating with us on any idea in this project. You can contact us by sending email to .

How is our research work related to the software industry?

Improving Software Productivity and Quality via Mining Program Source Code funded by NSF CSR, ARO STIR

The Yangtse Project on Automated Software Testing in the Absence of Specifications


Project Descriptions:

Mining Open Bug Repositories

Open source development projects typically maintain an open bug repository such as Bugzilla so that bug reports from all over the world can be gathered. A new report submitted to the repository must be triaged: the triager checks whether it duplicates an existing report, reproduces the problem, and assigns the report to a developer. Triagers currently detect duplicate reports by searching for an existing bug report that matches the new one. Unfortunately, the results are often unsatisfactory because of the limitations of natural language processing. We propose an information-retrieval-based approach that combines execution information and natural language information to detect duplicate bug reports more precisely. We also propose approaches for categorizing or clustering new bug reports based on historical bug report categorization, to reduce developers' effort in categorizing new bug reports.
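As a rough illustration of combining the two kinds of information, the sketch below ranks existing reports against a new one by a weighted sum of a text similarity and an execution similarity. It is not the group's actual approach: the TF-IDF cosine measure, the Jaccard overlap of executed method names, the report fields ('id', 'text', 'trace'), and the weight alpha are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a report summary and split it into alphanumeric tokens."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs):
    """Map each token list to a sparse {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf so that terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def jaccard(a, b):
    """Overlap of two collections of executed method names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_duplicates(new_report, existing, alpha=0.5):
    """Return (score, id) pairs for existing reports, best candidate first.

    Each report is a dict with hypothetical keys 'id', 'text', and 'trace'.
    alpha (an assumed value) weights text similarity against execution similarity.
    """
    docs = [tokenize(r["text"]) for r in existing] + [tokenize(new_report["text"])]
    vecs = tfidf_vectors(docs)
    new_vec = vecs[-1]
    scored = []
    for report, vec in zip(existing, vecs):
        text_sim = cosine(new_vec, vec)
        exec_sim = jaccard(new_report["trace"], report["trace"])
        scored.append((alpha * text_sim + (1 - alpha) * exec_sim, report["id"]))
    return sorted(scored, reverse=True)


existing = [
    {"id": 101, "text": "Crash when opening bookmarks menu",
     "trace": ["Menu.open", "Bookmarks.load", "Store.read"]},
    {"id": 102, "text": "Page renders slowly with many images",
     "trace": ["Renderer.paint", "Image.decode"]},
]
new_report = {"id": 103, "text": "Browser dies after I click on my saved bookmarks",
              "trace": ["Menu.open", "Bookmarks.load", "Store.read", "Log.write"]}
ranking = rank_duplicates(new_report, existing)
print(ranking[0][1])  # id of the most likely duplicate
```

In this made-up data the two matching reports share only one word, so the execution overlap contributes most of the score. That is the situation in which adding execution information would be expected to help.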

Mining Open Source Code Repositories

 

Mining Open Source Community Data

In open source project repositories, archived project communications record the rationale for decisions throughout the life of a project. We propose to investigate the social dynamics of open source developer or user communities by mining project communications and other types of community data.
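One simple way to start looking at such dynamics is to build a "who replies to whom" graph from archived mailing-list messages. The sketch below is only an illustration under stated assumptions: the message fields ('id', 'sender', 'in_reply_to') and the sample data are hypothetical, and counting replies is one possible measure rather than the analysis proposed by the project.

```python
from collections import Counter


def reply_graph(messages):
    """Count directed reply edges as (replier, original author) pairs.

    Each message is a dict with hypothetical keys 'id', 'sender', and
    'in_reply_to' (the id of the parent message, or None for a new thread).
    """
    author_of = {m["id"]: m["sender"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        # Skip thread starters, unknown parents, and replies to oneself.
        if parent in author_of and author_of[parent] != m["sender"]:
            edges[(m["sender"], author_of[parent])] += 1
    return edges


def replies_received(edges):
    """Total number of replies each author received from other people."""
    received = Counter()
    for (_, author), count in edges.items():
        received[author] += count
    return received


messages = [
    {"id": "m1", "sender": "alice", "in_reply_to": None},
    {"id": "m2", "sender": "bob", "in_reply_to": "m1"},
    {"id": "m3", "sender": "carol", "in_reply_to": "m1"},
    {"id": "m4", "sender": "alice", "in_reply_to": "m2"},
    {"id": "m5", "sender": "bob", "in_reply_to": "m4"},
]
edges = reply_graph(messages)
received = replies_received(edges)
print(received.most_common(1))  # the participant who attracts the most replies
```

Edge counts like these could feed standard graph measures (for example, degree or centrality) to identify core contributors or changes in participation over time.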


Tutorials/Course Modules:

  1. Tao Xie. Data Mining III - Text Mining. Course module, the Master of Science in Analytics (MSA) program, the Institute for Advanced Analytics, North Carolina State University, January-February 2008.
  2. Ahmed E. Hassan and Tao Xie. Mining Software Engineering Data. To appear in Proceedings of the 30th International Conference on Software Engineering (ICSE 2008), Companion Volume, Tutorials, Leipzig, Germany, May 2008. [Tutorial Web][BibTeX]
  3. Chao Liu, Tao Xie, and Jiawei Han. Mining for Software Reliability. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, October 2007. [Tutorial Web][BibTeX]
  4. Tao Xie, Jian Pei, and Ahmed E. Hassan. Mining Software Engineering Data. In Proceedings of the 29th International Conference on Software Engineering (ICSE 2007), Companion Volume, Tutorials, Minneapolis, MN, pp. 172-173, May 2007. [Tutorial Web][PDF][BibTeX]
  5. Tao Xie and Jian Pei. Data Mining for Software Engineering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), Tutorial, Philadelphia, Pennsylvania, August 2006. [Tutorial Web][Slides][BibTeX]


Presentations:

  1. Tao Xie. Recommendation Systems for Code Reuse. Workshop talk, Bellairs Workshop On Software Analysis for Recommendation Systems (SARS 2008), Barbados, February, 2008. [Slides]
  2. Suresh Thummalapenta. PARSEWeb: A Programmer Assistant for Reusing Open Source Code on the Web. Conference presentation, the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), Atlanta, Georgia, November 2007.
  3. Tao Xie. Improving Software Productivity and Quality via Mining Program Source Code. Invited talk, Accenture Labs, Chicago, IL, October 2007.
  4. Tao Xie. Improving Software Productivity and Quality via Mining Program Source Code. Invited talk, Motorola Labs, Schaumburg, IL, October 2007.
  5. Tao Xie. Improving Automation in Developer Testing: Achievements and Challenges. Conference talk, International Verify Conference (Verify 2007), Arlington, VA, October 2007. 
  6. Suresh Thummalapenta. Exploiting code search engines to improve programmer productivity. Conference ACM SIGPLAN SRC presentation, the 22nd Annual ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (Companion) (OOPSLA 2007), ACM SIGPLAN Student Research Competition, Montreal, Canada, October 2007.
  7. Tao Xie. Improving Automation in Developer Testing: Achievements and Challenges. Conference talk, Triangle Information Systems Quality Association Conference (TISQA 2007), Chapel Hill, NC, September 2007. 
  8. Mithun Acharya. Mining API Patterns as Partial Orders from Source Code: From Usage Scenarios to Specifications. Conference presentation, the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007), Dubrovnik, Croatia, September 2007.
  9. Tao Xie. Improving Software Productivity and Quality via Mining Program Source Code. Invited talk, Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, September 2007.
  10. Tao Xie. Improving Programmer Productivity via Mining Program Source Code. Invited talk, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, China, August 2007.
  11. Tao Xie. Improving Programmer Productivity via Mining Program Source Code. Invited talk, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China, August 2007.
  12. Tao Xie. Mining Software Engineering Data. Invited talk, Software Engineering Institute, Peking University, Beijing, China, July 2007.
  13. Tao Xie. Improving Programmer Productivity via Mining Program Source Code. Invited talk, Department of Computer Science, University of Calgary, Canada, May 2007.
  14. Mithun Acharya. Mining Interface Specifications for Generating Checkable Robustness Properties. Conference presentation, the 17th IEEE International Symposium on Software Reliability Engineering (ISSRE 2006), Raleigh, NC, November 2006.
  15. Mithun Acharya. Automatic Inference of Interface Properties from Program Source Code. Conference doctoral symposium presentation, the 14th ACM SIGSOFT Symposium on Foundations of Software Engineering (FSE 2006), Doctoral Symposium, Portland, Oregon, USA, November 2006.
  16. Mithun Acharya. Automatic Generation and Inference of Interface Properties from Program Source Code. Conference ACM SIGPLAN SRC presentation, the 21st Annual ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (Companion) (OOPSLA 2006), ACM SIGPLAN Student Research Competition, Portland, Oregon, USA, October 2006.
  17. Mithun Acharya. Effective Generation of Interface Robustness Properties for Static Analysis. Conference poster presentation, the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE 2006), Tokyo, Japan, September 2006.
  18. Mithun Acharya. Automatic Generation of Robustness and Security Properties from Program Source Code. Conference student forum presentation, the IEEE International Conference on Dependable Systems and Networks (DSN 2006), Student Forum, Philadelphia, PA, USA, June 2006.
  19. Tao Xie. Data Mining for Software Engineering. Visit talk, Fudan University, China, May 2006.
  20. Tao Xie. MAPO: Mining API Usages from Open Source Repositories. Workshop presentation, the 3rd International Workshop on Mining Software Repositories (MSR 2006), Shanghai, China, May 2006. [Slides]


Software:

  1. NEGWeb: Static Defect Detection via Searching Billions of Lines of Open Source Code
  2. PARSEWeb: A Programmer Assistant for Reusing Open Source Code on the Web
  3. MAPO: Mining API Usages from Open Source Repositories
  4. UnitPlus: Assisting Developer Testing in Eclipse

Links:

Bibliography on Mining Software Engineering Data


Related Research Projects on Open Source Software Development

Related Courses on Open Source Software Development

Related Research Events on Open Source Software Development


 

Related Publications: (Software Engineering Conferences) (Software Testing Researchers) Also see Tao Xie's publications.

Research Foundations Research Subareas

 

SPONSORS

National Science Foundation Award CNS-0720641, Computer Systems Research (CSR) Program (08/01/2007-07/31/2008)

Army Research Office Award W911NF-07-1-0431, Short Term Innovative Research (STIR) Program (06/18/2007-03/17/2008)

 
