
Mining Software Engineering Data
Research Experiences for Undergraduates
PROJECT SUMMARY
Since the late 1990s, various data mining techniques have been applied to
analyze software engineering data and have achieved many notable successes.
The substantial experience, development, and lessons accumulated in data mining
for software engineering pose interesting challenges and opportunities for new
research and development. This project investigates techniques and tools for
mining software engineering data such as bug reports, developer mailing lists,
API documents, and code comments.
PEOPLE
Faculty Mentor
Tao Xie (Principal Investigator)
Graduate Student Co-Mentor
Suresh Thummalapenta (Ph.D. Student)
Undergraduate Students
Justin W. Gorham (Jan 08-)
Anjali Khatri (April 08-)
Robinson N. Udechukwu (April 08-)
RESEARCH TASKS
- Write a research log in our ASE REU blog while working on this project,
documenting your research activities, the questions you have, the
difficulties you face, etc.
- Browse and play around with cool applications of text mining
- Subtasks 1 and 2 are conducted concurrently and Subtask 3 is conducted after Subtasks 1 and 2
Research Subtasks
- Subtask 1: Learn how to use SAS Text Miner. (If your local machine has
enough CPU, memory, and hard-drive space, you can consider installing SAS 9.1
on it instead of using VCL. Drop Dr. Xie an email if you plan to do so.)
- Step 1: Use NCSU VCL to configure the environment of SAS
v9.1.3 SP4 (WinXP)
- Step 2: Log in to VCL with remote desktop.
- Step 3: Add C:\Program Files\SAS\SAS 9.1\tmine\sasexe to your PATH system variable (i.e., Start->All
Programs->Control Panel->System, Advanced tab, click Environment
Variables, then append the preceding path to the Path system variable,
putting a semicolon (";") before the appended path)
- Step 4: Inside VCL, log in to Wolfcall with your unity id and password
- Step 5: Start SAS 9.1 (e.g., Start->All Programs->SAS->SAS 9.1). When you
see a dialog with "Getting Started with SAS", click the Close button.
- Step 6: In SAS 9.1's menu, select Solutions->Analysis->Enterprise Miner. When you see a dialog with "Start Tutorials?", click the Close button.
- Step 7: Follow the instructions in Getting Started with SAS 9.1 Text Miner
- You can skip Chapter 3, but feel free to give it a try. Mastering the
techniques described in Chapter 3 lets you import any textual data (such as
your email or the documents on your machine) into SAS yourself and apply
text mining to it.
- Subtask 2: Learn general knowledge about text mining
- Subtask 3: Apply SAS Text Miner to software engineering data, including bug reports and developer mailing lists.
- SE data from developer mailing lists is available here. The "ReadMe.txt" file inside provides the necessary instructions for using the developer mailing list information. The current data is a sample; a larger dataset is available here.
- Bug report data is available here. Each folder in the zip file contains a set of bug reports. Users can reuse the SAS scripts provided for the developer mailing lists.
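Step 3 of Subtask 1 is easy to get wrong by hand: a missing semicolon merges two Path entries into one invalid path. The following is a minimal Python sketch, not part of the SAS or VCL tooling, that checks whether a Path string already contains the SAS Text Miner directory and appends it with a semicolon separator if it does not. The function names are hypothetical, and the availability of Python on the VCL image is an assumption. The sketch only computes the new string; it does not modify the system variable.

```python
import ntpath
import os

# Directory named in Step 3 of Subtask 1.
SAS_TMINE_DIR = r"C:\Program Files\SAS\SAS 9.1\tmine\sasexe"


def path_entries(path_value):
    """Split a Windows-style Path string into its non-empty entries."""
    return [e.strip().strip('"') for e in path_value.split(";") if e.strip()]


def normalize(entry):
    """Normalize case and trailing separators so entries compare equal."""
    return ntpath.normcase(ntpath.normpath(entry))


def has_entry(path_value, directory=SAS_TMINE_DIR):
    """Return True if `directory` is already one of the Path entries."""
    target = normalize(directory)
    return any(normalize(e) == target for e in path_entries(path_value))


def append_entry(path_value, directory=SAS_TMINE_DIR):
    """Return the Path string with `directory` appended after a semicolon."""
    if has_entry(path_value, directory):
        return path_value  # already present, leave the value unchanged
    if not path_entries(path_value):
        return directory   # empty Path: no separator needed
    return path_value.rstrip(";") + ";" + directory


if __name__ == "__main__":
    current = os.environ.get("PATH", "")
    print("SAS Text Miner directory on PATH:", has_entry(current))
```

Using `ntpath` rather than `os.path` keeps the comparison Windows-style (case-insensitive, backslash-separated) even when the sketch is run on another operating system.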
LINKS
SAS Text Miner Basics
- The variables in the data set are generally divided into four categories: identification, input, target, and rejected.
- A rejected variable is not used in any data mining analysis.
- An input (also called an independent variable) is used in the various models and exploratory tools in the software.
- Often, input variables are used to predict a target value (also called a dependent value).
- An identification variable is used to label a particular observation in
the data set. Although identification variables are not used to predict
outcomes, they are used to link different observations to the same ID.
Some data mining techniques require a target variable, while others need
only input variables.
- There are other categories of variables as well that are used for more specialized analyses. In particular, a time ID variable is used in place of the more standard ID variable when the data are tracked longitudinally. A text variable is used with SAS Text Miner software. A raw data set is used to perform the initial analyses, unless the data are longitudinal, in which case the data set is identified as transactional.
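To make the four main variable roles concrete, here is a minimal Python sketch. It is an illustration only and does not use or mirror any SAS API; the `Variable` class, the helper functions, and the bug-report column names are all hypothetical. It shows that only input and target variables take part in predictive modeling, while identification variables label observations and rejected variables are ignored.

```python
from dataclasses import dataclass

# The four main roles described above.
ROLES = ("id", "input", "target", "rejected")


@dataclass(frozen=True)
class Variable:
    name: str
    role: str

    def __post_init__(self):
        # Reject roles outside the four categories.
        if self.role not in ROLES:
            raise ValueError(f"unknown role {self.role!r} for {self.name!r}")


def names_with_role(variables, role):
    """Return the names of all variables that have the given role."""
    return [v.name for v in variables if v.role == role]


def modeling_view(variables):
    """Return (inputs, targets); id and rejected variables are left out."""
    return names_with_role(variables, "input"), names_with_role(variables, "target")


# Hypothetical columns of a bug-report data set (assumed for illustration).
BUG_REPORT_VARIABLES = [
    Variable("bug_id", "id"),               # labels each observation
    Variable("component", "input"),         # used to predict the target
    Variable("num_comments", "input"),
    Variable("severity", "target"),         # the value to be predicted
    Variable("reporter_email", "rejected"), # excluded from all analyses
]

inputs, targets = modeling_view(BUG_REPORT_VARIABLES)
print("inputs:", inputs)
print("targets:", targets)
```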
- Test Data, Training Data, and Validation Data.
- Test data: currently available data that contains input values and target
values that are not used during training, but which instead are used to
assess generalization and to compare models. See also training data and
validation data.
- Training data: currently available data that contains input values and
target values that are used for model training. See also test data and
validation data.
- Validation data: data that is used to validate
the suitability of a data model that was developed using training data. Both
training data sets and validation data sets contain target variable values.
Target variable values in the training data are used to train the model. Target
variable values in the validation data set are used to compare the training
model’s predictions to the known target values, assessing the model’s fit before
using the model to score new data. See also test data and training data.
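The three kinds of data above can be sketched as a simple partition of one labeled data set. The following Python sketch is an illustration under stated assumptions, not the partitioning procedure used by SAS: the `partition` function is hypothetical, and the 60/20/20 split and the fixed random seed are arbitrary choices made for the example.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into (training, validation, test) lists.

    The test share is whatever remains after the training and validation
    shares are taken.
    """
    if train <= 0 or validation < 0 or train + validation >= 1:
        raise ValueError("fractions must leave a non-empty share for test data")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validating = shuffled[n_train:n_train + n_valid]
    testing = shuffled[n_train + n_valid:]
    return training, validating, testing


# Ten hypothetical observation IDs standing in for labeled records.
training, validating, testing = partition(range(10))
print(len(training), len(validating), len(testing))
```

Each observation lands in exactly one partition, so the test data stays unseen during both training and validation and can be used to compare models fairly.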
SPONSOR
College of Engineering and Department of Computer Science, North Carolina State University (01/24/2008-05/16/2008)