MAPO: Mining API Usages from Open Source Repositories
Related Publication:
Tao Xie and Jian Pei. MAPO: Mining API Usages from Open Source Repositories. In Proceedings of the 3rd
International Workshop on Mining Software Repositories (MSR 2006),
Shanghai, China, pp. 54-57, May 2006. [PDF][BibTeX][Slides]
This project develops a tool that mines API usage from partial source code
(that is, code for which the complete, compilable source is not available).
For example, the source files can come from the results returned by a search
engine for open source projects, such as http://www.koders.com/. The tool can,
of course, also mine information from complete local source code. The
tool consists of three components:
-
Source code collector: this component, which is still to be
developed, will collect the top N source code files returned by koders.com
for given search keywords. The keywords can be feature names such as "logging",
package names, class names, or method names. So far we download the
source code files from koders.com manually. We can also collect source code
from downloaded open source projects.
-
Source code analyzer: given the collected Java source
files, this component analyzes each Java source file and produces a file
containing method-call sequences invoked by each method in the Java source
file. It also exports a single sequence database file that can be analyzed
by the BIDE sequence mining tool.
-
Usage pattern miner (BIDE): given the
sequence database, this component mines frequent usage patterns from
the sequence database. The BIDE tool is not publicly available; it is
provided by the BIDE authors upon request. An alternative frequent
sequence miner is SPAM, which can be downloaded from http://himalaya-tools.sourceforge.net/. Note that with SPAM you need to adapt the miner input format described below, which is specific to the BIDE miner.
Source code collector:
- So far we manually download the
source code files from www.koders.com.
Source code analyzer:
- Installation: The source code of this component can be downloaded
from this web page. It is built on PMD,
a Java source code scanner. To run it, download this
modified pmd jar file (the source files are included in the jar file) and add it to your classpath. In addition,
download the following jar files and add them to your classpath: ant,
jaxen,
xercesImpl,
and xmlParser.
You also need to download this
ruleset zip and extract it to your local hard drive. The examples below
assume the directory is C:\xtwork\pmd\pmd-3.4\rulesets.
- Usage: java net.sourceforge.pmd.PMD /path/to/source text c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
e.g., java net.sourceforge.pmd.PMD
c:\xtwork\pmd\pmd-3.4\examples\BCELClassAnalyzer.java text
c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
You can also analyze all the files under the same directory by specifying
the path to the source rather than a specific source file name:
e.g., java net.sourceforge.pmd.PMD c:\xtwork\pmd\pmd-3.4\examples text c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
If you want to run the tool over a jar or zip file containing all the source
files, please refer to PMD's
usage documentation, which still applies to our tool.
- Outputs: For each Java source file, the tool writes four files to the
same directory. Assuming your Java source file is BCELAnalyzer.java, you
will see:
BCELAnalyzer.java.woce: method sequences with inlined local method calls
BCELAnalyzer.java.woc: method sequences without inlined local method calls,
so local method calls such as this.XXX appear in the sequences. This file
is for debugging use.
BCELAnalyzer.java.full: not used; eventually we will add control flow
information among method call sequences.
BCELAnalyzer.java.debug: debug information, including control flow
information.
At the moment, only the BCELAnalyzer.java.woce file is useful for debugging.
In addition, it outputs the following files to be used for mining (for the
subject described below):
mcseq.txt: input to BIDE (the sequence dataset)
mcseq.spec: input to BIDE (the specification file)
mcseq.map: mapping from method names to method
IDs, which are used in mcseq.txt
mcseq.txt.debug: readable form of
mcseq.txt, for debugging
For example, the following is one line in mcseq.txt.debug (for the
representation of method call names, see "*.woce file format" below). A
line lists the method calls invoked by a caller, separated by spaces, and
ends with "-1". After the "-1" we also put the
caller name, which is not present in the mcseq.txt file. In the mcseq.txt
file, method calls are represented by their method IDs, whose mappings are
given in mcseq.map.
org.xml.sax.helpers.AttributesImpl,<init>
org.xml.sax.ContentHandler,startDocument
org.xml.sax.ContentHandler,startElement(4)
org.xml.sax.ContentHandler,characters(3)
org.xml.sax.ContentHandler,endElement(3)
org.xml.sax.ContentHandler,endDocument -1 generateLargeSAX(1)
@AbstractXMLTestCase.java.woc
The corresponding line in mcseq.txt:
0 1 2 3 4 5 -1
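The relationship between the two lines above can be sketched in Java as follows. This is only an illustration, not the analyzer's actual code: the class name McseqEncoder is hypothetical, and it assumes that method IDs start at 0 and are assigned in order of first appearance, which is consistent with the example but not stated anywhere on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turns a mcseq.txt.debug line into a mcseq.txt line.
final class McseqEncoder {
    // Assumption: IDs are assigned in order of first appearance, starting at 0.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    /** Encodes the method calls of one debug line as IDs terminated by -1. */
    String encode(String debugLine) {
        StringBuilder out = new StringBuilder();
        for (String token : debugLine.trim().split("\\s+")) {
            if (token.equals("-1")) {
                break; // what follows the terminator is the caller name
            }
            Integer id = ids.get(token);
            if (id == null) {
                id = ids.size();
                ids.put(token, id);
            }
            out.append(id).append(' ');
        }
        return out.append("-1").toString();
    }

    /** Returns the name-to-ID mapping built so far (the role of mcseq.map). */
    Map<String, Integer> mapping() {
        return ids;
    }

    public static void main(String[] args) {
        McseqEncoder encoder = new McseqEncoder();
        String debugLine = "org.xml.sax.helpers.AttributesImpl,<init> "
                + "org.xml.sax.ContentHandler,startDocument "
                + "org.xml.sax.ContentHandler,startElement(4) "
                + "org.xml.sax.ContentHandler,characters(3) "
                + "org.xml.sax.ContentHandler,endElement(3) "
                + "org.xml.sax.ContentHandler,endDocument -1 "
                + "generateLargeSAX(1) @AbstractXMLTestCase.java.woc";
        System.out.println(encoder.encode(debugLine)); // prints "0 1 2 3 4 5 -1"
    }
}
```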
- *.woce file format: Sequences are separated by an empty line.
Each sequence starts with a line that begins with "caller:", followed
by the name of a method defined in the Java source
code. Note that when you import the sequence into a sequence database, this
first line should be ignored. If a method has one or more parameters, the
method name is followed by "(PARAM_NUM)", the number of parameters. The subsequent
lines list the method calls that are invoked within the method. The
naming of the method calls in the sequence is similar to the above, but each
method name is also preceded by its package-qualified class name, separated
from the method name by ",".
caller: prepareMethodMap(1)
Class,getDeclaredMethods
org.apache.bcel.Repository,lookupClass(1)
org.apache.bcel.classfile.JavaClass,getMethods
org.apache.bcel.classfile.Method,getName
org.apache.bcel.generic.ArrayType,getDimensions
Class,equals(1)
org.apache.bcel.classfile.Method,getArgumentTypes
Class,equals(2)
caller: findAndAddBCELMethod(2)
org.apache.bcel.classfile.Method,getName
org.apache.bcel.generic.BasicType,equals(1)
...
You can download the
source file BCELAnalyzer.java and its four generated output files to
get a concrete idea of what they look like. You can also run the tool over
it, as well as over any other Java source files.
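To make the *.woce layout concrete, here is a minimal Java sketch of a reader for it. It is not part of the tool; the names WoceParser and Block are hypothetical, and the sketch assumes that every sequence begins with a "caller:" line, as in the excerpt above. It treats a "caller:" line as the start of a new sequence even when the separating empty line is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a reader for the *.woce format described above.
final class WoceParser {
    /** One sequence: the caller name and the method calls it invokes. */
    static final class Block {
        final String caller;
        final List<String> calls = new ArrayList<>();

        Block(String caller) {
            this.caller = caller;
        }
    }

    static List<Block> parse(List<String> lines) {
        List<Block> blocks = new ArrayList<>();
        Block current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                current = null; // an empty line closes the current sequence
            } else if (line.startsWith("caller:")) {
                current = new Block(line.substring("caller:".length()).trim());
                blocks.add(current);
            } else if (current != null) {
                current.calls.add(line); // e.g. "Class,equals(1)"
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Block> blocks = parse(Arrays.asList(
                "caller: prepareMethodMap(1)",
                "Class,getDeclaredMethods",
                "org.apache.bcel.Repository,lookupClass(1)",
                "",
                "caller: findAndAddBCELMethod(2)",
                "org.apache.bcel.classfile.Method,getName"));
        for (Block block : blocks) {
            System.out.println(block.caller + " -> " + block.calls);
        }
    }
}
```

Only the calls list of each block would go into a sequence database; the caller name is kept separately because the "caller:" line is to be ignored on import.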
Some development notes for the source code analyzer can be found here.
Usage pattern miner (BIDE): Prepared by Jianyong Wang, Email: jianyong@tsinghua.edu.cn
(BIDE is related to, but not included in, UIUC illimine;
it is described in this
ICDE 04 paper)
-
Installation: put the BIDE executable in a
directory that is listed in the system path environment variable.
-
Inputs: 1st argument: the specification file of the dataset
2nd argument: the relative support, as a decimal
Usage example: bide_with_output.exe mcseq.spec 0.5
where bide_with_output.exe is the executable file name, mcseq.spec is the
specification file of the sequence dataset being mined, and 0.5 is the
relative support.
Specification file format: The first line is the dataset file
name, the second line is the number of unique items, the third line is the
number of sequences, the fourth line is the maximal length of a sequence,
and the fifth line is the average length of a sequence.
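The five lines of a specification file can be derived from the sequences themselves, as the following Java sketch shows. It is a hypothetical helper (SpecFile is not part of the tool), and it makes two assumptions that this page does not settle: that "number of unique items" means the count of distinct item IDs, and that the average length may be written as a decimal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: computes the five specification-file lines.
final class SpecFile {
    static List<String> specLines(String datasetFileName, List<int[]> sequences) {
        Set<Integer> uniqueItems = new HashSet<>();
        int maxLength = 0;
        long totalLength = 0;
        for (int[] sequence : sequences) {
            for (int item : sequence) {
                uniqueItems.add(item);
            }
            maxLength = Math.max(maxLength, sequence.length);
            totalLength += sequence.length;
        }
        // Assumption: the average is written as a decimal value.
        double average = sequences.isEmpty()
                ? 0.0 : (double) totalLength / sequences.size();
        return Arrays.asList(
                datasetFileName,                    // line 1: dataset file name
                String.valueOf(uniqueItems.size()), // line 2: unique items
                String.valueOf(sequences.size()),   // line 3: number of sequences
                String.valueOf(maxLength),          // line 4: maximal length
                String.valueOf(average));           // line 5: average length
    }

    public static void main(String[] args) {
        List<int[]> sequences = Arrays.asList(
                new int[] {0, 1, 2, 3, 4, 5},
                new int[] {1, 2});
        for (String line : specLines("mcseq.txt", sequences)) {
            System.out.println(line);
        }
    }
}
```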
Dataset file format: Usually a sequence database consists of a series
of sequences (strictly speaking, here a sequence is a string in the current
implementation). Each line represents a sequence and ends with -1,
and the entire dataset ends with -2. Here is a sample sequence:
38 81 256 399 756 841 962 1009 -1
Example datasets: mcseq.spec and mcseq.txt
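A dataset file in this format can be read token by token, as in the Java sketch below. SequenceDatasetReader is a hypothetical name, not a class of the tool. Because the page does not say whether the closing -2 sits on its own line or at the end of the last sequence line, the sketch ignores line breaks and relies only on the -1 and -2 markers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reads a BIDE-style dataset into lists of item IDs.
final class SequenceDatasetReader {
    static List<List<Integer>> parse(String content) {
        List<List<Integer>> sequences = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            int value = Integer.parseInt(token);
            if (value == -2) {
                break; // end of the dataset
            }
            if (value == -1) {
                sequences.add(current); // end of one sequence
                current = new ArrayList<>();
            } else {
                current.add(value);
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        String content = "38 81 256 399 756 841 962 1009 -1\n0 1 2 3 4 5 -1\n-2";
        System.out.println(parse(content));
    }
}
```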
-
Output: The discovered frequent sequences are written to
a file called “frequent.dat”.
Each line in the result file contains a frequent
sequence in the form:
event1 event2 … eventn : absolute support
Here is an example:
6 24 748 : 66
Example output: frequent.dat
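A line of frequent.dat in the form shown above can be split into its events and its support as in the following Java sketch. FrequentPattern is a hypothetical name used only for illustration; the sketch assumes that events are integer IDs separated by whitespace and that the last ":" on the line precedes the absolute support.

```java
// Hypothetical sketch: parses one line of frequent.dat.
final class FrequentPattern {
    final int[] events;
    final int support;

    FrequentPattern(int[] events, int support) {
        this.events = events;
        this.support = support;
    }

    static FrequentPattern parse(String line) {
        int colon = line.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in line: " + line);
        }
        // Events come before the colon, the absolute support after it.
        String[] tokens = line.substring(0, colon).trim().split("\\s+");
        int[] events = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            events[i] = Integer.parseInt(tokens[i]);
        }
        int support = Integer.parseInt(line.substring(colon + 1).trim());
        return new FrequentPattern(events, support);
    }

    public static void main(String[] args) {
        FrequentPattern pattern = parse("6 24 748 : 66");
        System.out.println(pattern.events.length + " events, support " + pattern.support);
        // prints "3 events, support 66"
    }
}
```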
- Frequent sequence postprocessor (a class included in the source code analyzer):
java net.sourceforge.pmd.rules.MethodCallsPostprocessor
DirectoryOfFrequentDat\frequent.dat
This produces a human-readable form of frequent.dat: frequent.data.txt
It also produces a file, exampletrace.txt, that contains the frequent
patterns that start with the same method call (note that so far we output
only the first set of frequent patterns that share the same starting method
call). This file will be fed to kBehavior.
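The grouping step described above can be pictured with the Java sketch below. It is not the code of MethodCallsPostprocessor: PatternGrouper is a hypothetical name, patterns are plain arrays of method IDs, and "the first set" is assumed to mean the group whose starting method call appears first in frequent.dat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups frequent patterns by their first method call.
final class PatternGrouper {
    static Map<Integer, List<int[]>> groupByFirstCall(List<int[]> patterns) {
        // LinkedHashMap keeps the groups in order of first appearance.
        Map<Integer, List<int[]>> groups = new LinkedHashMap<>();
        for (int[] pattern : patterns) {
            if (pattern.length == 0) {
                continue; // nothing to group on
            }
            groups.computeIfAbsent(pattern[0], key -> new ArrayList<>()).add(pattern);
        }
        return groups;
    }

    /** Assumption: "the first set" is the first group in input order. */
    static List<int[]> firstGroup(List<int[]> patterns) {
        Map<Integer, List<int[]>> groups = groupByFirstCall(patterns);
        if (groups.isEmpty()) {
            return new ArrayList<int[]>();
        }
        return groups.values().iterator().next();
    }

    public static void main(String[] args) {
        List<int[]> patterns = Arrays.asList(
                new int[] {6, 24, 748},
                new int[] {3, 5},
                new int[] {6, 24});
        for (int[] pattern : firstGroup(patterns)) {
            System.out.println(Arrays.toString(pattern));
        }
    }
}
```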
Subjects: