JYuan Learning Log: Notes on Apache UIMA

http://uima.apache.org/d/uimaj-2.6.0/references.html

The feature's rangeTypeName specifies the type of value that the feature can take. This may be the name of any type defined in your type system, or one of the predefined types. All of the predefined types have names that are prefixed with uima.cas or uima.tcas, for example:

uima.cas.TOP 
uima.cas.String
uima.cas.Long 
uima.cas.FSArray
uima.cas.StringList
uima.tcas.Annotation

For a complete list of predefined types, see the CAS API documentation.

The elementType of a feature is optional, and applies only when the rangeTypeName is uima.cas.FSArray or uima.cas.FSList. The elementType specifies what type of value can be assigned as an element of the array or list. This must be the name of a non-primitive type. If omitted, it defaults to uima.cas.TOP, meaning that any FeatureStructure can be assigned as an element of the array or list. Note: depending on the CAS Interface that you use in your code, this constraint may or may not be enforced. Note: At run time, the elementType is available from a runtime Feature object (using the a_feature_object.getRange().getComponentType() method) only when specified for uima.cas.FSArray ranges; it isn't available for uima.cas.FSList ranges.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.uima/textmarker-core/2.0.0/org/apache/uima/textmarker/engine/InternalTypeSystem.xml

<featureDescription>
  <name>rules</name>
  <description/>
  <rangeTypeName>uima.cas.FSArray</rangeTypeName>
  <elementType>org.apache.uima.textmarker.type.DebugRuleMatch</elementType>
</featureDescription>

Sofa: Subject of Analysis

Check C:\Users\usera\uima.log
Running The UIMA Analysis Example
UIMA PEAR Installer User's Guide
runPearInstaller.bat then cvd.bat
To launch the PEAR Installer, use the script in the UIMA bin directory: runPearInstaller.bat or runPearInstaller.sh.
http://uima.apache.org/downloads/sandbox/simpleServerUserGuide/simpleServerUserGuide.html
https://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html
An Analysis Engine (AE) is a program that analyzes artifacts (e.g. documents) and infers information from them.
An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE). Primitive and aggregate AEs implement the same interface and can be used interchangeably by applications.

Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs. An annotation is a particular type of Feature Structure that is attached to a region of the artifact being analyzed (a span of text in a document, for example).

It is also possible for annotators to record information associated with the entire document rather than a particular span (these are considered Feature Structures but not Annotations).
All feature structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS). The CAS is the central data structure through which all UIMA components communicate.

Defining Types
The first step in developing an annotator is to define the CAS Feature Structure types that it creates. This is done in an XML file called a Type System Descriptor. UIMA defines basic primitive types as well as Arrays of these primitive types. UIMA also defines the built-in types TOP, which is the root of the type system, analogous to Object in Java; FSArray, which is an array of Feature Structures (i.e. an array of instances of TOP); and Annotation.

The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into. The sofa feature can be ignored for now since we assume in this tutorial that the CAS contains only one subject of analysis (document).
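To make the begin/end/sofa idea concrete, here is a minimal plain-Java sketch of a feature structure as a type name plus (attribute, value) pairs, with an annotation as the special case that points at a span of text. This is not the UIMA API: SimpleFeatureStructure and SimpleAnnotation are hypothetical names, and the sofa feature here holds the document text directly, which is a simplification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a CAS feature structure: a type name plus features.
class SimpleFeatureStructure {
    final String typeName;
    final Map<String, Object> features = new LinkedHashMap<>();

    SimpleFeatureStructure(String typeName) {
        this.typeName = typeName;
    }
}

// Hypothetical stand-in for an annotation: a feature structure tied to a text span.
class SimpleAnnotation extends SimpleFeatureStructure {
    SimpleAnnotation(String typeName, String sofa, int begin, int end) {
        super(typeName);
        features.put("sofa", sofa);   // simplified: the document text itself
        features.put("begin", begin); // offset of the first character of the span
        features.put("end", end);     // offset just past the last character
    }

    int begin() { return (Integer) features.get("begin"); }

    int end() { return (Integer) features.get("end"); }

    // The text the annotation refers to, recovered from the offsets.
    String coveredText() {
        String text = (String) features.get("sofa");
        return text.substring(begin(), end());
    }
}
```

For example, new SimpleAnnotation("example.RoomNumber", "Meet in room 1S-K07 today", 13, 19).coveredText() gives back the room number substring.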

Developing Your Annotator Code
Annotator implementations all implement a standard interface (AnalysisComponent), having several methods, the most important of which are: initialize, process, and destroy.
There is a default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which has implementations of all required methods except for the process method.

Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend from this class, so they only have to implement the process method.

Finally, we call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations.
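The logic of a typical process method (find matches, record begin and end offsets, add each result to an index kept in document order) can be sketched without any UIMA classes. The sketch below uses only java.util.regex. SpanFinder and Span are hypothetical names, and the list sorted by begin offset is only a rough stand-in for the CAS annotation index, whose real ordering rules are more detailed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UIMA.
class SpanFinder {
    // One match: character offsets [begin, end) into the document text.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Runs every pattern over the text and returns all matches ordered by begin offset.
    static List<Span> findSpans(String docText, List<Pattern> patterns) {
        List<Span> index = new ArrayList<>();
        for (Pattern pattern : patterns) {
            Matcher matcher = pattern.matcher(docText);
            while (matcher.find()) {
                // In a real annotator this is where the annotation would be
                // created, given its offsets, and added to the indexes.
                index.add(new Span(matcher.start(), matcher.end()));
            }
        }
        index.sort(Comparator.comparingInt((Span s) -> s.begin));
        return index;
    }
}
```

A later consumer can then walk the returned list from the beginning of the document to the end, regardless of the order in which the patterns were applied.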

The UIMA architecture requires that descriptive information about an annotator be represented in an XML file and provided along with the annotator class file(s) to the UIMA framework at run time. This XML file is called an Analysis Engine Descriptor. The descriptor includes:
Name, description, version, and vendor
The annotator's inputs and outputs, defined in terms of the types in a Type System Descriptor
Declaration of the configuration parameters that the annotator accepts

Use Document Analyzer to Test Annotator
Accessing Parameter Values from the Annotator Code
@Override
public void initialize(UimaContext aContext) throws ResourceInitializationException {
    // Let the base class store the context so getContext() works later
    super.initialize(aContext);
    // Get configuration parameter values
    String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns");
    // Compile the regular expressions once, up front
    mPatterns = new Pattern[patternStrings.length];
    for (int i = 0; i < patternStrings.length; i++) {
        mPatterns[i] = Pattern.compile(patternStrings[i]);
    }
}
The UimaContext is the annotator's access point for all of the facilities provided by the UIMA framework, for example logging and external resource access.

Logging
getContext().getLogger().log(Level.FINEST,"Found: " + annotation);

1.3. Building Aggregate Analysis Engines
<flowConstraints>
  <fixedFlow>
    <node>RoomNumber</node>
    <node>DateTime</node>
  </fixedFlow>
</flowConstraints>
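A fixed flow like the one above simply runs the delegates in the listed order against the same shared analysis data. A rough plain-Java sketch of that idea, assuming each delegate is modeled as a Consumer over a shared results list (FixedFlowSketch is a hypothetical name, and the list is only a stand-in for the CAS):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration of fixed-flow ordering, not UIMA code.
class FixedFlowSketch {
    // Runs each named delegate in flow order; all delegates see the same results list.
    static List<String> run(List<String> flow, Map<String, Consumer<List<String>>> delegates) {
        List<String> sharedResults = new ArrayList<>();
        for (String node : flow) {
            Consumer<List<String>> delegate = delegates.get(node);
            if (delegate == null) {
                throw new IllegalArgumentException("Unknown delegate: " + node);
            }
            // Later delegates can read what earlier ones added.
            delegate.accept(sharedResults);
        }
        return sharedResults;
    }
}
```

With delegates registered under the keys RoomNumber and DateTime, the flow list decides which one sees the data first.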

http://uima.apache.org/doc-uima-pears.html
Generating PEAR files
Independent of how PEAR packages are generated, PEAR macros or PEAR variables should be recognized and used. The PEAR architecture defines various macros, but the most important one is the $main_root macro. When using this macro in the installation descriptor or within a UIMA descriptor, it will be substituted with the real PEAR package installation path to the main component root directory after the PEAR package is installed on the target system. For example, this macro can be used to specify the classpath settings for a PEAR component as shown in some of the examples below.
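The effect of the $main_root macro can be pictured as a plain text substitution done at install time. The sketch below is only an illustration of that idea, not the PEAR installer's actual code. PearMacroSketch, expandMainRoot, and the example paths are all hypothetical.

```java
// Hypothetical illustration of macro expansion, not the real PEAR installer.
class PearMacroSketch {
    // Replaces every literal occurrence of $main_root with the install directory.
    static String expandMainRoot(String descriptorText, String installDir) {
        // String.replace treats both arguments literally, so the '$' needs no escaping.
        return descriptorText.replace("$main_root", installDir);
    }

    public static void main(String[] args) {
        String classpath = "$main_root/bin;$main_root/lib/annotator.jar";
        System.out.println(expandMainRoot(classpath, "/opt/pears/RoomNumberAnnotator"));
    }
}
```

Because the expanded text contains the absolute install path, anything written this way is tied to the directory the package was installed into.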

http://uima.apache.org/doc-uima-annotator.html#Packaging the annotator
Use PEAR packager
Right-click on the RoomNumberAnnotator project and call "Generate PEAR file".

Aggregate PEAR
http://uima.apache.org/doc-uima-pears.html
During the installation, the package content is extracted and the internal PEAR settings (PEAR macros) are updated with the actual install information. This also means that an installed PEAR package cannot be moved to another directory without internal changes.
Running installed PEAR files

The PEAR package descriptor can also be added to an aggregate analysis engine descriptor as one of the delegates. Therefore, a PEAR can easily be integrated into an analysis chain. But note, the integrated PEAR is treated as a black box and the aggregate analysis engine cannot override any PEAR specific parameters or settings since the PEAR is executed in its own environment with a separate classloader. This also means that resources cannot be shared easily between PEARs. An advantage of this concept is that for example the PEAR specific JCAS classes do not affect the application in case of minor feature differences.

https://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs-uima-as/html/uima_async_scaleout/uima_async_scaleout.html
An AS service that is an Aggregate Analysis Engine where the Delegates are also AS components.

http://sujitpal.blogspot.com/2011/12/uima-annotator-to-identify-chemical.html
http://uima.apache.org/annotators
http://metamap.nlm.nih.gov/Docs/README_uima.html
http://mmtx.nlm.nih.gov/

OpenNLP
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
opennlp TokenizerTrainer.conllx help
opennlp TokenizerTrainer.conllx -model en-pos.bin ...
OpenNLP Models
http://opennlp.sourceforge.net/models-1.5/
Sentence Detection
opennlp SentenceDetector E:\jeffery\src\apache\opennlp\models-1.5\en-sent.bin
// Needs java.io.FileInputStream, java.io.InputStream, and opennlp.tools.sentdetect.*
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    String[] sentences = sentenceDetector.sentDetect(" First sentence. Second sentence. ");
}

Training Tool
The data must be converted to the OpenNLP Sentence Detector training format, which is one sentence per line. An empty line indicates a document boundary. If the document boundary is unknown, it is recommended to have an empty line every few tens of sentences.
$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8
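The training format described above (one sentence per line, an empty line as a document boundary) is easy to check with a few lines of plain Java before running the trainer. This is a sketch for inspecting your own data, not OpenNLP code. SentenceTrainingFormat is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting training data, not part of OpenNLP.
class SentenceTrainingFormat {
    // Splits training text into documents; each document is a list of sentences.
    static List<List<String>> parse(String trainingText) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : trainingText.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // An empty line closes the current document, if it has any sentences.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line.trim());
            }
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }
}
```

Counting the documents and the sentences per document this way is a quick sanity check that the boundaries in the training file are where you expect them.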

Tokenization
opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt

http://opennlp.apache.org/documentation/manual/opennlp.html#org.apche.opennlp.uima
Go to the apache-opennlp/opennlp folder. Type "mvn install" to build everything.
ant -f createPear.xml

UIMA Annotator
AggregateSentenceAE
WhitespaceTokenizer
HMMTagger

http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tools/tools.html#ugr.tools.pear.installer
runPearInstaller.bat
If no installation directory is specified, the PEAR file is installed to the current working directory.
CAS Visual Debugger (CVD) application.

https://uima.apache.org/doc-uima-annotator.html
Testing the annotator
Open the Eclipse "Run dialog"
Expand "Java Application" in the left window and choose "UIMA CAS Visual Debugger". Now select the "Classpath" tab on the right.
Select the "User Entries" in the classpath tab and press the "Add Projects..." button.
Mark the "RoomNumberAnnotator" project in the upcoming dialog and finish with "OK".
Choose "Run -> Load AE" and select the RoomNumberAnnotatorDescriptor.xml file in the desc folder of your Eclipse project.
Copy and paste the text below for testing into the text section of the CVD.

http://uima.apache.org/d/uimaj-2.3.1/tools.html#ugr.tools.pear.installer
PEAR (Processing Engine ARchive) is a new standard for packaging UIMA compliant components. This standard defines several service elements that should be included in the archive package to enable automated installation of the encapsulated UIMA component. The major PEAR service element is an XML Installation Descriptor that specifies installation platform, component attributes, custom installation procedures and environment variables.



Monday, March 10, 2014
