Monday, March 31, 2014

Notes on jQuery

Change Width, Height
$('#iframe').css('width',$('#Blog1').css('width'));
$('#iframe').width($('#Blog1').width()-30);
var color = $( this ).css( "background-color" );

$( this ).css( "color", "red" );

Check/Set checkbox
$(selector).is(':checked')
$("#ans").prop('checked')  // since jQuery 1.6, use .prop() rather than .attr() for the checked state


$(".ibtn").prop("checked", true);

screen.width
http://stackoverflow.com/questions/1091540/how-to-get-screen-resolution-of-visitor-in-javascript-and-or-php
var browserWidth = $(window).width();
var browserHeight = $(window).height();

$('.column-left-inner').remove();

$('.column-center-outer').css('margin-left', $('.column-center-outer').position().left*-1)

Sunday, March 30, 2014

JavaScript Tips

Google Hosted JS libraries:

https://developers.google.com/speed/libraries/devguide
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>

StartsWith
http://stackoverflow.com/questions/1767246/javascript-check-if-string-begins-with-something
You can use string.match() and a regular expression for this too:
if(pathname.match(/^\/sub\/1/)) { // you need to escape the slashes
string.match() will return the matching string if found, otherwise null.

if (pathname.substring(0, 6) == "/sub/1") {

http://stackoverflow.com/questions/957537/how-can-i-print-a-javascript-object

Use JSON.stringify(obj); Also this method works with nested objects.

How to get querystring value using jQuery
http://www.jquerybyexample.net/2012/05/how-to-get-querystring-value-using.html
function GetQueryStringParams(sParam)
{
    var sPageURL = window.location.search.substring(1);
    var sURLVariables = sPageURL.split('&');
    for (var i = 0; i < sURLVariables.length; i++) 
    {
        var sParameterName = sURLVariables[i].split('=');
        if (sParameterName[0] == sParam) 
        {
            return sParameterName[1];
        }
    }
}

http://www.designchemical.com/blog/index.php/jquery/8-useful-jquery-snippets-for-urls-querystrings/
Get The Current Page URL
var url = document.URL;
Get The Current Root URL
var root = location.protocol + '//' + location.host;
var param = document.URL.split('#')[1];

Scroll Windows Screen
window.scrollBy(0,500);

Get Random Position In an Array
var randomIdx = Math.floor(Math.random()*iframeUrls.length);

Catch error if iframe src fails to load
http://stackoverflow.com/questions/15273042/catch-error-if-iframe-src-fails-to-load-error-refused-to-display-http-ww
var iframeError;
function change() {
    var url = $("#addr").val();
    $("#browse").attr("src", url);
    iframeError = setTimeout("error()", 5000);
}

function load(e) {
    alert(e);
}

function error() {
    alert('error');
}

$(document).ready(function () {
    $('#browse').on('load', (function () {
        load('ok');
        clearTimeout(iframeError);
    }));


});

Using IFrame

maximize

Dicts IFrame



Friday, March 28, 2014

Maven Tips

Maven 3 Eclipse plugin setup
http://www.itcuties.com/tools/maven-3-eclipse-plugin-setup/
set up maven 3 eclipse plugin to use external maven installation 
Navigate to Window->Preferences->Maven->Installations
• Click Add
• Point the external maven installation folder
• Click Apply

Notes on Java Web Development

Template
the ultimate view — Tiles-3
http://tech.finn.no/2012/07/25/the-ultimate-view-tiles-3/

Notes on using the Tiles framework (in Chinese)
http://blog.csdn.net/aj1031689/article/details/13769151

Spring MVC with Sitemesh 3
http://codesilo.wordpress.com/2013/07/11/spring-mvc-with-sitemesh-3/

SiteMesh: a web page layout and decoration framework better than Apache Tiles (in Chinese)
http://www.cnblogs.com/felixjia/p/3496558.html

Google App Engine Tips

https://developers.google.com/appengine/docs/java/tools/maven
mvn archetype:generate -DarchetypeGroupId=com.google.appengine.archetypes  -DarchetypeArtifactId=appengine-skeleton-archetype  -DarchetypeVersion=1.8.7  -DgroupId=com.mycompany.myapp  -DartifactId=myapp -Dversion=BETA2  -Dpackage=com.mycompany 

https://developers.google.com/appengine/docs/java/tools/maven
Managing and Running a Project with the App Engine Maven Plugin
mvn clean install && mvn -pl myproject-ear appengine:devserver
<jvmFlags>
  <jvmFlag>-Xdebug</jvmFlag>
  <jvmFlag>-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n</jvmFlag>
</jvmFlags>
appengine-skeleton-archetype

http://repo2.maven.org/maven2/com/google/appengine/archetypes/appengine-skeleton-archetype/

https://developers.google.com/appengine/docs/java/tools/uploadinganapp
./appengine-java-sdk/bin/appcfg.sh update myapp/war

Thursday, March 27, 2014

Java REPL: on-the-fly Java console

Java REPL: on-the-fly evaluation of Java statements through the web
http://www.javarepl.com/console.html

Notes on Information Retrieval

TF-IDF
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
http://www.tfidf.com/
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: 

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following: 

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
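The two formulas above can be put together in a small sketch (class and method names here are my own, not from any library; a real system would tokenize and index documents first):

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    // TF(t) = (occurrences of t in the document) / (total terms in the document)
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // IDF(t) = ln(total documents / documents containing t)
    static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        List<String> d2 = Arrays.asList("the", "dog", "barked");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        // "the" appears in every document, so its IDF is ln(2/2) = 0
        System.out.println(tfIdf(d1, corpus, "the")); // 0.0
        // "cat" appears only in d1: TF = 1/6, IDF = ln(2)
        System.out.println(tfIdf(d1, corpus, "cat"));
    }
}
```

Note how a word like "the" gets a tf-idf of zero because it appears in every document, which is exactly the stop-word damping effect described above.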

Notes on Math

Logarithmic Function
http://www.mathsisfun.com/sets/function-logarithmic.html
f(x) = log_a(x)
log_a(x) is the inverse function of a^x (the exponential function)

The Natural Logarithm Function
f(x) = log_e(x)
f(x) = ln(x)
where e is Euler's Number = 2.718281828459 (and more ...)

confusion matrix
[Computing] interference matrix, confusion matrix
http://en.wikipedia.org/wiki/Confusion_matrix
http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html

In the field of machine learning, a confusion matrix, also known as a contingency table or an error matrix [1] , is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). 
The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).

                 Predicted class
                 Cat   Dog   Rabbit
Actual   Cat       5     3      0
class    Dog       2     3      1
         Rabbit    0     2     11
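From a confusion matrix like this, overall accuracy is the diagonal sum divided by the total count. A minimal sketch (my own illustration, using the cat/dog/rabbit numbers above):

```java
public class ConfusionMatrixDemo {
    // accuracy = sum of the diagonal (correct predictions) / sum of all cells
    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // rows = actual class, columns = predicted class: cat, dog, rabbit
        int[][] m = {
            {5, 3, 0},
            {2, 3, 1},
            {0, 2, 11}
        };
        System.out.println(accuracy(m)); // (5 + 3 + 11) / 27 ≈ 0.7037
    }
}
```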

Wednesday, March 26, 2014

Nutch Tips

In Nutch plugin, read value from underlying data store. 
static {
  FIELDS.add(WebPage.Field.CONTENT);
  FIELDS.add(WebPage.Field.BASE_URL);
}

Java Tips

With Guava you can use Lists.newArrayList(Iterable) or Sets.newHashSet(Iterable), among other similar methods.

http://dayg.wordpress.com/2008/03/28/time-unit-conversion-in-java/
long oneMinute = TimeUnit.SECONDS.toMillis(60);
TimeUnit.DAYS.sleep(45);
scheduledExcecutor.schedule(runMe, 10, TimeUnit.SECONDS);
return TimeUnit.MILLISECONDS.toDays(numberMilliseconds);
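A quick runnable check of the conversions above (the values are my own examples):

```java
import java.util.concurrent.TimeUnit;

public class TimeUnitDemo {
    public static void main(String[] args) {
        long oneMinute = TimeUnit.SECONDS.toMillis(60);
        System.out.println(oneMinute); // 60000

        // converting down truncates: 36 hours of milliseconds is still 1 day
        long days = TimeUnit.MILLISECONDS.toDays(36L * 60 * 60 * 1000);
        System.out.println(days); // 1
    }
}
```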

How To Convert String To InputStream In Java
http://www.mkyong.com/java/how-to-convert-string-to-inputstream-in-java/
InputStream is = new ByteArrayInputStream(str.getBytes());

BufferedReader br = new BufferedReader(new InputStreamReader(is));
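Round-tripping the conversion shows the reader gets the original text back. A minimal sketch (the helper name is mine; an explicit charset avoids platform-default surprises):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StringToStream {
    // Wraps the string's bytes in an in-memory stream; no file involved.
    static String roundTrip(String str) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello stream")); // hello stream
    }
}
```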

NLP UIMA Tools

U-Compare UIMA Component Repository
http://u-compare.org/components/

http://www.julielab.de/Resources/Software/NLP_Tools.html
JULIE Lab Acronym Annotator

Saturday, March 22, 2014

UIMA Web Resource

Salmon Run
http://sujitpal.blogspot.com/2011/06/uima-analysis-engine-for-keyword.html

Tuesday, March 18, 2014

Notes on Java Common Code

Convert ArrayList To Arrays In Java
http://viralpatel.net/blogs/convert-arraylist-to-arrays-in-java/
String [] countries = list.toArray(new String[list.size()]);

Convert Array to ArrayList
String[] countries = {"India", "Switzerland", "Italy", "France"};
List list = Arrays.asList(countries);
System.out.println("ArrayList of Countries:" + list);
The above code will work, but the list object is fixed-size, so you will not be able to add new values to it. If you try to add a new value to the list, it will throw UnsupportedOperationException.

If you want to create a mutable list:
List list = new ArrayList(Arrays.asList(countries));
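The fixed-size/mutable difference can be demonstrated directly (a minimal sketch of my own):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListConversionDemo {
    public static void main(String[] args) {
        String[] countries = {"India", "Switzerland", "Italy", "France"};

        // Arrays.asList returns a fixed-size view backed by the array
        List<String> fixed = Arrays.asList(countries);
        try {
            fixed.add("Spain"); // not allowed on the fixed-size view
        } catch (UnsupportedOperationException e) {
            System.out.println("add on Arrays.asList failed as expected");
        }

        // Copying into an ArrayList yields a fully mutable list
        List<String> mutable = new ArrayList<>(Arrays.asList(countries));
        mutable.add("Spain");
        System.out.println(mutable.size()); // 5
    }
}
```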

Java DOM getNodeValue() and getTextContent()
http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
http://stackoverflow.com/questions/5527195/java-dom-gettextcontent-issue
The Node#getTextContent method returns the text content of the current node and its descendants. Use node.getFirstChild().getNodeValue(), which returns the text content of just your node and not its descendants.

The javadoc for Node defines getNodeValue() to return null for Nodes of type Element.
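The difference shows up on a small parsed fragment (a minimal sketch using the JDK's built-in DOM parser; the helper name is mine):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomTextDemo {
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<a>hello<b>world</b></a>");
        // getTextContent concatenates the node's and all descendants' text
        System.out.println(root.getTextContent());               // helloworld
        // getFirstChild() is the leading text node; its value is just "hello"
        System.out.println(root.getFirstChild().getNodeValue()); // hello
        // getNodeValue on an Element is defined to return null
        System.out.println(root.getNodeValue());                 // null
    }
}
```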

Monday, March 17, 2014

Notes on Hadoop

http://wiki.apache.org/hadoop/GettingStartedWithHadoop
Data Path Settings - Figure out where your data goes. This includes settings for where the namenode stores the namespace checkpoint and the edits log, where the datanodes store filesystem blocks, storage locations for Map/Reduce intermediate output and temporary storage for the HDFS client. The default values for these paths point to various locations in /tmp. While this might be ok for a single node installation, for larger clusters storing data in /tmp is not an option. These settings must also be in hadoop-site.xml. It is important for these settings to be present in hadoop-site.xml because they can otherwise be overridden by client configuration settings in Map/Reduce jobs. 
dfs.name.dir
dfs.data.dir
dfs.client.buffer.dir
mapred.local.dir
An example of a hadoop-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://localhost:54311</value>
</property>
<property> 
  <name>dfs.replication</name>
  <value>8</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
</configuration>

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
sudo apt-get install ssh
sudo apt-get install rsync

Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
Try the following command:
bin/hadoop

Standalone Operation
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*

Notes on Apache OpenNLP

http://www.programcreek.com/2012/05/opennlp-tutorial/

http://www.nuxeo.com/blog/development/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing/

Data Categorization using OpenNLP
http://hanishblogger.blogspot.com/2013/07/data-categorization-using-opennlp.html
http://codego.net/335719/
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training.tool
During the learning phase a set of pre-labeled data is used to train the model, and during the testing phase another part of the pre-labeled data is used to test the model. Finally, the trained model can be used to classify any data.

OpenNLP Models
http://opennlp.sourceforge.net/models-1.5/
http://www.nuxeo.com/blog/development/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing/

how to train a classifier in opennlp?
For my part, I use the MaxEnt package directly to perform training
and predictions on tasks other than PosTagging, NER, etc.

Some informative links :
(1) http://maxent.sourceforge.net/about.html

(2) http://maxent.sourceforge.net/api/index.html

Http Error Codes

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

HTTP Error 408 Request timeout
http://www.checkupdown.com/status/E408.html

Solr Code Miscs

org.apache.solr.common.SolrInputDocument.deepCopy()
public SolrInputDocument deepCopy() {
  SolrInputDocument clone = new SolrInputDocument();
  Set<Entry<String,SolrInputField>> entries = _fields.entrySet();
  for (Map.Entry<String,SolrInputField> fieldEntry : entries) {
    clone._fields.put(fieldEntry.getKey(), fieldEntry.getValue().deepCopy());
  }
  clone._documentBoost = _documentBoost;
  return clone;
}

org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(AddUpdateCommand)
org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(AddUpdateCommand, long) (excerpt)
{
  SolrInputDocument sdoc = cmd.getSolrInputDocument();
  BytesRef id = cmd.getIndexedId();
  SolrInputDocument oldDoc = RealTimeGetComponent.getInputDocument(cmd.getReq().getCore(), id);
  for (SolrInputField sif : sdoc.values()) {
    // (per-field extraction of "key" and "fieldVal" elided in this excerpt)
    if ("add".equals(key)) {
      updateField = true;
      oldDoc.addField(sif.getName(), fieldVal, sif.getBoost());
    } else if ("set".equals(key)) {
      updateField = true;
      oldDoc.setField(sif.getName(), fieldVal, sif.getBoost());
    } else if ("inc".equals(key)) {
      // (increment handling elided)
    } else {
      // normal fields are treated as a "set"
      oldDoc.put(sif.getName(), sif);
    }
  }
}
if (searcher == null) {
searcherHolder = core.getRealtimeSearcher();
searcher = searcherHolder.get();
}
int docid = searcher.getFirstMatch(new Term(idField.getName(), idBytes));
org.apache.solr.handler.component.RealTimeGetComponent.toSolrInputDocument(Document, IndexSchema)

org.apache.solr.util.SolrPluginUtils.docListToSolrDocumentList(DocList, SolrIndexSearcher, Set<String>, Map<SolrDocument, Integer>)

Solr Miscs

Atomic Updates
https://wiki.apache.org/solr/Atomic_Updates
<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office" update="set">Walla Walla</field>
    <field name="skills" update="add">Python</field>
  </doc>
</add>
update = "add" | "set" | "inc"
boost = <float> — default is 1.0.

https://wiki.apache.org/solr/UpdateXmlMessages
"delete" documents by ID and by Query
<delete><id>05991</id></delete>
<delete><query>office:Bridgewater</query></delete>
<delete>
  <id>05991</id><id>06000</id>
  <query>office:Bridgewater</query>
  <query>office:Osaka</query>

</delete>

Thursday, March 13, 2014

Notes on Apache UIMA Asynchronous Scaleout

Getting Started: Apache UIMA Asynchronous Scaleout
http://uima.apache.org/doc-uimaas-what.html
the UIMA AS Deployment Descriptor only specifies error handling and scalability options; component aggregation is done using the standard aggregate descriptor. The UIMA Component Descriptor Editor (CDE) has been enhanced to support the UIMA AS Deployment Descriptor.

The shared queue in front of each UIMA AS service is implemented using an Apache ActiveMQ broker. A separate reply queue is created for each client, and every request contains the address of the client's unique reply queue.

http://uima.apache.org/d/uima-as-2.4.2/uima_async_scaleout.html

bin/startBroker.sh/bat: starts the ActiveMQ broker, which must be running before
UIMA AS services can be deployed.
bin/deployAsyncService.sh/bat: deploys an AnalysisEngine as a UIMA-AS
service.  Takes one or more UIMA-AS Deployment Descriptors as arguments.
bin/runRemoteAsyncAE.sh/bat: Calls a UIMA-AS service. Takes arguments specifying the
location of the service, and an optional CollectionReader descriptor file used to
obtain the CASes to be processed by the service.

* Set UIMA_HOME to the apache-uima-as directory
* Append UIMA_HOME/bin to your PATH
* Run the script UIMA_HOME/bin/adjustExamplePaths.bat (or .sh).

startBroker.bat
INFO TransportServerThreadSupport - Listening for connections at: tcp://yourHostname:61616
Examples can be found in the examples/deploy/as directory,
and the syntax is documented in docs/d/uima_async_scaleout.pdf.
deployAsyncService.cmd [testDD.xml] [-brokerURL tcp://localhost:61616]
If you want to use a different
  version of ActiveMQ, set the ACTIVEMQ_HOME environment variable to the location of 
  ActiveMQ you intend to use. 

deployAsyncService.cmd uima-as-2.4.2-bin\examples\deploy\as\Deploy_RoomNumberAnnotator.xml -brokerURL tcp://localhost:61616

getMetaData.cmd tcp://localhost:61616 RoomNumberAnnotatorQueue -verbose

Error deploying pear on AS 2.4.2
http://permalink.gmane.org/gmane.comp.apache.uima.general/5460
https://www.mail-archive.com/user@uima.apache.org/msg02621.html
A pear is a packed UIMA analysis engine, or AE. UIMA-AS deploys services
that contain AEs. The command deployAsyncService requires a UIMA-AS
Deployment Descriptor.
Using the OpenNLP pear described in the DUCC sample app, this would
be a UIMA-AS deployment descriptor, Deploy_OpenNLP.xml

<?xml version="1.0" encoding="UTF-8"?>

<analysisEngineDeploymentDescription
  xmlns="http://uima.apache.org/resourceSpecifier">

  <name>OpenNLP Text Analyzer</name>
  <description>Deploys OpenNLP text analyzer.</description>

  <deployment protocol="jms" provider="activemq">
    <service>
      <inputQueue endpoint="OpenNLP-service"
brokerURL="${defaultBrokerURL}"/>
      <topDescriptor>
       <import location="opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
      </topDescriptor>
    </service>
  </deployment>

</analysisEngineDeploymentDescription>

This descriptor assumes it is in the same directory as the pear descriptor
in the import statement.
Then, assuming the service is to be deployed from the same directory, that
UIMA_HOME and PATH have been updated for the UIMA-AS SDK, and a JMS broker
started with startBroker.sh, the command would be:

UIMA_CLASSPATH=`pwd`/lib deployAsyncService.sh Deploy_OpenNLP.xml


Deploy_MeetingFinder.xml
<topDescriptor>
  <import location="MeetingFinderAggregate.xml"/>
</topDescriptor>

Wednesday, March 12, 2014

Notes on Visual Studio

Visual Studio 2010 Format C++ Code: Ctrl+A, Ctrl+K, Ctrl+F
http://stackoverflow.com/questions/8812741/ctrlk-ctrld-not-available-in-visual-studio-2010-working-on-c-project

Visual Studio: Re-enable “Build failed, run last success?”
http://stackoverflow.com/questions/2925125/visual-studio-re-enable-build-failed-run-last-success-dialogue-box
On the menu bar go to 'Tools' --> 'Options'. There, go to 'Projects and Solutions' --> 'Build and Run'. You will find a combobox under the label 'On Run, when build or deployment errors occur'. Select Prompt or Do not launch.

Notes on C++

http://stackoverflow.com/questions/8483472/include-skipped-when-looking-for-precompiled-header-use-unexpected-end-of-fi
Everything before the PCH include is ignored by the compiler, therefore the PCH include must come first.
If you use precompiled headers, you have to put the include at the TOP, like this:
#include "stdafx.h"
#include "String.h"
#include <iostream>
using namespace std;
#include <string.h>
You are using the default MSVC project, which enables precompiled headers. I would recommend selecting the "Don't use precompiled headers" option when you create a project.

http://stackoverflow.com/questions/12774207/fastest-way-to-check-if-a-file-exist-using-standard-c-c11-c
inline bool is_file_exist(const char *fileName)
{
    std::ifstream infile(fileName);
    return infile.good();
}
The good method checks if the stream is ready to be read from.
This way you not only check if it exists & is readable, you actually open it.
inline bool exist(const std::string& name)
{
    std::ifstream file(name);
    if (!file)           // the stream evaluates to false if the file was not found
        return false;
    else                 // if the file was found, the stream evaluates to true
        return true;
}

wstring.c_str()
std::wstring wc( cSize, L'#' );
mbstowcs( &wc[0], c, cSize );

How to: Convert Between Various String Types
http://msdn.microsoft.com/en-us/library/ms235631.aspx
char *orig = "Hello, World!";
size_t newsize = strlen(orig) + 1;
wchar_t * wcstring = new wchar_t[newsize];
// Convert char* string to a wchar_t* string.
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, orig, _TRUNCATE);

C++ String Literals
http://msdn.microsoft.com/en-us/library/69ze775t.aspx
Narrow string literals, represented as "xxx".
Wide string literals, represented as L"xxx".
Raw string literals, represented as R"ddd(xxx) ddd", where ddd is a delimiter. Raw string literals may be either narrow (represented with R) or wide (represented with LR).
const char *narrow = "abcd";
const wchar_t* wide = L"zyxw";

Input/output with files
http://www.cplusplus.com/doc/tutorial/files/
ofstream: Stream class to write on files
ifstream: Stream class to read from files
fstream: Stream class to both read and write from/to files.
ofstream myfile;
myfile.open ("example.txt");
myfile << "Writing this to a file.\n";
myfile.close();

How can I print wchar_t values to the console?
http://stackoverflow.com/questions/2493785/how-i-can-print-the-wchar-t-values-to-console
Use std::wcout instead of std::cout.
std::cout << "ASCII and ANSI" << std::endl;
std::wcout << L"INSERT MULTIBYTE WCHAR* HERE" << std::endl;
endl
Inserts a newline character into the output sequence and flushes the output buffer.

Determines whether a path to a file system object such as a file or folder is valid.
http://msdn.microsoft.com/en-us/library/windows/desktop/bb773584(v=vs.85).aspx
#include <Shlwapi.h>
Add the library Shlwapi.lib by right-clicking the project: "Properties" -> "Configuration Properties" -> "Linker" -> "Input" -> "Additional Dependencies" -> "Edit" -> Shlwapi.lib
BOOL exist = PathFileExists(fileNameBuffer);

If you try to create a file (that does not previously exist) for both writing and reading using code like fstream file("new_file", ios::out | ios::in), it will not work.
You have to create the file for writing only, fstream file("new_file", ios::out), then close it with file.close(). Now you can open it again for whatever operations you want, for example file.open("new_file", ios::out | ios::in | ios::binary), and it will work.

ifstream::open not working in Visual Studio debug mode
Visual Studio sets the working directory to YourProjectDirectory\Debug\Bin when running in debug mode. If your text file is in YourProjectDirectory, you need to account for that difference.

Tuesday, March 11, 2014

Notes on Java AbstractQueuedSynchronizer

http://javaopensourcecode.blogspot.com/2012/10/abstractqueuedsynchronizer-aqs.html
http://gee.cs.oswego.edu/dl/papers/aqs.pdf
The basic algorithm for acquire is: try to acquire; if successful, return; otherwise enqueue the thread (if it is not already queued) and block it. Similarly, the basic algorithm for release is: try to release; if successful, unblock the first thread in the queue; otherwise simply return.
The wait status of the head node is set to SIGNAL so that when the owner thread releases the lock, it can signal the head node's successor to acquire the lock.
http://ifeve.com/aqs-1/
http://ifeve.com/aqs-2/
http://ifeve.com/aqs-3/
But the AbstractQueuedSynchronizer class also contains another group of methods (such as acquireShared); their difference is that the tryAcquireShared and tryReleaseShared methods can tell the framework (through their return values) that it can still accept more requests, and the framework will eventually wake multiple threads through cascading signals.

The ReentrantReadWriteLock class uses 16 bits of the AQS synchronization state to hold the write-lock hold count, and the remaining 16 bits to hold the read-lock hold count. Its WriteLock is built the same way as ReentrantLock, while its ReadLock supports multiple concurrent reader threads through the acquireShared method.

The Semaphore class (a counting semaphore) uses the AQS synchronization state to hold the semaphore's current count. Its acquireShared method decrements the count, or blocks the thread when the count is non-positive; tryRelease increments the count, and may also unblock a thread when the count becomes positive.

The CountDownLatch class uses the AQS synchronization state to represent the count. Only when the count reaches 0 can the acquire operations pass (acquire here is from the AQS point of view; in CountDownLatch it corresponds to the await method).

The FutureTask class uses the AQS synchronization state to represent the run state of an asynchronous computation task (initialized, running, cancelled, completed). Setting (FutureTask's set method) or cancelling (FutureTask's cancel method) a FutureTask invokes the AQS release operation; the unblocking of threads waiting for the computed result is implemented via the AQS acquire operation.

The SynchronousQueue class (a CSP (Communicating Sequential Processes) style of handoff) uses internal wait nodes that coordinate producers and consumers. It uses the AQS synchronization state to allow a producer to proceed when a consumer takes the current item, and vice versa.



The main purpose of LockSupport is to solve this problem through a permit state.
http://whitesock.iteye.com/blog/1336409
Just as every Object has a lock, every Object also has a wait set, operated on by the wait, notify, notifyAll and Thread.interrupt methods. An entity that owns both a lock and a wait set is usually called a monitor. Each Object's wait set is maintained by the JVM and always holds the threads blocked by calling that object's wait method. Because of the interaction mechanism between the wait set and the lock, an object's wait, notify and notifyAll methods may only be called while holding its synchronization lock. This requirement generally cannot be checked at compile time; if it is not met, calling these methods at run time throws IllegalMonitorStateException.
LockSupport guarantees, through its permit mechanism, that if the current thread holds a permit, the park family of methods consumes that permit and returns immediately (without blocking).

The WaitQueue is the core of AbstractQueuedSynchronizer; it holds the blocked threads. Its implementation is a variant of the "CLH" (Craig, Landin, and Hagersten) lock queue.
The static inner class Node of AbstractQueuedSynchronizer maintains a FIFO wait queue. Unlike CLH, a Node holds references to its predecessor and successor. The predecessor reference exists to support lock-wait timeout and cancellation; the successor reference exists to support thread blocking.
A Node also includes a volatile int waitStatus member used to control blocking/waking of threads and to avoid unnecessary calls to LockSupport's park/unpark methods.
http://whitesock.iteye.com/blog/1337374
The shouldParkAfterFailedAcquire method ensures that before each thread is blocked, the waitStatus of its node in the WaitQueue is set to Node.SIGNAL (-1), so that unnecessary unpark operations can be avoided at release time. shouldParkAfterFailedAcquire also cleans timed-out or cancelled nodes out of the WaitQueue. Note that before a thread is finally blocked, tryAcquire may be called multiple times.

In the release method, the search for a successor always starts from the head node, and unparkSuccessor is called only when that successor's waitStatus is set. unparkSuccessor first clears the previously set Node.waitStatus, then searches backward and wakes the first successor that needs waking. Note that in the branch if (s == null || s.waitStatus > 0), the search starts from the tail node and moves forward along prev references. As mentioned in Inside AbstractQueuedSynchronizer (2), Node.next being null does not necessarily mean there is no successor; although the WaitQueue is a doubly linked list, following next references backward to find a successor is unreliable, while following prev references forward to find a predecessor is always reliable.
Fairness
tryAcquire is called before acquireQueued, which means a newly arriving thread may acquire before threads already waiting. This scenario is called a barging FIFO strategy, and it provides higher throughput.
Most AbstractQueuedSynchronizer subclasses provide both fair and non-fair implementations; for example, ReentrantLock provides NonfairSync and FairSync.
FairSync gives waiting threads priority to acquire. But fairness is not absolute: in a concurrent multithreaded environment, even if lock acquisition is fair, the ordering of subsequent processing is not guaranteed.

Since NonfairSync is used by default, what scenarios suit FairSync? Consider FairSync when the code section protected by the lock takes a long time to execute and the application cannot tolerate thread starvation (with NonfairSync a thread may queue for a long time and still fail to acquire the lock). For ReentrantLock, passing true to its constructor builds a fair lock.

http://whitesock.iteye.com/blog/1337539
3.6 ConditionObject
The AbstractQueuedSynchronizer inner class ConditionObject implements the Condition interface. Condition provides an interface similar to Java's built-in monitor mechanism: await()/signal()/signalAll(), plus await variants that support timeout and cancellation. Any number of ConditionObjects can be associated with a synchronizer; for example, calling ReentrantLock.newCondition() constructs a ConditionObject instance. Each ConditionObject internally maintains a ConditionQueue whose elements, like those of the AQS WaitQueue, are Node objects.

http://ifeve.com/introduce-abstractqueuedsynchronizer/
The synchronizer can work in either exclusive mode or shared mode. When it is defined as exclusive, acquisition by other threads is blocked, while in shared mode multiple threads can all acquire successfully.
The synchronizer is oriented toward thread access and resource control: it defines whether threads can acquire the resource and how threads queue, among other operations.

A queue element Node is a container holding a thread reference and the thread's state; each thread's access to the synchronizer can be viewed as a node in the queue.
Node nextWaiter stores the successor node in the condition queue.
setExclusiveOwnerThread(Thread.currentThread());
// simply delegate the operations to Sync

1. Try to acquire (call tryAcquire to change the state; this must be atomic).
tryAcquire uses the state-manipulation methods the synchronizer provides; compareAndSet guarantees that only one thread can successfully modify the state, and threads that fail to modify it enter the sync queue to wait.
2. If acquisition fails, wrap the current thread into a Node and add it to the sync queue.
Each thread entering the queue becomes a Node, forming a doubly linked queue similar to a CLH queue; the goal is to limit inter-thread communication to a small scope (roughly two adjacent nodes).
3. Try to acquire again; if that still fails, remove the current thread from the scheduler and enter the waiting state.
final boolean acquireQueued(final Node node, int arg) {
1. Get the current node's predecessor.
2. If the predecessor is the head node and the state can be acquired, the current node holds the lock;
if these conditions hold, the node can own the lock, so set the head node to the current node.
3. Otherwise, enter the waiting state.
public final boolean release(int arg) {
1. Try to release the state;
2. Wake the thread contained in the current node's successor.
public final void acquireShared(int arg) {
4. Shared state acquired successfully;
On the condition for leaving the queue, the main difference from the exclusive lock is the behavior after successfully acquiring the shared state: after a successful shared acquire, check whether the successor node is also in shared mode; if it is, wake it directly, thereby triggering multiple threads to run concurrently.
http://my.oschina.net/lifany/blog/173019
In AQS there are two important data structures: a volatile int state, and a doubly linked list made of class Node.
int state
As the name implies, this variable represents the state of the AQS, for example the lock state and reentry count of ReentrantLock, the task state in FutureTask, the count in CountDownLatch, and so on. Updates to this value all go through the AQS compareAndSetState method, which is implemented with a compare-and-swap algorithm.
Node doubly linked list

State
In AQS there is a volatile variable of type int named state; classes that use AQS can define their own meaning for it. For example, ReentrantLock uses 0 to mean no thread holds the lock and values greater than 0 for the reentry count of the reentrant lock; Semaphore uses it for the number of permits; FutureTask uses it for the task state, such as running or completed.

When extending AQS, subclasses use the compareAndSetState method in methods such as tryAcquire to set the state value as their needs require.
Take CountDownLatch as an example; its tryAcquireShared is implemented as follows:
tryAcquire and tryAcquireShared differ in one major respect: their return values. tryAcquire returns boolean, indicating that the acquire succeeded or failed, whereas tryAcquireShared returns int, with negative, zero, and positive values meaning failure, exclusive acquisition, and shared acquisition respectively. This is the reason for the "may (but need not)" wording in the AQS documentation's description of exclusive and shared modes.

protected int tryAcquireShared(int acquires) {
    return (getState() == 0) ? 1 : -1;
}
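The same pattern can be used to build a minimal custom synchronizer. This sketch is a one-shot boolean latch adapted from the pattern shown in the AbstractQueuedSynchronizer javadoc (state 0 = closed, 1 = open); threads that acquire block until someone calls open():

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class BooleanLatch {
    private static class Sync extends AbstractQueuedSynchronizer {
        boolean isSignalled() { return getState() != 0; }

        // shared acquire succeeds (positive return) once the latch is open
        protected int tryAcquireShared(int ignore) {
            return isSignalled() ? 1 : -1;
        }

        // release opens the latch for everyone, so always return true
        protected boolean tryReleaseShared(int ignore) {
            setState(1);
            return true;
        }
    }

    private final Sync sync = new Sync();

    public boolean isOpen() { return sync.isSignalled(); }
    public void open() { sync.releaseShared(1); }
    public void await() throws InterruptedException {
        sync.acquireSharedInterruptibly(1);
    }

    public static void main(String[] args) throws InterruptedException {
        BooleanLatch latch = new BooleanLatch();
        System.out.println(latch.isOpen()); // false
        latch.open();
        latch.await();                      // returns immediately once open
        System.out.println(latch.isOpen()); // true
    }
}
```

Because tryAcquireShared returns a positive value once the latch is open, the framework propagates the wake-up to all queued waiters, which is exactly the cascading-signal behavior described above.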

private volatile int state;
private transient volatile Node head;
private transient volatile Node tail;

Monday, March 10, 2014

Notes on Apache UIMA

http://uima.apache.org/d/uimaj-2.6.0/references.html
The feature's rangeTypeName specifies the type of value that the feature can take. This may be the name of any type defined in your type system, or one of the predefined types. All of the predefined types have names that are prefixed with uima.cas or uima.tcas, for example:
uima.cas.TOP 
uima.cas.String
uima.cas.Long 
uima.cas.FSArray
uima.cas.StringList
uima.tcas.Annotation.
For a complete list of predefined types, see the CAS API documentation.
The elementType of a feature is optional, and applies only when the rangeTypeName is uima.cas.FSArray or uima.cas.FSList. The elementType specifies what type of value can be assigned as an element of the array or list. This must be the name of a non-primitive type. If omitted, it defaults to uima.cas.TOP, meaning that any FeatureStructure can be assigned as an element of the array or list. Note: depending on the CAS interface that you use in your code, this constraint may or may not be enforced. Note: at run time, the elementType is available from a runtime Feature object (using the a_feature_object.getRange().getComponentType() method) only when specified for uima.cas.FSArray ranges; it isn't available for uima.cas.FSList ranges.
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.uima/textmarker-core/2.0.0/org/apache/uima/textmarker/engine/InternalTypeSystem.xml
       <featureDescription>
          <name>rules</name>
          <description/>
          <rangeTypeName>uima.cas.FSArray</rangeTypeName>
          <elementType>org.apache.uima.textmarker.type.DebugRuleMatch</elementType>
        </featureDescription>

Sofa: Subject of Analysis

Check C:\Users\usera\uima.log
Running The UIMA Analysis Example
UIMA PEAR Installer User's Guide
runPearInstaller.bat then cvd.bat
To launch the PEAR Installer, use the script in the UIMA bin directory: runPearInstaller.bat or runPearInstaller.sh.
http://uima.apache.org/downloads/sandbox/simpleServerUserGuide/simpleServerUserGuide.html
https://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html
An Analysis Engine (AE) is a program that analyzes artifacts (e.g. documents) and infers information from them.
An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE). Primitive and aggregate AEs implement the same interface and can be used interchangeably by applications.

Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs. An annotation is a particular type of Feature Structure that is attached to a region of the artifact being analyzed (a span of text in a document, for example).

It is also possible for annotators to record information associated with the entire document rather than a particular span (these are considered Feature Structures but not Annotations).
All feature structures, including annotations, are represented in the UIMA Common Analysis Structure(CAS). The CAS is the central data structure through which all UIMA components communicate. 

Defining Types
The first step in developing an annotator is to define the CAS Feature Structure types that it creates. This is done in an XML file called a Type System Descriptor. UIMA defines basic primitive types as well as arrays of these primitive types. UIMA also defines the built-in types TOP, which is the root of the type system, analogous to Object in Java; FSArray, which is an array of Feature Structures (i.e. an array of instances of TOP); and Annotation.
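As a sketch, a Type System Descriptor defining a RoomNumber annotation type would look roughly like the following (names follow the UIMA tutorial; treat the details as illustrative):

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TutorialTypeSystem</name>
  <types>
    <typeDescription>
      <name>org.apache.uima.tutorial.RoomNumber</name>
      <description>A room number mentioned in the text.</description>
      <!-- Inheriting from Annotation gives the type begin/end/sofa features. -->
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>building</name>
          <description>Building containing this room.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
```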

The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into. The sofa feature can be ignored for now since we assume in this tutorial that the CAS contains only one subject of analysis (document).
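To make the offset convention concrete, here is a tiny plain-Java sketch (no UIMA dependency; the class name and values are hypothetical) of what an annotation's getCoveredText() does with begin/end:

```java
// Minimal stand-in for a UIMA Annotation: begin/end are character offsets
// into the sofa (document) text.
public class OffsetDemo {
    // In UIMA, annotation.getCoveredText() performs exactly this lookup.
    static String coveredText(String documentText, int begin, int end) {
        return documentText.substring(begin, end);
    }

    public static void main(String[] args) {
        String doc = "The meeting is in room HAW GN-K35.";
        // An annotator would create an annotation with these offsets.
        System.out.println(coveredText(doc, 23, 33)); // HAW GN-K35
    }
}
```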

Developing Your Annotator Code
Annotator implementations all implement a standard interface (AnalysisComponent) with several methods, the most important of which are initialize, process, and destroy.
There is a default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which implements all required methods except process.

Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend from this class, so they only have to implement the process method. 

Finally, we call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations.
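The heart of such a process method is just a regex scan producing (begin, end) spans. Here is a standalone sketch of that loop, with plain int[] pairs standing in for UIMA annotations so it runs without the framework (in real annotator code each match would become new RoomNumber(aJCas, start, end) followed by addToIndexes(); the example pattern is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanFinder {
    // Find all (begin, end) spans matched by any pattern, per pattern in document order.
    static List<int[]> findSpans(String docText, Pattern[] patterns) {
        List<int[]> spans = new ArrayList<>();
        for (Pattern p : patterns) {
            Matcher m = p.matcher(docText);
            while (m.find()) {
                // UIMA equivalent: new RoomNumber(aJCas, m.start(), m.end()).addToIndexes();
                spans.add(new int[] { m.start(), m.end() });
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Pattern[] patterns = { Pattern.compile("\\b[0-4]\\d-[0-2]\\d{2}\\b") };
        for (int[] s : findSpans("Meet in 12-123 at noon.", patterns)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```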

The UIMA architecture requires that descriptive information about an annotator be represented in an XML file and provided along with the annotator class file(s) to the UIMA framework at run time. This XML file is called an Analysis Engine Descriptor. The descriptor includes:
Name, description, version, and vendor
The annotator's inputs and outputs, defined in terms of the types in a Type System Descriptor
Declaration of the configuration parameters that the annotator accepts
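The "Patterns" parameter read in the initialize method below, for example, would be declared in the descriptor roughly like this (a sketch of the configurationParameters section; the description text is illustrative):

```xml
<configurationParameters>
  <configurationParameter>
    <name>Patterns</name>
    <description>Regular expressions the annotator should match.</description>
    <type>String</type>
    <multiValued>true</multiValued>
    <mandatory>true</mandatory>
  </configurationParameter>
</configurationParameters>
```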

Use Document Analyzer to Test Annotator
Accessing Parameter Values from the Annotator Code
import java.util.regex.Pattern;

import org.apache.uima.UimaContext;
import org.apache.uima.resource.ResourceInitializationException;

public void initialize(UimaContext aContext) throws ResourceInitializationException {
  super.initialize(aContext);
  // Get configuration parameter values
  String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns");
  // Compile the regular expressions once, at initialization time
  mPatterns = new Pattern[patternStrings.length];
  for (int i = 0; i < patternStrings.length; i++) {
    mPatterns[i] = Pattern.compile(patternStrings[i]);
  }
}
The UimaContext is the annotator's access point for all of the facilities provided by the UIMA framework, for example logging and external resource access.

Logging
getContext().getLogger().log(Level.FINEST,"Found: " + annotation);

1.3. Building Aggregate Analysis Engines
<flowConstraints>
  <fixedFlow>
    <node>RoomNumber</node>
    <node>DateTime</node>
  </fixedFlow>
</flowConstraints>

http://uima.apache.org/doc-uima-pears.html
Generating PEAR files
Independent of how PEAR packages are generated, PEAR macros or PEAR variables should be recognized and used. The PEAR architecture defines various macros, but the most important one is the $main_root macro. When using this macro in the installation descriptor or within a UIMA descriptor, it will be substituted with the real PEAR package installation path to the main component root directory after the PEAR package is installed on the target system. For example, this macro can be used to specify the classpath settings for a PEAR component as shown in some of the examples below.
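For example, a PEAR installation descriptor fragment might reference $main_root like this (element names follow the PEAR installation descriptor; the component ID and file names are illustrative):

```xml
<SUBMITTED_COMPONENT>
  <ID>RoomNumberAnnotator</ID>
  <NAME>Room Number Annotator</NAME>
  <!-- $main_root is substituted with the actual installation directory
       of the PEAR's main component when the package is installed. -->
  <DESC>$main_root/desc/RoomNumberAnnotatorDescriptor.xml</DESC>
</SUBMITTED_COMPONENT>
```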

http://uima.apache.org/doc-uima-annotator.html#Packaging the annotator
Use PEAR packager
Right-click on the RoomNumberAnnotator project and call "Generate PEAR file".

Aggregate PEAR
http://uima.apache.org/doc-uima-pears.html
During the installation, the package content is extracted and the internal PEAR settings (PEAR macros) are updated with the actual install information. This also means that an installed PEAR package cannot be moved to another directory without internal changes.
Running installed PEAR files

The PEAR package descriptor can also be added to an aggregate analysis engine descriptor as one of the delegates, so a PEAR can easily be integrated into an analysis chain. Note, however, that the integrated PEAR is treated as a black box: the aggregate analysis engine cannot override any PEAR-specific parameters or settings, since the PEAR is executed in its own environment with a separate classloader. This also means that resources cannot easily be shared between PEARs. An advantage of this design is that, for example, PEAR-specific JCas classes do not affect the application in case of minor feature differences.

https://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs-uima-as/html/uima_async_scaleout/uima_async_scaleout.html
An AS service that is an Aggregate Analysis Engine where the Delegates are also AS components.

http://sujitpal.blogspot.com/2011/12/uima-annotator-to-identify-chemical.html
http://uima.apache.org/annotators
http://metamap.nlm.nih.gov/Docs/README_uima.html
http://mmtx.nlm.nih.gov/

open nlp
https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
opennlp TokenizerTrainer.conllx help 
opennlp TokenizerTrainer.conllx -model en-pos.bin ...
open nlp Models
http://opennlp.sourceforge.net/models-1.5/
Sentence Detection
opennlp SentenceDetector E:\jeffery\src\apache\opennlp\models-1.5\en-sent.bin
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
InputStream modelIn = new FileInputStream("en-sent.bin");
SentenceModel model = new SentenceModel(modelIn);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
String[] sentences = sentenceDetector.sentDetect("  First sentence. Second sentence. ");

Training Tool
The data must be converted to the OpenNLP Sentence Detector training format, which is one sentence per line. An empty line indicates a document boundary. If the document boundaries are unknown, it is recommended to insert an empty line every few tens of sentences.
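In this format, a tiny hypothetical en-sent.train might look like (the sentences are made up; the blank line marks a document boundary):

```
The driver was badly injured in the accident .
He was taken to the hospital .

A new sentence detector model was trained afterwards .
```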
$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8

Tokenization
opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt

http://opennlp.apache.org/documentation/manual/opennlp.html#org.apche.opennlp.uima
Go to the apache-opennlp/opennlp folder and type "mvn install" to build everything.
ant -f createPear.xml 

UIMA Annotator
AggregateSentenceAE
WhitespaceTokenizer
HMMTagger

http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tools/tools.html#ugr.tools.pear.installer
runPearInstaller.bat
If no installation directory is specified, the PEAR file is installed to the current working directory. 
CAS Visual Debugger (CVD) application.


https://uima.apache.org/doc-uima-annotator.html
Testing the annotator
Open the Eclipse "Run dialog"
Expand "Java Application" in the left window and choose "UIMA CAS Visual Debugger". Now select the "Classpath" tab on the right.
Select the "User Entries" in the classpath tab and press the "Add Projects..." button.
Mark the "RoomNumberAnnotator" project in the upcoming dialog and finish with "OK".
Choose "Run -> Load AE" and select the RoomNumberAnnotatorDescriptor.xml file in the desc folder of your Eclipse project.
Copy and paste the text below into the text section of the CVD for testing.

http://uima.apache.org/d/uimaj-2.3.1/tools.html#ugr.tools.pear.installer
PEAR (Processing Engine ARchive) is a new standard for packaging UIMA compliant components. This standard defines several service elements that should be included in the archive package to enable automated installation of the encapsulated UIMA component. The major PEAR service element is an XML Installation Descriptor that specifies installation platform, component attributes, custom installation procedures and environment variables.

