Filtering names based on probability
To determine the probability associated with a particular name, you can call the NameFinderME.probs() method after each sentence has been processed:
double[] spanProbs = finder.probs(names);
This model uses a set of features, specified in the code, to predict which outcome is most likely. These features are designed to distinguish proper names, different types of numeric strings, and the surrounding context of words and tagging decisions.
Many of these features are based on the token being tagged and its adjacent neighbors,
but some are based on the token's class. A token's class is determined by basic
characteristics of the token, such as whether it consists solely of lowercase
characters.
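To make the idea concrete, here is a minimal sketch of a token-class function in plain Java. The class labels and rules below are hypothetical simplifications for illustration, not OpenNLP's actual feature templates:

```java
// Sketch of a simplified token-class function. The labels ("lc", "num",
// "uc", "ic") are hypothetical, chosen only to illustrate the idea of
// classing a token by its surface characteristics.
public class TokenClassDemo {
    public static String tokenClass(String token) {
        if (token.isEmpty()) return "other";
        if (token.chars().allMatch(Character::isLowerCase)) return "lc";  // all lowercase
        if (token.chars().allMatch(Character::isDigit))     return "num"; // all digits
        if (token.chars().allMatch(Character::isUpperCase)) return "uc";  // all caps
        if (Character.isUpperCase(token.charAt(0)))
            return "ic";  // initial capital: a candidate proper name
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(tokenClass("london")); // lc
        System.out.println(tokenClass("London")); // ic
        System.out.println(tokenClass("1984"));   // num
    }
}
```

A feature like "ic" (initial capital) is exactly the kind of signal that helps a model distinguish the proper name London from the common word london appearing mid-sentence.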
This leads to a dual approach where some documents are picked randomly and some are
picked based on proximity/probability.
Carrot2 comes with two clustering implementations: STC (suffix tree clustering) and Lingo.
At a high level, Lingo uses singular value decomposition (SVD; see http://en.wikipedia.org/wiki/Singular_value_decomposition to learn more) to find good clusters and phrase discovery to identify good labels for those clusters.
For clustering, Mahout relies on data to be in an org.apache.mahout.matrix.Vector
format. A Vector in Mahout is simply a tuple of floats, as in <0.5, 1.9, 100.5>. More
generally speaking, a vector, often called a feature vector, is a common data structure
used in machine learning to represent the properties of a document or other piece of
data to the system. Depending on the data, vectors are often either densely populated
or sparse.
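The dense/sparse distinction can be sketched with plain Java structures (standing in for Mahout's actual Vector classes): a dense vector stores every dimension in an array, while a sparse one stores only the nonzero entries, which matters when a document touches a few hundred terms out of a vocabulary of millions.

```java
import java.util.Map;

// Sketch: dense vs. sparse vector representations, using plain Java
// rather than Mahout's Vector classes.
public class VectorDemo {
    // Dot product that iterates only over the sparse vector's nonzero
    // entries -- the payoff of a sparse representation.
    public static double dot(double[] dense, Map<Integer, Double> sparse) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            sum += dense[e.getKey()] * e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Dense: every dimension stored explicitly, as in <0.5, 1.9, 100.5>.
        double[] dense = {0.5, 1.9, 100.5};
        // Sparse: dimension 1 is absent, hence implicitly 0.0.
        Map<Integer, Double> sparse = Map.of(0, 0.5, 2, 100.5);
        System.out.println(dot(dense, sparse));
    }
}
```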
CREATING VECTORS FROM AN APACHE LUCENE INDEX
Given an index, we can use Mahout’s Lucene utilities to convert the index to a SequenceFile containing Vectors.
K-Means is a simple and
straightforward approach to clustering that often yields good results relatively quickly.
It operates by iteratively adding documents to one of k clusters based on the distance,
as determined by a user-supplied distance measure, between the document and the
centroid of that cluster. At the end of each iteration, the centroid may be recalculated.
The process stops after there's little-to-no change in the centroids or some maximum
number of iterations has passed, since otherwise K-Means isn't guaranteed to converge.
The algorithm is kicked off by either seeding it with some initial centroids or by
randomly choosing centroids from the set of vectors in the input dataset. K-Means
does have some downsides. First and foremost, you must pick k and naturally you’ll get
different results for different values of k. Furthermore, the initial choice for the centroids
can greatly affect the outcome, so you should be sure to try different values as
part of several runs.
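The loop described above (assign each point to its nearest centroid, recompute centroids, stop on convergence or a maximum iteration count) can be sketched in a few lines of plain Java. This is a toy one-dimensional version for illustration, not Mahout's distributed implementation:

```java
import java.util.Arrays;

// Sketch: minimal 1-D K-Means. Assign points to the nearest centroid,
// recompute each centroid as the mean of its cluster, and stop when the
// centroids no longer move or maxIter iterations have passed.
public class KMeansDemo {
    public static double[] kmeans(double[] points, double[] centroids, int maxIter) {
        centroids = centroids.clone();
        for (int iter = 0; iter < maxIter; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {                 // assignment step
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sum[best] += p;
                count[best]++;
            }
            boolean moved = false;                    // update step
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) continue;          // leave empty clusters in place
                double next = sum[c] / count[c];
                if (Math.abs(next - centroids[c]) > 1e-9) moved = true;
                centroids[c] = next;
            }
            if (!moved) break;                        // converged
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        // Seeded with initial centroids, as described above; a different
        // seed can yield different final clusters.
        System.out.println(Arrays.toString(kmeans(points, new double[]{0.0, 10.0}, 100)));
    }
}
```

Running the toy with seeds 0.0 and 10.0 settles on centroids near 1.0 and 9.0, the means of the two obvious groups; seeding both centroids inside one group would illustrate the sensitivity to initial choices mentioned above.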