JYuan Learning Log: October 2014

Wednesday, October 29, 2014

Learning AngularJS

Bootstrapping AngularJS: this is done through the ng-app directive.

{{1 + 2}}: The double-curly notation is AngularJS syntax for either one-way data binding or AngularJS expressions. If it refers to a variable, AngularJS keeps the UI up to date as the value changes. If it is an expression, AngularJS evaluates it and updates the UI whenever the value of the expression changes.

<input type="text" ng-model="name" placeholder="Enter your name">

The ng-model directive is used with input fields whenever we want the user to enter any data and get access to the value in JavaScript. Here, we tell AngularJS to store the value that the user types into this field in a variable called name.

<span ng-bind="name"></span>

ng-bind and the double-curly notation are interchangeable.


Modules are AngularJS’s way of packaging relevant code under a single name. For someone coming from a Java background, a simple analogy is to think of modules as packages.

An AngularJS module has two parts to it:


A module can define its own controllers, services, factories, and directives. These are functions and code that can be accessed throughout the module.
The module can also depend on other modules, which are declared when the module is instantiated. What this means is that AngularJS will find the module with that particular name and ensure that any functions, controllers, services, etc. defined in that module are made available to all the code defined in this module.

In addition to being a container for related JavaScript, the module is also what AngularJS uses to bootstrap an application. What that means is that we can tell AngularJS what module to load as the main entry point for the application by passing the module name to the ng-app directive.



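// Define a new module named notesApp with no dependencies
angular.module('notesApp', []);

// Define notesApp with two modules as dependencies
angular.module('notesApp',
    ['notesApp.ui', 'thirdCompany.fusioncharts']);

// Load an already defined module named notesApp (no second argument)
angular.module('notesApp');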
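
To make the define-versus-load distinction concrete, here is a minimal sketch in plain JavaScript. It is not AngularJS code: miniModule and registry are hypothetical names invented for this note, and they only mimic the rule that passing a dependency array defines a module while omitting it looks up an existing one.

```javascript
// Hypothetical stand-in for angular.module, for illustration only.
var registry = {};

function miniModule(name, deps) {
  if (deps !== undefined) {
    // Second argument present: define (or redefine) the module.
    registry[name] = { name: name, requires: deps };
  } else if (!registry[name]) {
    // No second argument: the module must already exist.
    throw new Error('Module ' + name + ' is not available');
  }
  return registry[name];
}

var defined = miniModule('notesApp', []); // defines the module
var fetched = miniModule('notesApp');     // loads the same module

console.log(defined === fetched);         // true
```

Calling the one-argument form before the module has been defined throws, which is also why a forgotten [] in the first definition is a common mistake.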

The ng-app directive takes an optional argument, which is the name of the module to load during bootstrapping.
<html ng-app="notesApp">

An AngularJS controller is almost always directly linked to a view or HTML. We will never have a controller that is not used in the UI (that kind of business logic goes into services). It acts as the gateway between our model, which is the data that drives our application, and the view, which is what the user sees and interacts with.

angular.module('notesApp', [])
    .controller('MainCtrl', [function() {
      // Controller-specific code goes here
      console.log('MainCtrl has been created');
    }]);

ng-controller="MainCtrl"

This tells AngularJS to create an instance of the controller with the given name and attach it to the DOM element.

ng-controller="MainCtrl as ctrl"

{{ctrl.helloMsg}} AngularJS.
  {{ctrl.goodbyeMsg}} AngularJS

angular.module('notesApp', [])
    .controller('MainCtrl', [function() {
      this.helloMsg = 'Hello ';    // on the instance, visible to the HTML
      var goodbyeMsg = 'Goodbye '; // local variable, not visible to the HTML
    }]);

Variables that are defined on the this keyword in the controller are accessible from the HTML, but local, inner variables are not.

Furthermore, any variable defined on the controller instance (on this in the controller, as opposed to declaring variables with the var keyword like goodbyeMsg) can be accessed and displayed to the user via the HTML. This is basically how we funnel and expose data from our controller and business logic to the UI.
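
The same behavior can be seen without AngularJS at all. The sketch below uses a plain constructor function as a stand-in for the controller (creating it by hand with new is an assumption made for this example; AngularJS does the instantiation for us in a real app).

```javascript
// Plain constructor standing in for the controller function.
function MainCtrl() {
  this.helloMsg = 'Hello ';    // stored on the instance
  var goodbyeMsg = 'Goodbye '; // local to the function, gone once it returns
}

// Create the instance by hand, much like AngularJS does for ng-controller.
var ctrl = new MainCtrl();

console.log(ctrl.helloMsg);    // 'Hello '
console.log(ctrl.goodbyeMsg);  // undefined
```

This is why {{ctrl.goodbyeMsg}} renders as empty in the HTML: the view can only reach what lives on the instance.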

button ng-click="ctrl.changeMessage()"

As good practice, we avoid referring to the this keyword inside the controller, preferring to use a proxy variable, self, which points to this.

The this keyword inside a function can be overridden by whoever calls the function. Thus, the this outside and inside a function can refer to two completely different objects or scopes.

Thus, it is generally better to assign the this reference inside a controller to a proxy variable, and always refer to the instance through this proxy (self, for example) to be assured that the instance we are referring to is the correct one.
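
A small plain-JavaScript sketch of the problem, assuming a hand-built controller (the names viaThis, viaSelf, and other are made up for this example). The caller overrides this with call, and only the function that goes through self still sees the controller instance.

```javascript
function MainCtrl() {
  var self = this;            // proxy that always points to the instance
  self.message = 'Hello';

  // Relies on this, which the caller is free to override.
  self.viaThis = function() { return this.message; };

  // Relies on self, which is fixed when the controller is created.
  self.viaSelf = function() { return self.message; };
}

var ctrl = new MainCtrl();
var other = { message: 'Someone else' };

// Force a different this, as an event handler or callback might.
console.log(ctrl.viaThis.call(other)); // 'Someone else'
console.log(ctrl.viaSelf.call(other)); // 'Hello'
```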

div ng-repeat="note in ctrl.notes"

AngularJS has a directive called ng-cloak, which is a mechanism to hide sections of the page while AngularJS bootstraps and finishes loading.

AngularJS creates scopes, or contexts, for various elements in the DOM to ensure that there is no global state and that each element accesses only what is relevant to it. These scopes have a parent-child relationship by default, which allows child scopes to access functions and controllers from a parent scope.
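
The parent-child relation can be pictured with ordinary prototypal inheritance. This is only a simplified stand-in built with Object.create, not the real AngularJS scope API, and the property names are invented for the example.

```javascript
// Simplified stand-ins for a parent scope and a child scope.
var parentScope = { appName: 'notesApp' };
var childScope = Object.create(parentScope); // child inherits from parent

childScope.note = 'Buy milk';                // defined on the child only

console.log(childScope.appName); // 'notesApp', found on the parent
console.log(parentScope.note);   // undefined, parents do not see child data
```

Lookups walk up from child to parent, never the other way around, which is what keeps each element limited to what is relevant to it.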

Sunday, October 12, 2014

Learning Mahout

For reference, the R community is much bigger and in the Java world we have had the RapidMiner and Weka frameworks present on the scene for many years.

Do I need Hadoop to use Mahout?
There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the algorithms list. In the future, we might provide more algorithm implementations on platforms more suitable for machine learning, such as Apache Spark.

A recommender is software that makes suggestions on new or existing preferences based on previously recorded preferences.

Read a huge amount of data that maps a user with some preferences for an item

Find an item that should be suggested to the user

mahout seqdirectory -i $WORK_DIR/original -o $WORK_DIR/sequencesfiles

Sequence files are a binary encoding of key/value pairs. Each file starts with a header holding some metadata, which includes:
Version
Key name
Value name
Compression

mahout seqdumper -i $WORK_DIR/sequencesfiles/chunk-0 | more
hadoop dfs -text part-0000

Mahout can read a sequence file and convert every key/value pair into a text format.

Importing an external datasource into HDFS
The RDBMS will be outside the machine where Sqoop is running.

hdfs namenode -format

sqoop import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose

sqoop --options-file <path_to_file>
hadoop fs -ls
hadoop fs -tail TEAMS
sqoop export --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose --export-dir /export/ --table results

sqoop job --create myimportjob -- import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose

Using the Mahout text classifier to demonstrate the basic use case
Using the Naïve Bayes classifier from code
Using Complementary Naïve Bayes from the command line
Coding the Complementary Naïve Bayes classifier
Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category.

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
To examine the outcome, you can use the fs option of the hadoop command line.
hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more

The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weighted vector associated with the original document. So now we need to transform the raw text into vectors of weights and frequencies.

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

The -lnorm parameter (short for --logNormalize) tells Mahout to log-normalize the output vectors
The -nv parameter is an optional parameter that outputs the vectors as named vectors (NamedVector)
The -wt parameter selects the weight function to use, here tfidf
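
To get a feel for what a tfidf weight is, here is a minimal sketch using the textbook formula tf * log(N / df). It is an illustration only: the three tiny documents are made up, and Mahout's own weighting and normalization may differ in detail.

```javascript
// Made-up corpus: each document is already split into words.
var docs = [
  ['mahout', 'runs', 'on', 'hadoop'],
  ['hadoop', 'stores', 'files'],
  ['mahout', 'classifies', 'text']
];

// How often the term occurs in one document.
function termFrequency(term, doc) {
  return doc.filter(function(w) { return w === term; }).length;
}

// log(N / df): rare terms across the corpus get a higher value.
function inverseDocumentFrequency(term, corpus) {
  var containing = corpus.filter(function(d) {
    return d.indexOf(term) !== -1;
  }).length;
  return Math.log(corpus.length / containing);
}

// Map each word of a document to its tf-idf weight.
function tfidfVector(doc, corpus) {
  var weights = {};
  doc.forEach(function(term) {
    weights[term] = termFrequency(term, doc) *
        inverseDocumentFrequency(term, corpus);
  });
  return weights;
}

var vector = tfidfVector(docs[0], docs);
console.log(vector);
```

In the first document, runs appears in only one of the three documents, so it gets a higher weight than mahout, which appears in two.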

mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing

mahout vectordump -i 20news-vectors/tfidf-vectors/part-r-00000 -o 20news-vectors/tfidf-vectors/part-r-00000dump

cut -d , -f 2-7 google.csv > training.csv

Compare the closing price of the previous day with that of the current day, and set the last column to SELL or BUY accordingly.

The assumption is that the BUY action depends on a linear combination of the other input variables (the predictors), weighted by some coefficients.
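
In logistic regression, that combination of inputs and coefficients is passed through the logistic (sigmoid) function to get a probability. A minimal sketch, where the function name probabilityOfBuy and all the numbers are made up for illustration and are not the values Mahout learns:

```javascript
// Squash any real number into the range (0, 1).
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Probability of BUY = sigmoid(intercept + sum of coefficient * input).
function probabilityOfBuy(coefficients, intercept, inputs) {
  var z = intercept;
  for (var i = 0; i < inputs.length; i++) {
    z += coefficients[i] * inputs[i];
  }
  return sigmoid(z);
}

// Hypothetical coefficients for three predictors (Open, Close, High).
var p = probabilityOfBuy([0.8, -0.5, 0.1], 0.2, [1.0, 2.0, 3.0]);
console.log(p > 0.5 ? 'BUY' : 'SELL', p);
```

Training is the search for the coefficients that make these probabilities match the labels in the training file as closely as possible.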

mahout trainlogistic --input $WORK_DIR/training/final.csv --output $WORK_DIR/model/model --target Action --predictors Open Close High --types word --features 20 --passes 100 --rate 50 --categories 2

mahout runlogistic --input $WORK_DIR/training/enter.csv --model $WORK_DIR/model/model --auc --confusion

AUC = 0.87
confusion: [[864.0, 176.0], [165.0, 933.0]]
The first value, AUC, stands for Area Under the Curve, that is, the area under the ROC curve. It can be read as the probability that the model gives a randomly chosen positive example a higher score than a randomly chosen negative one. It ranges between 0 and 1: a value of 0.5 is no better than random guessing, and the closer the value is to 1, the better the model separates the two classes.
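
One common way to read the AUC is as a ranking score: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. A minimal sketch of that pairwise computation, using made-up scores rather than Mahout output:

```javascript
// Fraction of positive/negative pairs ranked correctly (ties count half).
function auc(positiveScores, negativeScores) {
  var wins = 0;
  positiveScores.forEach(function(p) {
    negativeScores.forEach(function(n) {
      if (p > n) {
        wins += 1;
      } else if (p === n) {
        wins += 0.5;
      }
    });
  });
  return wins / (positiveScores.length * negativeScores.length);
}

// Made-up model scores for three positive and three negative examples.
var result = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]);
console.log(result); // 8 of the 9 pairs are ranked correctly
```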

Besides, the confusion matrix tells us that, for the first class, the model classifies 864 out of 1040 test cases (864 + 176) correctly.
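
The overall accuracy can be worked out from the same matrix by dividing the diagonal (correct predictions) by the total number of cases. A short sketch using the numbers printed above:

```javascript
// Confusion matrix as printed by the run above.
var confusion = [[864, 176], [165, 933]];

// Accuracy = sum of the diagonal / sum of all cells.
function accuracy(matrix) {
  var correct = 0;
  var total = 0;
  matrix.forEach(function(row, i) {
    row.forEach(function(count, j) {
      total += count;
      if (i === j) {
        correct += count;
      }
    });
  });
  return correct / total;
}

var overall = accuracy(confusion);  // (864 + 933) / 2138
var firstRow = 864 / (864 + 176);   // share of the first row on the diagonal

console.log(overall.toFixed(4));    // 0.8405
console.log(firstRow.toFixed(4));   // 0.8308
```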

Using adaptive logistic regression