For reference, the R community is much bigger, and in the Java world the RapidMiner and Weka frameworks have been on the scene for many years.
Do I need Hadoop to use Mahout?
There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the algorithms list. In the future, we might provide more algorithm implementations on platforms more suitable for machine learning, such as Apache Spark.
mahout seqdirectory -i $WORK_DIR/original -o $WORK_DIR/sequencesfiles
Sequence files are a binary encoding of key/value pairs. At the top of the file there is a header with some metadata, which includes:
Version
Key class name
Value class name
Compression
mahout seqdumper -i $WORK_DIR/sequencesfiles/chunk-0 | more
hadoop fs -text part-0000
Mahout gives you the possibility of reading a sequence file and converting every key/value pair into a text format.
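The same conversion can be done from Java. The following is a minimal sketch using Hadoop's SequenceFile.Reader API; the class name SeqDump is made up for illustration, and the path is passed as an argument (for example, $WORK_DIR/sequencesfiles/chunk-0):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        // seqdirectory writes Text keys (file names) and Text values (file contents)
        try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf)) {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                String text = value.toString();
                // print the key and the first 80 characters of the value
                System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
            }
        }
    }
}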
Importing an external data source into HDFS
Typically, the RDBMS will be outside the machine where Sqoop is running.
hdfs namenode -format
sqoop import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose
sqoop --options-file <path_to_file>
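An options file lists one token per line (lines starting with # are comments), and by convention the Sqoop tool name comes first. A possible file, here called import.txt purely as an example, mirroring the import above:

# import.txt -- same import as above, one token per line
import-all-tables
--connect
jdbc:mysql://localhost/bbdatabank
--username
root
-P
--verbose

It is then invoked with:

sqoop --options-file import.txt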
hadoop fs -ls
hadoop fs -tail TEAMS
sqoop export --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose --export-dir /export/ --table results
sqoop job --create myimportjob -- import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose
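Once created, the saved job can be listed and re-executed at any time:

sqoop job --list
sqoop job --exec myimportjob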
Using the Mahout text classifier to demonstrate the basic use case
Using the Naïve Bayes classifier from code
Using Complementary Naïve Bayes from the command line
Coding the Complementary Naïve Bayes classifier
Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category.
./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
To examine the outcome, you can use the Hadoop fs command-line tool.
hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more
The Naïve Bayes algorithm does not work directly on the words and raw text, but on the weighted vectors associated with the original documents. So now, we need to transform the raw text into vectors of weights and frequencies.
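For reference, the standard TF-IDF weight of a term t in a document d is given below; Mahout's implementation may apply slightly different smoothing, so take this as the idealized formula:

w_{t,d} = tf_{t,d} \cdot \log\left(\frac{N}{df_t}\right)

where tf_{t,d} is the frequency of t in d, N is the total number of documents, and df_t is the number of documents containing t. Words that appear everywhere thus get a low weight, while words concentrated in a few documents get a high one.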
./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
The -lnorm parameter instructs seq2sparse to log-normalize the output vectors
The -nv parameter is optional and outputs the vectors as NamedVectors, so each one keeps the name of its source document
The -wt parameter selects the weighting function to be used (here, TF-IDF)
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
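The trained model can also be used directly from Java code, which is what the "Using the Naïve Bayes classifier from code" use case boils down to. A minimal sketch; the vector cardinality and the empty test vector are placeholders, since in a real application the TF-IDF vector must be built with the same dictionary used at training time:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class ClassifyDoc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // load the model produced by `mahout trainnb -o model`
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
        // placeholder: the cardinality and the TF-IDF weights must come from the training dictionary
        Vector tfidfVector = new RandomAccessSparseVector(100000);
        Vector scores = classifier.classifyFull(tfidfVector);
        // the index of the highest score maps to a category name via the labelindex file
        System.out.println("Best label index: " + scores.maxValueIndex());
    }
}

For the Complementary Naïve Bayes variant, ComplementaryNaiveBayesClassifier can be swapped in for StandardNaiveBayesClassifier.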
mahout vectordump -i 20news-vectors/tfidf-vectors/part-r-00000 -o 20news-vectors/tfidf-vectors/part-r-00000dump
cut -d , -f 2-7 google.csv > training.csv
Compare the closing price of the previous day with that of the current day, and update the last column to BUY or SELL accordingly.
The assumption is that the BUY action depends on a combination of the other input variables, weighted by some coefficients.
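This is precisely the logistic regression model: the probability of the target category is the logistic function applied to a linear combination of the predictors. With the predictors used below and β denoting the learned coefficients, the idealized formula is:

p(\text{BUY} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Open} + \beta_2\,\text{Close} + \beta_3\,\text{High})}}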
mahout trainlogistic --input $WORK_DIR/training/final.csv --output $WORK_DIR/model/model --target Action --predictors Open Close High --types word --features 20 --passes 100 --rate 50 --categories 2
mahout runlogistic --input $WORK_DIR/training/enter.csv --model $WORK_DIR/model/model --auc --confusion
AUC = 0.87
confusion: [[864.0, 176.0], [165.0, 933.0]]
The first parameter is the acronym for Area Under the Curve (the ROC curve). It can be read as a probability: the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative one. So the higher the value between 0 and 1, the better the model separates the two categories; 0.5 would be no better than random guessing.
Besides, we have the confusion matrix, which tells us that the model classifies the first category correctly in 864 out of 1040 test cases (864 + 176) and the second category in 933 out of 1098 (165 + 933), for an overall accuracy of (864 + 933) / 2138 ≈ 0.84.
Using adaptive logistic regression
A recommender is software that is able to make suggestions about new or existing preferences based on previously recorded ones. Broadly, it has to do two things (see the sketch after this list):
Read a huge amount of data that maps users to their preferences for items
Find an item that should be suggested to the user
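A minimal user-based recommender built with Mahout's Taste API could look like the following sketch. The data file preferences.csv (one userID,itemID,rating triple per line), the neighborhood size of 10, and the user/item IDs are illustrative assumptions:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // step 1: read the recorded preferences
        DataModel model = new FileDataModel(new File("preferences.csv"));
        // step 2: find similar users and derive suggestions from their preferences
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3); // 3 items for user 1
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}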