For reference, the R community is much bigger, and in the Java world the RapidMiner and Weka frameworks have been on the scene for many years.
Do I need Hadoop to use Mahout?
There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the algorithms list. In the future, we might provide more algorithm implementations on platforms more suitable for machine learning, such as Apache Spark.
mahout seqdirectory -i $WORK_DIR/original -o $WORK_DIR/sequencesfiles
Sequence files are a binary encoding of key/value pairs. At the top of the file there is a header with some metadata, which includes:
Version
Key class name
Value class name
Compression
mahout seqdumper -i $WORK_DIR/sequencesfiles/chunk-0 | more
hadoop fs -text part-0000
Mahout gives you the possibility of reading a sequence file and converting every key/value pair into a text format.
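The same conversion can be done from Java. The following is a minimal sketch using Hadoop's SequenceFile.Reader API; the class name SeqDump is made up for illustration, and the path is passed as an argument (for example, $WORK_DIR/sequencesfiles/chunk-0):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        // seqdirectory writes Text keys (file names) and Text values (file contents)
        try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf)) {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                String text = value.toString();
                // print the key and the first 80 characters of the value
                System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
            }
        }
    }
}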
Importing an external data source into HDFS
Typically, the RDBMS will be outside the machine where Sqoop is running.
hdfs namenode -format
sqoop import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose
sqoop --options-file <path_to_file>
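An options file lists one token per line (lines starting with # are comments), and by convention the Sqoop tool name comes first. A possible file, here called import.txt purely as an example, mirroring the import above:

# import.txt -- same import as above, one token per line
import-all-tables
--connect
jdbc:mysql://localhost/bbdatabank
--username
root
-P
--verbose

It is then invoked with:

sqoop --options-file import.txt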
hadoop fs -ls
hadoop fs -tail TEAMS
sqoop export --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose --export-dir /export/ --table results
sqoop job --create myimportjob -- import-all-tables --connect jdbc:mysql://localhost/bbdatabank --username root -P --verbose
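Once created, the saved job can be listed and re-executed at any time:

sqoop job --list
sqoop job --exec myimportjob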
Using the Mahout text classifier to demonstrate the basic use case
Using the Naïve Bayes classifier from code
Using Complementary Naïve Bayes from the command line
Coding the Complementary Naïve Bayes classifier
Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category.
./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
To examine the outcome, you can use the Hadoop fs command-line tool.
hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more
The Naïve Bayes algorithm does not work directly on the words and raw text, but on the weighted vectors associated with the original documents. So now, we need to transform the raw text into vectors of weights and frequencies.
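For reference, the standard TF-IDF weight of a term t in a document d is given below; Mahout's implementation may apply slightly different smoothing, so take this as the idealized formula:

w_{t,d} = tf_{t,d} \cdot \log\left(\frac{N}{df_t}\right)

where tf_{t,d} is the frequency of t in d, N is the total number of documents, and df_t is the number of documents containing t. Words that appear everywhere thus get a low weight, while words concentrated in a few documents get a high one.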
./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
The -lnorm parameter instructs seq2sparse to log-normalize the output vectors
The -nv parameter is optional and outputs the vectors as NamedVectors, so each one keeps the name of its source document
The -wt parameter selects the weighting function to be used (here, TF-IDF)
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
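The trained model can also be used directly from Java code, which is what the "Using the Naïve Bayes classifier from code" use case boils down to. A minimal sketch; the vector cardinality and the empty test vector are placeholders, since in a real application the TF-IDF vector must be built with the same dictionary used at training time:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class ClassifyDoc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // load the model produced by `mahout trainnb -o model`
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
        // placeholder: the cardinality and the TF-IDF weights must come from the training dictionary
        Vector tfidfVector = new RandomAccessSparseVector(100000);
        Vector scores = classifier.classifyFull(tfidfVector);
        // the index of the highest score maps to a category name via the labelindex file
        System.out.println("Best label index: " + scores.maxValueIndex());
    }
}

For the Complementary Naïve Bayes variant, ComplementaryNaiveBayesClassifier can be swapped in for StandardNaiveBayesClassifier.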
mahout vectordump -i 20news-vectors/tfidf-vectors/part-r-00000 -o 20news-vectors/tfidf-vectors/part-r-00000dump
cut -d , -f 2-7 google.csv > training.csv
Compare the closing price of the previous day with that of the current day, and update the last column to BUY or SELL accordingly.
The assumption is that the BUY action depends on a combination of the other input variables, weighted by some coefficients.
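This is precisely the logistic regression model: the probability of the target category is the logistic function applied to a linear combination of the predictors. With the predictors used below and β denoting the learned coefficients, the idealized formula is:

p(\text{BUY} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Open} + \beta_2\,\text{Close} + \beta_3\,\text{High})}}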
mahout trainlogistic --input $WORK_DIR/training/final.csv --output $WORK_DIR/model/model --target Action --predictors Open Close High --types word --features 20 --passes 100 --rate 50 --categories 2
mahout runlogistic --input $WORK_DIR/training/enter.csv --model $WORK_DIR/model/model --auc --confusion
AUC = 0.87
confusion: [[864.0, 176.0], [165.0, 933.0]]
The first parameter is the acronym for Area Under the Curve (the ROC curve). It can be read as a probability: the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative one. So the higher the value between 0 and 1, the better the model separates the two categories; 0.5 would be no better than random guessing.
Besides, we have the confusion matrix, which tells us that the model classifies the first category correctly in 864 out of 1040 test cases (864 + 176) and the second category in 933 out of 1098 (165 + 933), for an overall accuracy of (864 + 933) / 2138 ≈ 0.84.
Using adaptive logistic regression
A recommender is software that is able to make suggestions about new or existing preferences based on previously recorded ones. Broadly, it has to do two things (see the sketch after this list):
Read a huge amount of data that maps users to their preferences for items
Find an item that should be suggested to the user
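A minimal user-based recommender built with Mahout's Taste API could look like the following sketch. The data file preferences.csv (one userID,itemID,rating triple per line), the neighborhood size of 10, and the user/item IDs are illustrative assumptions:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // step 1: read the recorded preferences
        DataModel model = new FileDataModel(new File("preferences.csv"));
        // step 2: find similar users and derive suggestions from their preferences
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3); // 3 items for user 1
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}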