Nutch extension points
parse-plugins.xml represents a natural ordering for which parsing plugin should get called for a particular mimeType.
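For example, an abridged parse-plugins.xml might look like this (the exact mappings depend on your Nutch version; this fragment is illustrative only):

<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <aliases>
    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>

For text/html the plugins are preferred in the order listed (parse-html first), and the * entry is the fallback for all other mime types.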
ParseFilter and Parser are always used together; both are invoked during the Nutch parse task.
Parser implementations include HtmlParser and TikaParser; only one parser is run for a given page.
In Parser, the WebPage parameter arrives already filled with values from the Fetcher, such as baseUrl, metadata, contentType, and headers.
Its content is a ByteBuffer holding the entire HTML page. At this point title, text, and the other parse-derived values are not yet set (they are null).
The parser parses the content to extract the plain text (HTML tags excluded), the title, and the outlinks:
byte[] contentInOctets = page.getContent().array();   // raw fetched bytes
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
// ... a DOM root is parsed from 'input', then text and title are extracted:
utils.getText(sb, root);   // collect the plain text into 'sb'
text = sb.toString();
Parse parse = new Parse(text, title, outlinks, status);
public interface Parser extends FieldPluggable, Configurable {
  Parse getParse(String url, WebPage page);
}
ParseFilter permits one to add additional metadata to parses provided by the html or tika plugins. All plugins found which implement this extension point are run sequentially on the parse.
org.apache.nutch.parse.ParseFilter
Parse filter(String url, WebPage page, Parse parse,
    HTMLMetaTags metaTags, DocumentFragment doc);
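A minimal sketch of a custom ParseFilter, assuming the Nutch 2.x API above; the class name NoIndexParseFilter and the metadata key "x_noindex" are hypothetical:

import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.Collections;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

// Flags pages whose robots meta tags forbid indexing.
public class NoIndexParseFilter implements ParseFilter {

  private static final Utf8 KEY = new Utf8("x_noindex");   // hypothetical key

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    if (metaTags.getNoIndex()) {
      // putToMetadata is the Gora-generated map accessor on WebPage (Nutch 2.x).
      page.putToMetadata(KEY, ByteBuffer.wrap(new byte[] { 1 }));
    }
    return parse;   // filters run sequentially, so pass the Parse along
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return Collections.emptySet();   // no extra columns needed from the store
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}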
IndexingFilter: used to add new fields to the index; examples are index-basic and index-anchor.
IndexingFilter is called from IndexerJob during the index task. IndexerJob creates an IndexerMapper and runs no reduce tasks.
At this point baseUrl, content, contentType, and outlinks are not set; text and title are set.
private static final Collection<WebPage.Field> FIELDS = new ArrayList<WebPage.Field>();
static {
  FIELDS.add(WebPage.Field.TITLE);
}

public Collection<WebPage.Field> getFields() {
  return FIELDS;
}
FieldPluggable defines the getFields() method.
BasicIndexingFilter, for example, adds the extracted page text as the "content" field:
doc.add("content", TableUtil.toString(page.getText()));
getFields() tells the Nutch GoraMapper which fields to load. By adding a field to getFields() we tell Nutch to load that field from its data store; Nutch returns the data as long as the field has actually been filled.
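Putting the pieces together, a minimal custom IndexingFilter might look like the sketch below (the class name TitleIndexingFilter and the field name "mytitle" are hypothetical; the Nutch 2.x API is assumed):

import java.util.ArrayList;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;

// Copies the page title into a "mytitle" index field.
public class TitleIndexingFilter implements IndexingFilter {

  private static final Collection<WebPage.Field> FIELDS = new ArrayList<WebPage.Field>();
  static {
    // Declare the column we read so GoraMapper loads it from the store.
    FIELDS.add(WebPage.Field.TITLE);
  }

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    if (page.getTitle() != null) {
      doc.add("mytitle", TableUtil.toString(page.getTitle()));
    }
    return doc;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

Because WebPage.Field.TITLE is declared in getFields(), the title column is loaded from the store and page.getTitle() is non-null when filter() runs.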
Each job (such as FetcherJob or ParserJob) collects the fields its plugins need:
IndexerJob collects all fields from all IndexingFilters.
ParserJob collects all fields from the ParseFilters.
FetcherJob collects all fields from ParserJob and ProtocolFactory.
private static Collection<WebPage.Field> getFields(Job job) {
  Configuration conf = job.getConfiguration();
  Collection<WebPage.Field> columns = new HashSet<WebPage.Field>(FIELDS);
  IndexingFilters filters = new IndexingFilters(conf);
  columns.addAll(filters.getFields());
  ScoringFilters scoringFilters = new ScoringFilters(conf);
  columns.addAll(scoringFilters.getFields());
  return columns;
}
org.apache.nutch.storage.StorageUtils.initMapperJob(Job, Collection<Field>, Class<K>, Class<V>, Class<? extends GoraMapper<String, WebPage, K, V>>, Class<? extends Partitioner<K, V>>, boolean):

DataStore<String, WebPage> store = createWebStore(job.getConfiguration(),
    String.class, WebPage.class);
Query<String, WebPage> query = store.newQuery();
query.setFields(toStringArray(fields));   // only the collected fields are loaded
GoraMapper.initMapperJob(job, query, store,
    outKeyClass, outValueClass, mapperClass, partitionerClass, reuseObjects);
GoraOutputFormat.setOutput(job, store, true);
org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String):

Collection<WebPage.Field> fields = getFields(job);
StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,
    IndexerMapper.class);
job.setNumReduceTasks(0);   // map-only job
job.setOutputFormatClass(IndexerOutputFormat.class);
The concrete indexer jobs are SolrIndexerJob and ElasticIndexerJob:
public class SolrIndexerJob extends IndexerJob {}
In IndexerOutputFormat, getRecordWriter creates the NutchIndexWriters, opens them, and returns a RecordWriter responsible for writing each document to the backend (e.g. Solr).
The writers are created via:
final NutchIndexWriter[] writers =
    NutchIndexWriterFactory.getNutchIndexWriters(job.getConfiguration());
public void write(String key, NutchDocument doc) throws IOException {
  for (final NutchIndexWriter writer : writers) {
    writer.write(doc);
  }
}
NutchIndexWriter has two implementations: SolrWriter and ElasticWriter.
In SolrWriter, open() creates an HttpSolrServer and write() sends the document to Solr.
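As a toy sketch of the writer lifecycle described above, here is a NutchIndexWriter that merely counts documents (the open() signature differs across Nutch versions; a TaskAttemptContext parameter is assumed here):

import java.io.IOException;

import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.NutchIndexWriter;

// Counts documents instead of sending them anywhere; useful for dry runs.
public class CountingIndexWriter implements NutchIndexWriter {

  private long written;

  @Override
  public void open(TaskAttemptContext job) throws IOException {
    written = 0;   // a real writer would connect to its backend here
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    written++;     // a real writer would buffer/send the document here
  }

  @Override
  public void close() throws IOException {
    System.out.println("indexed " + written + " documents");
  }
}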
For reference, the Hadoop OutputFormat contract that IndexerOutputFormat implements:

public abstract class OutputFormat<K, V> {
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException;
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;
  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}
IndexingFilter permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse. page.getText() returns the plain text extracted during parsing (see BasicIndexingFilter above).
public interface IndexingFilter extends FieldPluggable, Configurable {
  NutchDocument filter(NutchDocument doc, String url, WebPage page) throws IndexingException;
}
public interface FieldPluggable extends Pluggable {
  public Collection<WebPage.Field> getFields();
}
Gotcha:
There are no inlinks in Nutch.