SolrCloud
http://wiki.apache.org/solr/SolrCloud
Simple two shard cluster
cd example
java -Djetty.port=8983 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
Two shard cluster with shard replicas
cd exampleB
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
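Once the nodes are up, you can check the cluster layout in the Cloud tab of the admin UI (assuming the standard Solr 4.x example setup):
http://localhost:8983/solr/#/~cloud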
shards.tolerant=true
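For example (assuming the collection1 used throughout these examples), a distributed query can be told to return partial results rather than an error when some shards are unreachable:
http://localhost:8983/solr/collection1/select?q=*:*&shards.tolerant=true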
SolrCloud uses leaders and an overseer as an implementation detail. This means that some nodes/replicas will play special roles. You don't need to worry about whether the instance you kill is a leader or the cluster overseer; if you happen to kill one of these, automatic failover will choose new leaders or a new overseer transparently, and they will seamlessly take over their respective jobs. Any Solr instance can be promoted to one of these roles.
Every ZooKeeper server needs to know about every other ZooKeeper server in the ensemble, and a majority of servers are needed to provide service.
The default is to run an embedded ZooKeeper server at the Solr port + 1000 (hostPort+1000); for example, a Solr node on port 8983 runs its embedded ZooKeeper on port 9983, which is why the other nodes above use -DzkHost=localhost:9983.
Two shard cluster with shard replicas and ZooKeeper ensemble
cd example
java -Djetty.port=8983 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar
cd example2
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
cd exampleB
java -Djetty.port=8900 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
Multiple ZooKeeper servers running together for fault tolerance and high availability are called an ensemble.
ZooKeeper works best when it has a dedicated machine. ZooKeeper is a time-sensitive service, and a dedicated machine helps ensure timely responses. A dedicated machine is not required, however.
ZooKeeper works best when you put its transaction log and snapshots on different disk drives.
If you do colocate ZooKeeper with Solr, using separate disk drives for Solr and ZooKeeper will help with performance.
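As a rough sketch (the paths and hostnames below are placeholders, and this assumes a standalone three-node ZooKeeper install rather than the embedded server), the snapshot and transaction-log locations are separated with the dataDir and dataLogDir settings in zoo.cfg:
# zoo.cfg - illustrative values only
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/disk1/zookeeper/data
dataLogDir=/disk2/zookeeper/txnlog
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888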
Collections API
Create: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4
Delete: http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
Split Shard: http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=<collection_name>&shard=shardId
Collection Aliases
CreateAlias http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias&collections=collection1,collection2,…
DeleteAlias http://localhost:8983/solr/admin/collections?action=DELETEALIAS&name=alias
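As a quick sketch (the alias and collection names here are made up, and this assumes the alias can be used in place of a collection name on the query side, which is the intended use of read aliases):
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=both_years&collections=collection1_2013,collection1_2014'
curl 'http://localhost:8983/solr/collection1/select?q=*:*&collection=both_years'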
Creating cores via CoreAdmin
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collection1&shard=shard2'
Distributed Requests
Query all shards of a compatible collection, explicitly specified:
http://localhost:8983/solr/collection1/select?collection=collection1_recent
Query all shards of multiple compatible collections, explicitly specified:
http://localhost:8983/solr/collection1/select?collection=collection1_NY,collection1_NJ,collection1_CT
Query specific shard ids of the (implicit) collection:
http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001
Explicitly specify the addresses of the shards you want to query:
http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr,localhost:7574/solr
Explicitly specify the addresses of the shards you want to query, giving alternatives (delimited by |) used for load balancing and fail-over:
http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
-DzkRun Starts up a ZooKeeper server embedded within Solr. This server will manage the cluster configuration.
-DzkHost=<ZooKeeper Host:Port> Tells Solr where ZooKeeper is; usually this is a comma-separated list of the host:port addresses of every server in the ensemble.
-DnumShards Determines how many pieces you're going to break your index into.
-Dbootstrap_confdir ZooKeeper needs to get a copy of the cluster configuration, so this parameter tells it where to find that information.
-Dcollection.configName This parameter determines the name under which that configuration information is stored by ZooKeeper.
The -DnumShards, -Dbootstrap_confdir, and -Dcollection.configName parameters need only be specified once, the first time you start Solr in SolrCloud mode. They load your configurations into ZooKeeper; if you run them again at a later time, they will re-load your configurations and may wipe out changes you have made.
With Replicas
The architecture consists of the original shards, which are called the leaders, and their replicas, which contain the same data but let the leader handle all of the administrative tasks such as making sure data goes to all of the places it should go. This way, if one copy of the shard goes down, the data is still available and the cluster can continue to function.
Because the cluster already knew that there were only two shards and that they were already accounted for, the new nodes were added as replicas.
When a document is sent to a replica for indexing, the course of events goes like this:
Replica (in this case the server on port 7500) gets the request.
Replica forwards request to its leader (in this case the server on port 7574).
The leader processes the request, and makes sure that all of its replicas process the request as well.
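For example (a sketch only, assuming the four-node setup above and the stock example schema, with a made-up document), you can post a document to the replica on port 7500 and watch the logs on ports 7500 and 7574 to see it forwarded to its leader:
curl 'http://localhost:7500/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"doc1","name":"test document"}]'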
Using Multiple ZooKeepers in an Ensemble
A ZooKeeper ensemble can keep running as long as more than half of its servers are up and running.
Note that the order of the parameters matters: make sure to specify the -DzkHost parameter after the other ZooKeeper-related parameters.
How SolrCloud Works
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple cores on different machines.
The cores that make up one logical index are called a collection. A collection is essentially a single index that can span many cores, both for index scaling as well as redundancy.
Collections can be divided into slices. Each slice can exist in multiple copies; these copies of the same slice are called shards. One of the shards within a slice is the leader, designated by a leader-election process. Each shard is a physical index, so one shard corresponds to one core.
A cluster is a set of Solr nodes managed by ZooKeeper as a single unit.
A cluster is created as soon as you have more than one Solr instance registered with ZooKeeper.
Resizing a Cluster
Clusters contain a settable number of shards. You set the number of shards for a new cluster by passing a system property, numShards, when you start up Solr. The numShards parameter must be passed on the first startup of any Solr node, and is used to auto-assign which shard each instance should be part of. Once you have started up more Solr nodes than numShards, the nodes will create replicas for each shard, distributing them evenly across the nodes, as long as they all belong to the same collection.
The number of shards determines how the data in your index is broken up, so you cannot change the number of shards of the index after initially setting up the cluster. However, you do have the option of breaking your index into multiple shards to start with.
Shards and Indexing Data in SolrCloud
There were several problems with the older distributed approach that necessitated improvement with SolrCloud:
Splitting of the core into shards was somewhat manual.
There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them, and if one shard died it was just gone.
SolrCloud fixes all those problems. There is support for distributing both the index process and the queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have multiple replicas for additional robustness.
In SolrCloud there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.
If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.
When a document is sent to a machine for indexing, the system first determines whether the machine is a replica or a leader.
If the machine is a replica, the document is forwarded to the leader for processing.
If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.
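As a rough way to see the result (assuming the two-shard example cluster above), distrib=false restricts a query to the core it hits instead of fanning out across the collection, so comparing per-node counts shows how documents were split across shards:
curl 'http://localhost:8983/solr/collection1/select?q=*:*&distrib=false&rows=0'
curl 'http://localhost:7574/solr/collection1/select?q=*:*&distrib=false&rows=0'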
Document Routing
Shard Splitting
The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready.
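A concrete call (assuming the collection1 example above, whose shards get the default names shard1 and shard2) might look like:
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'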
Distributed Requests
http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr
You also have the option of searching multiple collections:
http://localhost:8983/solr/collection1/select?collection=collection1,collection2,collection3