Friday, April 5, 2013

Setting up elasticsearch


Built on top of Apache Lucene, elasticsearch delivers amazing power as a search and analytics engine. It talks JSON over RESTful HTTP, which is really convenient. It has a robust distributed architecture, which makes it an ideal fit for the cloud. Make it open source and you are staring at a beast. (Wipe that drool off your face!)

All this power is bound to make anyone wanting to experiment with it a little queasy. Especially if you have no background in data mining or any experience with Lucene, it can take longer than usual to get things rolling. I fall in that very category. Moreover, the Java API documentation is scanty and leaves much to the imagination.

If elasticsearch teases you too, then read on, maybe it'll make the road smoother for you.

As we go along glancing over some basics, we'll build a small Java application (part II of this post soon) to index some documents into elasticsearch and test our index using curl. Once this is up and running, you can build on from there and father/mother a beautiful little search application.

Setting up elasticsearch is fairly easy. Download and dump! I'll use Windows, but it works the same on Linux or Mac, whatever your poison is. You can pick up the bundle from here.


Basic Configuration

ES comes prepackaged with a basic configuration that's good to go even for a simple production deployment, so you can either not touch the config file at all, or tweak some options just for fun!

The config file elasticsearch.yml lives under $ES_HOME\config ($ES_HOME points to your ES root folder). It is pretty much self-explanatory once you start reading through it. Top down, you'll encounter the following sections:


Cluster

If you are on a network, or even locally if you are running multiple projects that each have their own elasticsearch setup, you might want to change the default cluster name from 'elasticsearch' to, say, your favorite action movie name, so that when you start your own node it does not join some already live cluster running under the same default name.

cluster.name: savingprivateryan

Node

You can safely skip setting node.name for the time being, as ES will generate one for you dynamically, most likely a Marvel comic character name... or not. (We all have our quirks!)

If you want this node to never be a master, which most likely means you don't want it taking care of the cluster, maybe because it lacks processing power or you simply are a tyrant, set

node.master: false

If stripping it of its power doesn't satisfy your ego and you want it to never hold any data either, then set

node.data: false
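The two flags combine into four possible roles, which is easier to see spelled out. Here is a minimal Python sketch of that mapping; the function name and the role descriptions are my own labels for illustration, not anything ES prints.

```python
def describe_node(master: bool = True, data: bool = True) -> str:
    """Name the role implied by the node.master / node.data flags.

    The wording of each role is an illustrative label, not ES output.
    """
    if master and data:
        return "master-eligible data node (the default)"
    if data:
        return "data-only node: holds shards, never elected master"
    if master:
        return "master-only node: coordinates the cluster, holds no data"
    return "client node: holds no data, never master, only routes requests"


# Walk through all four combinations of the two settings.
for master in (True, False):
    for data in (True, False):
        flags = f"node.master: {str(master).lower()}, node.data: {str(data).lower()}"
        print(f"{flags} -> {describe_node(master, data)}")
```

Both flags default to true, so a node you never configure can become master and hold data.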

ES lets you start multiple nodes from the same installation. You can set an upper limit on how many if you want:

node.max_local_storage_nodes: 1

Read the comments slapped all around this section in the actual config file, and everything above will seem redundant. But then again, didn't I tell you that it's self-explanatory?


Index

You can safely skip configuring this section! I say so because any settings made here apply to all the indices on this node, and I am assuming you wouldn't want that. We can set these properties individually for the indices we create dynamically using the Java API. But it's always good to know what they do!

index.number_of_shards : Number of shards you'd want your index to break up into. These get distributed to nodes as and when they join your cluster, so as to load balance. For a local dev setup, you can set it to 1 since you'll have just one machine and hopefully you wouldn't want to run multiple nodes under a cluster on the same machine.

index.number_of_replicas : Number of extra copies of the entire index you want. This acts as a failsafe and provides data redundancy. So let's say you have 5 shards and 1 replica, and you create an index. It'll get broken up into 5 shards, and ES will keep 1 additional copy of each of them. You effectively have 10 shards now (5 primaries and 5 replicas), which can be spread across up to 10 nodes; if you decide to add more nodes to your cluster, that is.
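The arithmetic is simple enough to check by hand, but here is a tiny Python sketch of it anyway. The function name is made up for this post; only the two setting names come from ES.

```python
def total_shards(number_of_shards: int, number_of_replicas: int) -> int:
    """Primaries plus all replica copies for a single index."""
    return number_of_shards * (1 + number_of_replicas)


# 5 primaries with 1 replica each: 5 * (1 + 1)
print(total_shards(5, 1))  # 10

# A local dev setup with a single shard and no replicas.
print(total_shards(1, 0))  # 1
```

That total is also the largest number of nodes that can each hold a shard of the index; any node beyond that would have nothing from this index to store.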

Do keep in mind that the shard count is a one-time setting per index, whether set through the config file (applied to all indices) or dynamically through the API (applied per index). Once an index is created, its shard count cannot be changed. The number of replicas, however, can still be changed later using the update settings API.

If unspecified, all indices are created with 5 shards and 1 replica by default.
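To give an idea of what a replica change looks like on the wire, here is a rough Python sketch that only builds the update settings request, without sending it. The index name "movies", the helper name and the localhost address are placeholders I picked, assuming a node listening on the default HTTP port.

```python
import json
import urllib.request


def replica_update_request(index, replicas, host="http://localhost:9200"):
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# "movies" is a hypothetical index name used only for illustration.
req = replica_update_request("movies", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it against a running node:
# urllib.request.urlopen(req)
```

The same JSON body can be sent with curl using -XPUT against the same URL.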


Paths

Make path.data point to where you'd want ES to store your indices. Do the same for path.logs and path.work. By default they are created inside your ES root folder. If you plan on working with a good amount of data, which you probably do, point path.data to a location that you know has ample space.
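If you are not sure which drive has the room, a quick check like the following Python sketch can help before you commit to a path.data location. The helper name is my own, and "." simply stands in for whatever candidate folder you have in mind.

```python
import shutil


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


# "." is a stand-in; replace it with the folder you plan to use for path.data.
print(f"{free_gib('.'):.1f} GiB free")
```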


Network & HTTP

If you want to change something in this section of the file, then you are probably looking at the wrong blog. This section lets you change the ports ES listens on, which by default are 9200 for HTTP requests and 9300 for TCP (node-to-node transport) traffic, and also lets you bind a specific IP to your current node.

The first node you create will use these default ports. Any node created afterwards on the same host will look for the next available port starting from these defaults, i.e. 9201 and 9301. Shutting down a node will free up these ports immediately.
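The "next available port" behaviour can be pictured with a small simulation. This Python sketch does not touch ES or any real sockets; it just keeps a set of taken ports and counts upwards, assuming HTTP starts at 9200 and transport at 9300.

```python
def next_free_port(default: int, in_use: set) -> int:
    """Return the first port at or above `default` that is not taken."""
    port = default
    while port in in_use:
        port += 1
    return port


in_use = set()
for n in range(1, 4):
    http = next_free_port(9200, in_use)
    transport = next_free_port(9300, in_use)
    in_use.update({http, transport})
    print(f"node {n}: http={http} transport={transport}")

# Shutting down node 1 frees its ports, so the next node started reuses them.
in_use -= {9200, 9300}
print(next_free_port(9200, in_use), next_free_port(9300, in_use))
```

The second node lands on 9201 and 9301, the third on 9202 and 9302, and once the first node's ports are released the next lookup returns 9200 and 9300 again.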


Discovery

By default ES uses multicast to discover nodes in a network and also to elect the master. If multicast is disabled for some reason on your network, or you want to avoid the unnecessary chatter caused by it, or you simply don't care because you just want to run it locally, set

discovery.zen.ping.multicast.enabled: false

The nodes you add will still need to discover each other for the cluster to function properly, so do enable unicast and specify which nodes will be used for the discovery process. For a local setup, point discovery.zen.ping.unicast.hosts to localhost.

discovery.zen.ping.unicast.hosts: ["localhost"]

As you can see it accepts an array of nodes, just in case you want more nodes to participate.
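If you end up maintaining these settings for more than one machine, generating the lines from a script saves some typing. Here is a small Python sketch that renders a few of the settings from this post as elasticsearch.yml lines. The render_settings helper is made up for illustration, and it only handles plain strings, booleans, numbers and lists.

```python
import json


def render_settings(settings: dict) -> str:
    """Render flat settings as `key: value` lines for elasticsearch.yml."""
    lines = []
    for key, value in settings.items():
        # json.dumps writes booleans as true/false and lists as ["a", "b"],
        # both of which are also valid YAML. Plain strings are left unquoted.
        rendered = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines)


print(render_settings({
    "cluster.name": "savingprivateryan",
    "discovery.zen.ping.multicast.enabled": False,
    "discovery.zen.ping.unicast.hosts": ["localhost"],
}))
```

To let more nodes take part in discovery, add their host names to the list.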

We've skipped some sections of the configuration, as they deal with advanced options and this is only an initiation for a novice user. We're not going to configure anything in our setup, as there is no major need to tweak anything. However, I do suggest you play around a little with these settings to get the hang of them.

I'll add another post soon about building a small application using the Java API of ES. Do let me know if you find any loopholes in this post or if you have any questions!