
Wednesday, April 17, 2013

Indexing into elasticsearch

This is a 10-minute tutorial that demonstrates how to build a Java app that indexes data into elasticsearch, and how to use curl to query the index.

You won't need to make any changes to the default elasticsearch configuration, but if you'd like to tweak a few options here and there, you might want to go through an earlier post on configuring elasticsearch.



Sample data

We'll index some superheroes along with their dates of birth and their favorite food!


Captain America,19400101,chicken biryaani
Hulk,20050420,kadai paneer
Shaktimaan,19801210,parle G
Ghost Rider,20011010,chicken biryaani


Save these lines as superheroes.csv somewhere convenient.

We'll call our index "myindex" and the type "superheroes".
Indexes and types in ES are analogous to databases and tables in SQL.


Where is my java client?

Building a Java client is fairly straightforward.

The JARs you'll need come packaged with the ES bundle. Pull them from your $ES_HOME\lib folder. For this example, we'll make do with the following:

elasticsearch-0.20.5
lucene-analyzers-3.6.2
lucene-core-3.6.2
lucene-highlighter-3.6.2
lucene-memory-3.6.2
lucene-queries-3.6.2

Or, if you're working with Maven, simply include the following dependency in your pom:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>0.20.6</version>
</dependency>

Version may differ for you, depending on what release you're on.

ES accepts documents as JSON, so we need to break each record up into key-value pairs. We can use opencsv to parse superheroes.csv and then convert each record to JSON. Include its jar, or add the following dependency to your pom:

<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.3</version>
</dependency>
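To see what opencsv is doing for us, here's a library-free sketch of the record-to-key/value split. (Plain String.split is enough only because our sample fields contain no embedded commas or quotes; opencsv handles those cases properly, which is why we use it for real input.)

```java
// Splits one CSV line into its three fields: name, dob, favFood.
// No quoting/escaping support -- a simplification that holds for our sample data.
public class CsvSplitSketch {

    public static String[] parse(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String[] document = parse("Shaktimaan,19801210,parle G");
        System.out.println("name=" + document[0]
            + ", dob=" + document[1]
            + ", favFood=" + document[2]);
    }
}
```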

Once you have that ready, there are several ways to generate the JSON document ES consumes.
You can manually construct a JSON string, use a Map which the API translates to JSON automatically, use Jackson, or use the XContentFactory.jsonBuilder() packaged with ES. Since this tutorial is all ES, we'll use the last option, jsonBuilder. (Totally random decision, no performance considerations.)

CSVReader reader = new CSVReader(
    new FileReader(INPUT_FILENAME));
List<String[]> records = reader.readAll();

for (String[] document : records) {
    XContentBuilder builder = jsonBuilder()
        .startObject()
            .field("name", document[0])
            .field("dob", document[1])
            .field("favFood", document[2])
        .endObject();

    client.prepareIndex(INDEX_NAME, TYPE)
        .setSource(builder)
        .execute()
        .actionGet();
}
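For comparison, the first alternative mentioned above, manually constructing the JSON string, would look something like this sketch. (It does no escaping of quotes or backslashes; jsonBuilder and Jettison take care of that for you, which is why they're preferable for anything beyond toy data.)

```java
// Hand-rolls the JSON document for one record.
// Works for our simple sample data, but performs no escaping of
// special characters -- a deliberate simplification for illustration.
public class ManualJson {

    public static String toJson(String[] document) {
        return "{\"name\":\"" + document[0]
            + "\",\"dob\":\"" + document[1]
            + "\",\"favFood\":\"" + document[2] + "\"}";
    }

    public static void main(String[] args) {
        String[] document = {"Hulk", "20050420", "kadai paneer"};
        // The resulting string can be passed straight to setSource.
        System.out.println(toJson(document));
    }
}
```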

We can either start a node and connect to the cluster, or use the transport client to connect to our cluster remotely. Starting a node would be overkill for this example, as it would become part of the cluster and the two would probably start chatting. So we'll use the transport client and connect on port 9300, the default port ES uses for TCP connections.

TransportClient client = new TransportClient(
    ImmutableSettings
    .settingsBuilder()
    .put("cluster.name", "elasticsearch")
    .build())
    .addTransportAddresses(new InetSocketTransportAddress(
        "localhost", 9300));

A lot more settings can be configured besides just the cluster name I've assigned here, but that's perhaps a separate post. Anyhow, we prepare our index by passing it the index name, type name and an ID for the document. The ID is not mandatory; if not specified, ES automatically generates one at indexing time.

IndexResponse response = client
    .prepareIndex(INDEX_NAME, TYPE, ID)
    .setSource(builder)
    .execute()
    .actionGet();

Another way, which is how I mostly do it out of convenience, is to use Jettison to construct a JSON object and add the document's fields to it. Stringify this JSON and pass the string to the setSource method.

JSONObject jsonObject = new JSONObject();
jsonObject.put("name", document[0]);
jsonObject.put("dob", document[1]);
jsonObject.put("favFood", document[2]);
String jsonString = jsonObject.toString();

client.prepareIndex(INDEX_NAME, TYPE)
    .setSource(jsonString)
    .execute()
    .actionGet();

Adding up all the snippets and filling in the gaps, we get the following class:

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.xcontent.XContentBuilder;

import au.com.bytecode.opencsv.CSVReader;

public class MyIndexer {

 private static final String INPUT_FILENAME = "<path-to-your-inputfile>\\superheroes.csv";
 private static final String CLUSTER_NAME = "elasticsearch";
 private static final String INDEX_NAME = "myindex";
 private static final String TYPE = "superheroes";

 
 public static void main(String[] args) {

  TransportClient client = new TransportClient(ImmutableSettings
    .settingsBuilder()
    .put("cluster.name", CLUSTER_NAME)
    .build())
    .addTransportAddresses(new InetSocketTransportAddress("localhost", 9300));
  
  
  try {
   CSVReader reader = new CSVReader(new FileReader(INPUT_FILENAME));
   List<String[]> records = reader.readAll();
   
   for (String[] document : records){
    XContentBuilder builder = jsonBuilder()
      .startObject()
       .field("name", document[0])
       .field("dob", document[1])
       .field("favFood", document[2])
      .endObject();
    
    client.prepareIndex(INDEX_NAME, TYPE).setSource(builder).execute().actionGet();
   }
   
   reader.close();
   client.close();
  
  } catch (FileNotFoundException e) {
   e.printStackTrace();
  } catch (IOException e) {
   e.printStackTrace();
  }
 }
}


Starting the server

Starting with a fresh ES install, and as mentioned at the beginning of this post, we're not going to alter the default configuration it provides. Open a command prompt, browse to $ES_HOME\bin and run the elasticsearch bat file. You'll see some logs splashing information about the ports on which ES is listening for HTTP/TCP connections, the name of the cluster, and finally a message notifying you that your node has started.

Note that if you haven't specified a cluster name explicitly in your configuration file, it defaults to 'elasticsearch'.


Searching the index

Run the client application, and it'll index our sample documents from superheroes.csv. Then use curl to query your newly created index. ES's query engine is really powerful and you can build a lot of complex queries with it. Here are a few to help you get acquainted, if you aren't already.

# 1. Retrieve all documents, 10 at a time (get all)
curl -XGET localhost:9200/myindex/superheroes/_search

# 2. Find superheroes based on name (term query)
curl -XGET localhost:9200/myindex/superheroes/_search -d '
{
  "query" : {
    "term" : { "name" : "hulk" }
  }
}'

# 3. Find superheroes who like chicken biryaani
# (full text match)
curl -XGET localhost:9200/myindex/superheroes/_search -d '
{
  "query" : {
    "match_phrase" : {
      "favFood" : "chicken biryaani"
    }
  }
}'

# 4. Find superheroes born in the 20th century (range query)
curl -XGET localhost:9200/myindex/superheroes/_search -d '
{
  "query" : {
    "range" : {
      "dob" : {
        "from" : "19000101",
        "to" : "19991231"
      }
    }
  }
}'
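A note on why that last query works: we indexed dob as a plain string, and fixed-width yyyyMMdd strings happen to sort lexicographically in the same order as the dates they encode, so a string range comparison gives correct results (assuming dob was indeed mapped as a string and not auto-detected as a date). A quick plain-Java sanity check of the same predicate against our sample data:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates that lexicographic comparison of fixed-width yyyyMMdd
// strings orders them chronologically, which is what the range query relies on.
public class RangeCheck {

    public static boolean inRange(String dob, String from, String to) {
        return dob.compareTo(from) >= 0 && dob.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String[][] heroes = {
            {"Captain America", "19400101"},
            {"Hulk", "20050420"},
            {"Shaktimaan", "19801210"},
            {"Ghost Rider", "20011010"}
        };
        List<String> born20thCentury = new ArrayList<String>();
        for (String[] hero : heroes) {
            if (inRange(hero[1], "19000101", "19991231")) {
                born20thCentury.add(hero[0]);
            }
        }
        System.out.println(born20thCentury);  // prints [Captain America, Shaktimaan]
    }
}
```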

Some of the important fields to note in the result are 'total' (under the 'hits' JSON object), which tells you the total number of documents that matched, and '_source' (under the 'hits' array), which houses the matched document.

Hope this was helpful!
