Tuesday, June 18, 2013

What are mappings in elasticsearch

Elasticsearch defaults most of the parameters for you to reduce setup time, and you can change things as and when the need arises. So far we haven't bothered about the document schema and have been pushing in documents to be indexed as they are. We've let ES process the fields of our documents as it sees fit, and it has done a really good job thanks to its neat set of defaults. But of course, as things start getting complicated, one craves greater control.


Scenario 1

We decide to use ES for what it's really meant for: a damn good indexing tool! So we relieve it of the burden of storing entire documents and use something like MongoDB as our document store instead. We still index entire documents in ES, but make it store just the ids. So when we query against ES, only the ids of matched documents are returned instead of the entire documents, and we use these ids to retrieve the documents from MongoDB. Sounds neat, but how do we convey this to ES?
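
The two-step lookup described above can be sketched in plain Java. This is only an illustration under loud assumptions: two in-memory maps stand in for ES (which hands back ids) and MongoDB (which holds the full documents), and the class and method names are made up for this sketch rather than taken from either product's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for ES and MongoDB.
class TwoStepLookup {
  // Stand-in for the search index: term -> ids of matching documents.
  private final Map<String, List<String>> idsByTerm = new HashMap<String, List<String>>();
  // Stand-in for the document store: id -> full document.
  private final Map<String, String> documentsById = new HashMap<String, String>();

  // Keep the full document in the store, but only its id in the index.
  void add(String id, String document, String... terms) {
    documentsById.put(id, document);
    for (String term : terms) {
      List<String> ids = idsByTerm.get(term);
      if (ids == null) {
        ids = new ArrayList<String>();
        idsByTerm.put(term, ids);
      }
      ids.add(id);
    }
  }

  // Step 1: the index answers a query with ids only.
  List<String> searchIds(String term) {
    List<String> ids = idsByTerm.get(term);
    return ids == null ? new ArrayList<String>() : ids;
  }

  // Step 2: the ids are used to load the full documents from the store.
  List<String> fetchDocuments(List<String> ids) {
    List<String> documents = new ArrayList<String>();
    for (String id : ids) {
      documents.add(documentsById.get(id));
    }
    return documents;
  }

  public static void main(String[] args) {
    TwoStepLookup lookup = new TwoStepLookup();
    lookup.add("1", "{\"name\":\"Iron Man\"}", "iron", "man");
    lookup.add("2", "{\"name\":\"Spider Man\"}", "spider", "man");
    System.out.println(lookup.fetchDocuments(lookup.searchIds("iron")));
  }
}
```

The point of the split is that the index only ever has to carry ids, while the heavier documents live in the store.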

Scenario 2

Our documents are fairly huge and we do not plan on querying against all their fields. That means we do not need to index the entire document, just the parts we plan on querying. How do we tell ES to skip some fields and index the others?


Mappings

Creating mappings is how we make it happen! A mapping is the way to tell ES how it should treat the fields of our documents when indexing and searching. Let's play with its API a bit and see what cool things we can squeeze out of it!


Providing some sort of type safety to our fields

When we create an empty index, no mappings are created. Upon insertion of the first document the mapping is automatically created for us. ES tries to make intelligent guesses about the types of the fields of the document.
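
As a rough mental model of that guessing (and only a model, not ES's actual code), the rule the mapping output later in this post bears out is: an unquoted whole number becomes 'long' and a quoted value becomes 'string'. The 'double' and 'boolean' branches below are my assumption about the remaining JSON value kinds, and the sketch ignores extras such as date detection on strings.

```java
// Simplified, hypothetical model of the default type guess for a JSON value.
class DynamicTypeGuess {
  static String guessType(Object jsonValue) {
    if (jsonValue instanceof Integer || jsonValue instanceof Long) {
      return "long";    // unquoted whole number, e.g. dob : 20130101
    }
    if (jsonValue instanceof Float || jsonValue instanceof Double) {
      return "double";  // assumption: not exercised in this post
    }
    if (jsonValue instanceof Boolean) {
      return "boolean"; // assumption: not exercised in this post
    }
    if (jsonValue instanceof String) {
      return "string";  // anything in quotes, e.g. tenure : "25"
    }
    throw new IllegalArgumentException("value kind not covered by this sketch: " + jsonValue);
  }

  public static void main(String[] args) {
    System.out.println("dob    -> " + guessType(20130101));
    System.out.println("tenure -> " + guessType("25"));
  }
}
```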

Let's wipe off 'myindex' that we've been playing around with so far and start with a clean slate.


$ curl -XDELETE localhost:9200/myindex/

#Result
{"ok":true,"acknowledged":true}

Inserting a fresh document

$ curl -XPOST localhost:9200/myindex/superheroes/ -d '
{
  name : "Iron Man",
  dob : 20130101,
  favFood : "iron ore",
  tenure : "25"
}'

#Result
{"ok":true,"_index":"myindex","_type":"superheroes",
"_id":"nKIx313SSAu9ZxW-Lyl6dA","_version":1}

We've added another field this time to our superheroes type, namely tenure, which keeps track of the years of service rendered as a superhero. Even though we deleted the index and type, the POST of a new document creates them again with default mappings. Let's see what mappings it has created for us.

$ curl -XGET localhost:9200/myindex/superheroes/_mapping?pretty=true
{
  "superheroes" : {
    "properties" : {
      "dob" : {
        "type" : "long"
      },
      "favFood" : {
        "type" : "string"
      },
      "name" : {
        "type" : "string"
      },
      "tenure" : {
        "type" : "string"
      }
    }
  }
}

Hmm, interesting! So ES treats numbers as 'long' if they aren't enclosed within quotes (dob); otherwise it treats them as strings (tenure). Let's correct that and make ES treat the tenure field as a long too. We use the update mapping API for that.

# The top-level key (superheroes) is the type name
$ curl -XPUT localhost:9200/myindex/superheroes/_mapping -d '
{
 superheroes : {
  properties : {
   tenure : { type : "long" }
  }
 }
}'

#Result
{"error":"MergeMappingException[Merge failed with failures
{[mapper [tenure] of different type, current_type [string],
merged_type [integer]]}]","status":400}

Grrr! So ES won't allow us to change the type once set! Alright, no harm done, we'll just wipe off our index and recreate it, with no quotes around the tenure value this time.

$ curl -XDELETE localhost:9200/myindex/
$ curl -XPOST localhost:9200/myindex/superheroes/ -d '{
name : "Iron Man",
dob : 20130101,
favFood : "iron ore",
tenure : 10
}'
$ curl -XGET localhost:9200/myindex/superheroes/_mapping?pretty=true
{
  "superheroes" : {
    "properties" : {
      "dob" : {
        "type" : "long"
      },
      "favFood" : {
        "type" : "string"
      },
      "name" : {
        "type" : "string"
      },
      "tenure" : {
        "type" : "long"
      }
    }
  }
}

Cool, we did it! Tenure is 'long'. But we know that tenure isn't going to be a huge number, as its value would not exceed 100 years assuming our superheroes are mortals. So we'd rather have it as an integer. We cannot leave that to ES's default mapping guesses, though, as ES treats any whole number as a 'long' by default. We need to specify it ourselves! So we delete our index for the last time and create the mappings before inserting our document.

$ curl -XDELETE localhost:9200/myindex/
$ curl -XPOST localhost:9200/myindex

We created the index separately this time because an index must exist before mappings can be applied to it. ES won't create one automatically if it does not exist when we apply our mappings (not so sure about this part, will check and update here). There is a workaround for that, but let's just stick to this flow for now. And while we're at it, let's tell ES to treat our dob field as a date!

$ curl -XPUT localhost:9200/myindex/superheroes/_mapping -d '
{
  superheroes : {
    properties : {
      name : { type : "string" },
      dob : { type : "date" },
      favFood : { type : "string" },
      tenure : { type : "integer" }
    }
  }
}'

#Result
{"ok":true,"acknowledged":true}

$ curl -XGET localhost:9200/myindex/superheroes/_mapping?pretty=true
{
  "superheroes" : {
    "properties" : {
      "dob" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "favFood" : {
        "type" : "string"
      },
      "name" : {
        "type" : "string"
      },
      "tenure" : {
        "type" : "integer"
      }
    }
  }
}

So much drama, but for what? I am going to insert a document in which the type of one of the fields does not adhere to what we've specified in the mapping, and see if ES raises an alarm.

# tenure is deliberately sent as a string here
$ curl -XPOST localhost:9200/myindex/superheroes/ -d '{
  name : "Spider Man",
  dob : 20130101,
  favFood : "iron ore",
  tenure : "10"
}'

#Result
{"ok":true,"_index":"myindex","_type":"superheroes",
"_id":"YqyTJGvsRbSs9WEFZAUgnA","_version":1}

Strange! Then what's the point of specifying a field type as integer if it's casually going to accept a string too? If that is the case, it wouldn't be surprising to see it accept "hello world" as a tenure value too!

$ curl -XPOST localhost:9200/myindex/superheroes/ -d '{
  name : "Super Man",
  dob : 20130101,
  favFood : "hash brownies",
  tenure : "hello world"
}'

#Result
{"error":"MapperParsingException[failed to parse [tenure]];
nested: NumberFormatException[For input string: \"hello 
world\"]; ","status":400}

Huh!? o_O
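
The NumberFormatException nested in that error is the same exception Java's own Integer.parseInt throws, so both results can be imitated in a few lines of plain Java. This is a sketch of the observed behaviour only: coerceToInteger is a name invented here, and nothing below is ES's actual parsing code.

```java
// Hypothetical sketch of how an integer field treats incoming values.
class TenureCoercion {
  static int coerceToInteger(Object value) {
    if (value instanceof Number) {
      return ((Number) value).intValue();           // already numeric, e.g. tenure : 10
    }
    // A string is accepted only if it parses as an integer, e.g. tenure : "10".
    // Anything else makes parseInt throw NumberFormatException.
    return Integer.parseInt(String.valueOf(value));
  }

  public static void main(String[] args) {
    System.out.println(coerceToInteger(10));
    System.out.println(coerceToInteger("10"));
    try {
      coerceToInteger("hello world");
    } catch (NumberFormatException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

So "10" is accepted because it parses cleanly, and "hello world" is rejected because it does not.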

What have we learnt so far:-

  1. Once a field's type is set in the mapping, you cannot change it. You can still add mappings for new fields, which we haven't tried here, but I am sure it should work. So if we need to change a field's type, we recreate our index.
  2. You can supply a string for an integer type field, and ES will parse it if it really is parsable, i.e. an integer disguised as a string. A string like 'hello world' is not parsable as a number and will be rejected. You can experiment with other types too and add conclusions to this small list.
We'll talk more about types maybe in a separate post. Moving on!


Removing non searchable fields from the indexing process

By default all fields are indexed. Suppose our application does not wish to ever search on the superhero's tenure. We simply let ES know that we'd like to skip tenure when indexing documents.

$ curl -XPUT localhost:9200/myindex/superheroes/_mapping -d '
{
  superheroes : {
    properties : {
      name : { type : "string" },
      dob : { type : "date" },
      favFood : { type : "string" },
      tenure : { type : "integer", index : "no" }
    }
  }
}'

#Result
{"ok":true,"acknowledged":true}

$ curl -XPOST localhost:9200/myindex/superheroes/ -d '{
  name : "Spider Man",
  dob : 20130101,
  favFood : "Wada pao",
  tenure : 10
}'

Since there is no index on tenure, we won't be able to query on it now.

$ curl -XGET localhost:9200/myindex/superheroes/_search?pretty=true -d '
{
query : { term : { tenure : 10 } }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Indeed, the query did not fetch the 'Spider Man' document! You can set a field as not_analyzed too, but again, we'll deal with that in a later post.
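
Why the query comes back empty is easier to see with a toy inverted index. The sketch below is purely illustrative and assumes nothing about ES internals: SelectiveIndex, doNotIndex and term are names made up here. Fields marked as not indexed never make it into the lookup table, so a term query on them has nothing to match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index that skips fields marked as not indexed.
class SelectiveIndex {
  private final Set<String> unindexedFields = new HashSet<String>();
  // "field=value" -> ids of the documents holding that value
  private final Map<String, List<String>> postings = new HashMap<String, List<String>>();

  // Plays the role of index : "no" in the mapping.
  void doNotIndex(String field) {
    unindexedFields.add(field);
  }

  void add(String id, Map<String, String> document) {
    for (Map.Entry<String, String> entry : document.entrySet()) {
      if (unindexedFields.contains(entry.getKey())) {
        continue; // skipped fields never reach the lookup table
      }
      String key = entry.getKey() + "=" + entry.getValue();
      List<String> ids = postings.get(key);
      if (ids == null) {
        ids = new ArrayList<String>();
        postings.put(key, ids);
      }
      ids.add(id);
    }
  }

  // Exact-match lookup, loosely comparable to a term query.
  List<String> term(String field, String value) {
    List<String> ids = postings.get(field + "=" + value);
    return ids == null ? new ArrayList<String>() : ids;
  }

  public static void main(String[] args) {
    SelectiveIndex index = new SelectiveIndex();
    index.doNotIndex("tenure");
    Map<String, String> spiderMan = new HashMap<String, String>();
    spiderMan.put("name", "Spider Man");
    spiderMan.put("tenure", "10");
    index.add("1", spiderMan);
    System.out.println("name hits:   " + index.term("name", "Spider Man"));
    System.out.println("tenure hits: " + index.term("tenure", "10"));
  }
}
```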


Storing specific fields

By default ES stores the entire document we index in the '_source' field. We can suppress this behavior and specify which fields we want ES to store.

$ curl -XPUT localhost:9200/myindex/superheroes/_mapping -d '
{
  superheroes : { 
    "_source" : { 
      "enabled" : false
    },
    properties : {
      name : { type : "string" , store : "yes"},
      dob : { type : "date" },
      favFood : { type : "string" },
      tenure : { type : "integer" }
    } 
  }
}'

$ curl -XPOST localhost:9200/myindex/superheroes/ -d '{
  name : "Iron Man",
  dob : 20130101,
  favFood : "iron ore",
  tenure : 10
}'

#Result
{"ok":true,"_index":"myindex","_type":"superheroes",
"_id":"6rZTLmniQeGJiAh_ThqvZw","_version":1}

Now let me point out something here. Stored fields are not returned by default; we need to ask for them explicitly in our queries. If we query for the 'Iron Man' document and ask for some fields, ES won't give us the source field in the result (as we've disabled it in the above snippet). Only the '_id' field will be returned, along with the stored fields we've asked for.

$ curl -XGET localhost:9200/myindex/superheroes/_search?pretty=true -d '
{
  fields : [ "name" , "dob"],
  query : { 
    term : { tenure : 10 }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "superheroes",
      "_id" : "6rZTLmniQeGJiAh_ThqvZw",
      "_score" : 1.0,
      "fields" : {
        "name" : "Iron Man"
      }
    }]
  }
}

For obvious reasons 'dob' wasn't returned, as we did not specify in the mapping that it should be stored.

Keep in mind that there is a trade-off here. You can keep the entire source instead of storing individual fields, in which case you end up with a larger index but save time on retrievals, since there is just one disk seek involved in retrieving the source field. And with the '_source' in memory, we can extract individual fields using the dot notation, which is a rather quick operation. However, when individually stored fields are asked for in a query, every field requires its own disk seek. So depending upon the size of your documents and the number of fields you want in your results, you need to decide whether to store the entire document source or just the required individual fields.
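
To give a feel for the dot-notation extraction mentioned above, here is a minimal sketch that walks an already-parsed source held in memory as nested maps. The extract helper and the nested 'base.city' field are hypothetical, made up for this illustration; our superheroes documents are flat and ES's own implementation is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pull a field out of an in-memory source by dotted path.
class SourceFields {
  static Object extract(Map<String, Object> source, String path) {
    Object current = source;
    for (String key : path.split("\\.")) {
      if (!(current instanceof Map)) {
        return null; // the path goes deeper than the document does
      }
      current = ((Map<?, ?>) current).get(key);
    }
    return current;
  }

  public static void main(String[] args) {
    Map<String, Object> base = new HashMap<String, Object>();
    base.put("city", "Malibu"); // hypothetical nested field
    Map<String, Object> source = new HashMap<String, Object>();
    source.put("name", "Iron Man");
    source.put("base", base);

    System.out.println(extract(source, "name"));
    System.out.println(extract(source, "base.city"));
  }
}
```

Once the source is in memory, each lookup is just a handful of map reads, which is why pulling several fields out of it is cheap compared with fetching each stored field separately.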


Using the Java API to set mappings

The curl and Java snippets underneath produce the same mappings.

$ curl -XPUT localhost:9200/myindex/superheroes/_mapping -d '
{
 superheroes : {
  "_source" : { "enabled" : false },
  properties : {
   name : { type : "string" , store : "yes"},
   dob : { type : "date" },
   favFood : { type : "string" },
   tenure : { type : "integer" }
  }
 }
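}'

import java.io.IOException;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class ESMapper {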
  private Client client;
 
  public Client getClient() {
    return client;
  }
  public void setClient(Client client) {
    this.client = client;
  }
 
  public ESMapper(){
    setClient(new TransportClient().addTransportAddress(new InetSocketTransportAddress("localhost", 9300)));
  }

  public XContentBuilder buildMappings() throws IOException{
    XContentBuilder mappings = XContentFactory.jsonBuilder()
      .startObject()
        .startObject("superheroes")
          .startObject("_source")
            .field("enabled", false)
          .endObject() 
          .startObject("properties")
            .startObject("name")
              .field("type", "string")
              .field("store", "yes")
            .endObject()
            .startObject("dob")
              .field("type", "date")
            .endObject()
            .startObject("favFood")
              .field("type", "string")
            .endObject()
            .startObject("tenure")
              .field("type", "integer")
            .endObject()
          .endObject()
        .endObject()
      .endObject();
    return mappings;
  }

  public static void main(String[] args) throws IOException {
    ESMapper esMapper = new ESMapper();
    Client client = esMapper.getClient();
 
    // Create the index and register the mapping for the type in one call
    client.admin().indices().prepareCreate("myindex")
        .addMapping("superheroes", esMapper.buildMappings())
        .execute().actionGet();
    client.close();
  }
}

I skipped a few things here and there, but I'll surely take them up later. This should give you a basic understanding of what mappings are. Do try out the examples here, and if you face issues or spot any goof-ups, drop a comment below!
