
Understanding Bulk Indexing in Elasticsearch
Posted by Timo Goosen, May 22, 2018
Bulk indexing in Elasticsearch is an important topic to understand because you might occasionally need to write your own code to bulk index custom data. In addition, experience with bulk indexing is important when you need to understand performance issues with an Elasticsearch cluster.
When I started out with Elasticsearch, I found it very frustrating that there were no articles that provided a reference point for bulk indexing. I had to reference several articles and books before I figured out how simple it actually is to bulk index data.
Accordingly, if you want to understand how bulk indexing works under the hood in Elasticsearch, it will take a bit more effort than merely reading this article — although this is certainly a good place to start!
If you are planning to do bulk indexing, there are some important points to consider before you do anything. Normally, if you want an index with a particular mapping, you first have to create the index with that mapping. If you are using the bulk index API, you don’t have to create the index beforehand, because the index name is part of the data sent to Elasticsearch. That is, the index will be created automatically.
However, if you want to set a specific mapping, you will have to create the index with that mapping first. In addition, something else to consider is that you shouldn’t be doing any queries / searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.
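As a minimal sketch of creating an index with a mapping before bulk indexing, the request could be prepared in Python as follows. The helper name build_create_index_request, the host, and the example field mapping are assumptions for illustration; the request is only built here, not sent.

```python
import json
import urllib.request


def build_create_index_request(index, doc_type, properties,
                               host="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates an index with a mapping.

    Uses the Elasticsearch 6 layout, where mappings are keyed by type name.
    """
    body = {"mappings": {doc_type: {"properties": properties}}}
    return urllib.request.Request(
        url=f"{host}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name, type name, and mapping, used only as an example.
request = build_create_index_request(
    "testindex", "somerandomtype", {"somefield": {"type": "text"}}
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the body before anything reaches the cluster.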
When you index data using the bulk index API, the data needs to follow a specific structure, as seen in this example:
{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "1" } }
{ "somefield" : "value1" }
{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "2" } }
{ "somefield" : "hello hello hello" }
{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "3" } }
{ "somefield" : "Whoo WHoo hooo hooo hoooooooo hoooo" }
{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "4" } }
{ "somefield" : "Really need the water in Cape Town" }
If we decided to index this data into Elasticsearch, then we could do it from the command line with curl using the following command:
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary @request_example.json
Here is a screenshot of the output of running the above command:
Correspondingly, this is what the data looks like if we view it in Kibana:
This format is a bit hard to understand but I will explain it shortly. Once you understand this format and why it is used by Elasticsearch, it makes bulk indexing much easier. The format looks something like this:
{ action_to_be_performed: { metadata_related_to_action_performed } }\n
{ request_body_usually_data_to_be_indexed }\n
Here is an example:
1. The action to be performed here is index because we are going to index data. (In total, there are four action verbs to understand: create, index, update, and delete.)
{ "index": … }
2. The index we are indexing the data into is “testindex”. We don’t need to create the index beforehand because, as long as we specify the index name in the metadata of the action to be performed, the index will be created during the bulk indexing operation.
... { "_index": "testindex", ...
3. We need to specify the type by which the data will be indexed. In this case it will be of type “somerandomtype” (if you need to refresh your memory, see the Qbox article “All about the Sense Web Application”). If an Elasticsearch index is the equivalent of a database, then a “type” is the Elasticsearch equivalent of a table (I will return to this later).
..."_type": "somerandomtype" }...
4. Your actual data to index into Elasticsearch is called the request body.
{ "somefield" : "value1" }
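The structure described in the steps above can be checked mechanically before a file is sent to Elasticsearch. The following is a minimal sketch, not part of any Elasticsearch client; the function name check_bulk_lines is hypothetical. It assumes that a delete action has no request body line and that every other action is followed by exactly one.

```python
import json

ACTIONS = {"index", "create", "update", "delete"}


def check_bulk_lines(lines):
    """Return a list of (line_number, problem) for the lines of a bulk body."""
    problems = []
    expect_body = False
    for number, line in enumerate(lines, start=1):
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            # A broken body is followed by an action line; a broken action
            # line is assumed to be followed by its body.
            expect_body = not expect_body
            continue
        if expect_body:
            expect_body = False  # this line is the request body
            continue
        if (not isinstance(parsed, dict) or len(parsed) != 1
                or next(iter(parsed)) not in ACTIONS):
            problems.append((number, "expected an action line"))
            continue
        # delete is the only action that carries no request body.
        expect_body = next(iter(parsed)) != "delete"
    if expect_body:
        problems.append((len(lines), "action line has no request body"))
    return problems


example = [
    '{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "1" } }',
    '{ "somefield" : "value1" }',
]
print(check_bulk_lines(example))
```

A missing closing brace on an action line is reported as invalid JSON for that line, which is an easy mistake to make when writing bulk files by hand.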
Let’s put it all together so you can see what everything looks like. This is the metadata line and request body for one document to be indexed:
{ "index" : { "_index" : "testindex", "_type" : "somerandomtype", "_id" : "1" } }
{ "somefield" : "value1" }
...
This example shows an optional _id field in the metadata_related_to_action_performed section of the request. However, you can just specify the _type and the _index, and Elasticsearch will automatically generate the ID values.
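To make the optional _id concrete, here is a small sketch that serializes Python dictionaries into the bulk format, with or without explicit IDs. The function name build_bulk_body and its parameters are hypothetical and chosen only for this illustration.

```python
import json


def build_bulk_body(docs, index, doc_type, ids=None):
    """Serialize documents into the newline-delimited bulk format.

    If ids is given, each action line carries an explicit _id; otherwise
    the _id is omitted and Elasticsearch generates one.
    """
    lines = []
    for position, doc in enumerate(docs):
        metadata = {"_index": index, "_type": doc_type}
        if ids is not None:
            metadata["_id"] = ids[position]
        lines.append(json.dumps({"index": metadata}))
        lines.append(json.dumps(doc))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([{"somefield": "value1"}], "testindex", "somerandomtype")
print(body, end="")
```

Using json.dumps for each line guarantees that every document stays on a single line, which the bulk format depends on.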
Let’s expand our bulk indexing example a little bit without specifying _id, and using some more interesting example data for the request body. Here is our sample data to index:
{ "index" : { "_index" : "countries", "_type" : "country" } }
{ "country_name": "South Africa", "continent" : "Africa", "country_abbreviation": "ZA" }
{ "index" : { "_index" : "countries", "_type" : "country" } }
{ "country_name": "Germany", "continent" : "Europe", "country_abbreviation": "DE" }
{ "index" : { "_index" : "countries", "_type" : "country" } }
{ "country_name": "United States", "continent" : "America", "country_abbreviation": "USA" }
Now, I can index this data using curl like this (after first saving it in a file called country_bulk.json):
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary @country_bulk.json
The above command will produce the following output:
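A bulk response is a JSON object with a top-level errors flag and one entry per action under items. As a rough sketch of how a script might check it, the function below collects the items that failed. The function name failed_items and the sample response are hypothetical; the sample is abbreviated and is not the output of the command above.

```python
def failed_items(bulk_response):
    """Return (action, _id, error) for every failed item in a bulk response."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a one-key object such as {"index": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result["error"]))
    return failures


# Hypothetical, abbreviated response used only to exercise the function.
sample = {
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_index": "countries", "_id": "a1", "status": 201}},
        {"index": {"_index": "countries", "_id": "a2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse"}}},
    ],
}
print(failed_items(sample))
```

Checking the errors flag matters because a bulk request can return HTTP 200 even when individual actions in it failed.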
If you were using Elasticsearch prior to version 6 and you have some experience with bulk indexing data, then take note that in Elasticsearch 6 it is compulsory to specify the content type of the body. Therefore, always remember to specify the content type this way:
$ curl -s -H "Content-Type: application/x-ndjson" ...
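The same header applies when the request is made from code rather than curl. Below is a minimal sketch using only the Python standard library; the function name build_bulk_request and the default host are assumptions, and the request is built but not sent.

```python
import urllib.request


def build_bulk_request(payload, host="http://localhost:9200"):
    """Build (but do not send) a POST to _bulk with the NDJSON content type."""
    return urllib.request.Request(
        url=f"{host}/_bulk",
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


# Hypothetical one-document payload; a real script would read the bytes
# of a bulk file instead, e.g. open("country_bulk.json", "rb").read().
payload = (
    b'{ "index" : { "_index" : "countries", "_type" : "country" } }\n'
    b'{ "country_name": "Germany" }\n'
)
bulk_request = build_bulk_request(payload)
print(bulk_request.get_method(), bulk_request.full_url)
# Sending it would be: urllib.request.urlopen(bulk_request)
```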
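Next, let’s bulk index a raw dataset. Start by downloading the sample accounts.json file from the Elasticsearch repository:
$ wget https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true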
Rename it:
$ mv accounts.json\?raw\=true accounts.json
Remove the bulk index action and related metadata using AWK:
$ awk 'NR%2==0' accounts.json > accounts2.json
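The AWK expression keeps only the even-numbered lines, which are the documents; the odd-numbered lines are the bulk action lines. For readers who prefer Python, an equivalent sketch follows. The function name drop_action_lines and the sample lines are hypothetical stand-ins, not the real contents of accounts.json.

```python
def drop_action_lines(lines):
    """Keep every second line (2nd, 4th, ...), like awk 'NR%2==0'."""
    return [line for number, line in enumerate(lines, start=1) if number % 2 == 0]


# Hypothetical, shortened stand-ins for alternating action and document lines.
sample_lines = [
    '{"index":{"_id":"1"}}',
    '{"account_number":1}',
    '{"index":{"_id":"2"}}',
    '{"account_number":2}',
]
print(drop_action_lines(sample_lines))
```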
After these manipulations, accounts2.json should look like this:
We use the following script to help add the metadata to allow this data to be indexed via the bulk index API:
#!/usr/bin/env python3
# Print a bulk "index" action line before every document in accounts2.json.
filepath = 'accounts2.json'
metadata = '{ "index": { "_index": "bank_account", "_type": "account" }}'
with open(filepath, mode="r", encoding="utf-8") as my_file:
    for line in my_file:
        print(metadata)
        print(line.rstrip("\n"))
The line that uses rstrip is there to remove the trailing newline that each line read from the file already ends with. Without it, print, which appends a newline of its own, would output a blank line after every document.
Now you can run this script. Just save it as script.py and redirect its output to a file:
$ python3 script.py > out.json
Now let’s have a look at the output:
$ vim out.json
Here is a screenshot of the output that was produced:
You can now go ahead and index the data with the following command:
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary @out.json
Now we can retrieve two of the documents that we indexed via the bulk index API:
$ curl -XGET 'http://127.0.0.1:9200/bank_account/_search?pretty&size=2' | jq '.'
Below is the output that you can expect from running the command above:
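A search response nests the matching documents under hits.hits, with the original JSON of each document in _source. As a minimal sketch of pulling the documents out in Python, the function below works on an already parsed response. The function name extract_sources and the abbreviated sample response are hypothetical and do not reproduce real account data.

```python
def extract_sources(search_response):
    """Return the _source of every hit in a parsed search response."""
    return [hit["_source"] for hit in search_response["hits"]["hits"]]


# Hypothetical, abbreviated response used only to exercise the function.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "bank_account", "_type": "account", "_id": "x1",
             "_source": {"account_number": 1}},
            {"_index": "bank_account", "_type": "account", "_id": "x2",
             "_source": {"account_number": 2}},
        ]
    }
}
print(extract_sources(sample_response))
```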
Conclusion
In this guide, we explored how bulk indexing works in Elasticsearch and how to bulk index a raw dataset with Elasticsearch 6. (Note that ES 6 introduced a change that forces users to specify the content type when making a request that includes a request body.) We focused specifically on how the data should be formatted for bulk indexing and demonstrated bulk indexing operations on several datasets. We hope that, after working through this guide, you have learned how to bulk index a raw dataset and now have a better understanding of how bulk indexing works under the hood.