How we reindexed 36 billion documents in 5 days within the same Elasticsearch cluster

https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8


Elasticsearch – ABCD

Step 1: Install Elasticsearch
Step 2: Install the Sense plugin for Chrome
Step 3: Try the following requests to get a feel for Elasticsearch (copy and paste them into the left-hand editor pane of the Sense screen)

PUT /customer?pretty

GET /_cat/indices?v

PUT /customer/external/1
{
  "first-name": "John",
  "last-name": "Brown"
}

PUT /customer/external/2
{
  "first-name": "John",
  "last-name": "White"
}

PUT /customer/external/3
{
  "first-name": "John",
  "last-name": "Johny"
}

PUT /customer/external/4
{
  "first-name": "Johnathan",
  "last-name": "Smith"
}

PUT /customer/external/5
{
  "first-name": "JohnyJohny",
  "last-name": "YesPapa"
}

PUT /customer/external/6
{
  "first-name": "John",
  "last-name": "White Paper"
}

PUT /customer/external/7
{
  "first-name": "John",
  "last-name": ""
}

GET /customer/external/2

DELETE /customer

GET /_cat/indices?v

POST /customer/external/1/_update
{
  "doc": { "name": "Jane Doe" }
}

POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe", "age": 20 }
}

POST /customer/external/1/_update?pretty
{
  "script": "ctx._source.age += 5"
}

GET /_nodes/process?pretty

———————————-

GET /customer/_search
{
  "query": {
    "match": {
      "first-name": "John"
    }
  }
}
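
To see why this match query returns some of the seven documents and not others, here is a rough Python sketch of the matching step. It assumes the seven documents are present in the index and that "first-name" is analyzed with the default standard analyzer; `toy_analyze` and `toy_match` are hypothetical helpers that only approximate that behaviour for these simple values, and relevance scoring is ignored entirely.

```python
import re

# The seven documents from the walkthrough (id -> "first-name" value).
FIRST_NAMES = {
    1: "John",
    2: "John",
    3: "John",
    4: "Johnathan",
    5: "JohnyJohny",
    6: "John",
    7: "John",
}


def toy_analyze(text):
    """Rough stand-in for the standard analyzer (an assumption):
    lowercase the text and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def toy_match(field_values, query_text):
    """Return the ids whose analyzed field shares at least one whole token
    with the analyzed query text."""
    query_tokens = set(toy_analyze(query_text))
    return sorted(
        doc_id
        for doc_id, value in field_values.items()
        if query_tokens & set(toy_analyze(value))
    )


matching_ids = toy_match(FIRST_NAMES, "John")
print(matching_ids)  # [1, 2, 3, 6, 7]
```

Documents 4 and 5 are left out because "johnathan" and "johnyjohny" are each a single token that is not equal to "john": a match query compares whole tokens, not prefixes.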

GET /customer/_stats/


Lucene / Elasticsearch Analyzers

In Lucene, an analyzer is a combination of a tokenizer (splitter), a stemmer and a stopword filter.

In Elasticsearch, an analyzer is a combination of:

1. Character filters (zero or more): “tidy up” a string before it is tokenized. Example: remove HTML tags.
2. Tokenizer: an analyzer MUST have exactly one tokenizer. It breaks the string up into individual terms or tokens.
3. Token filters (zero or more): change, add or remove tokens. A stemmer is a token filter; it reduces a word to its base form, for example: “happy”, “happiness” => “happi” (see the Snowball demo).
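
The three stages above can be sketched as a small pipeline in plain Python. This is only a toy model to show the order in which the stages run, not how Lucene or Elasticsearch implement them: every function name here is made up for illustration, the stopword list is a tiny sample, and `toy_stem` is a crude suffix stripper that happens to reproduce the “happi” example, not the Snowball algorithm.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "of"}  # tiny illustrative list (assumption)


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the cleaned string into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: change tokens (lowercase them)."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: remove tokens that are in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(tokens):
    """Token filter: crude suffix stripping, NOT the real Snowball stemmer."""
    stemmed = []
    for t in tokens:
        if t.endswith("ness") and len(t) > 5:
            t = t[:-4]
        if t.endswith("y") and len(t) > 2:
            t = t[:-1] + "i"
        stemmed.append(t)
    return stemmed


def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords, toy_stem):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<p>The Happiness of a happy dog</p>")
print(tokens)  # ['happi', 'happi', 'dog']
```

Because “happy” and “happiness” end up as the same token, a search for one can find documents containing the other, which is the point of putting a stemmer in the token filter chain.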

Reference:
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
http://stackoverflow.com/questions/12836642/analyzers-in-elasticsearch

Demo:
http://snowball.tartarus.org/demo.php

Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html

All About Analyzers:
https://www.elastic.co/blog/found-text-analysis-part-1
https://www.elastic.co/blog/found-text-analysis-part-2

Testing Lucene Analyzers with elasticsearch
http://jontai.me/blog/2012/10/testing-lucene-analyzers-with-elasticsearch/
“Here’s an awesome plugin on github repo. It’s somewhat extension of Analyze API. Found it on official elastic plugin list.

What’s great is that it shows tokens with all their attributes after every single step. With this it is easy to debug analyzer configuration and see why we got such tokens and where we lost the ones we wanted.”
https://github.com/johtani/elasticsearch-extended-analyze
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
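
As a minimal sketch of calling the Analyze API from a script rather than from Sense, the Python below builds a request for the `_analyze` endpoint and pulls the token strings out of a response. It assumes a node at http://localhost:9200 (the address used elsewhere in these notes) and the built-in "standard" analyzer; `build_analyze_request` and `extract_tokens` are hypothetical helper names, and the exact request options vary between Elasticsearch versions, so check the reference page above for yours.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_json):
    """Collect the token strings from an _analyze response body."""
    return [entry["token"] for entry in response_json.get("tokens", [])]


request = build_analyze_request("Happy happiness")

# Sending the request needs a running node, so it is left commented out:
# with urllib.request.urlopen(request) as resp:
#     print(extract_tokens(json.load(resp)))
```

Swapping the `analyzer` argument for the name of a custom analyzer is a quick way to check what tokens that analyzer produces for a given string.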

Elasticsearch GUI

1) Elasticsearch GUI: https://github.com/jettro/elasticsearch-gui – Apache 2.0 License

http://localhost:9200/_plugin/gui/index.html#/dashboard

2) elasticsearch-head – https://mobz.github.io/elasticsearch-head/ – Apache 2.0 License

http://localhost:9200/_plugin/head/

The Sense Chrome plugin is also very useful.

3) Elasticsearch HQ – Not recommended because of the following clause:

“We don’t store PII(Personally Identifiable Information). From time to time, the software will collect anonymous usage data. None of it is being sold to anyone, so chill. The data collected gives us the information we need to customize the software for our users.”

Reference: https://github.com/royrusso/elasticsearch-HQ/blob/master/index.html
They should have provided an option to disable this feature.

4) Bigdesk – https://github.com/lukas-vlcek/bigdesk
No active development since 2014, and no release for the latest ES versions.