Search Analyzers
Simple Search Analyzer
access in search queries: simple
The simple search analyzer is a custom Picturepark implementation that does not use the Elasticsearch defaults. The custom analyzer splits terms using the following regex:
Regex
*/"(\[^\\p\{L\}\\d\]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=\[\\p\{L\}&&\[^\\p\{Lu\}\]\])(?=\\p\{Lu\})|(?<=\\p\{Lu\})(?=\\p\{Lu\}\[\\p\{L\}&&\[^\\p\{Lu\}\]\])"/*
Outcome:
Lowercase / Uppercase
Digit / non-digit
Stemming
HTML Strip
Examples
Picturepark = Picturepark, picturepark
Case Study = Case, Study, case, study
If you want to test the simple search analyzer, you can check your terms in a regex tester to see the outcome.
Open a regex checker
Add your term as a test string
Check the outcome
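The split behaviour can also be approximated in code. Below is a minimal sketch in Python using the standard re module; it uses simplified ASCII character classes instead of the Unicode-aware classes in the regex above, and it does not cover stemming or HTML stripping.
Python sketch
import re

# Split points roughly matching the analyzer regex: runs of non-alphanumeric
# characters, digit/non-digit boundaries, and camelCase transitions.
SPLIT = re.compile(
    r"[^A-Za-z0-9]+"               # separator characters
    r"|(?<=[^0-9])(?=[0-9])"       # non-digit followed by digit
    r"|(?<=[0-9])(?=[^0-9])"       # digit followed by non-digit
    r"|(?<=[a-z])(?=[A-Z])"        # lowercase followed by uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z])"   # e.g. "HTMLPage" -> "HTML", "Page"
)

def simple_tokens(text: str) -> list[str]:
    # Keep both the original and the lowercased form of every token,
    # mirroring the examples above (Picturepark = Picturepark, picturepark).
    parts = [part for part in SPLIT.split(text) if part]
    return sorted({*parts, *(part.lower() for part in parts)})

print(simple_tokens("Picturepark"))  # ['Picturepark', 'picturepark']
print(simple_tokens("Case Study"))   # ['Case', 'Study', 'case', 'study']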
Path Hierarchy Analyzer
access in search queries: pathHierarchy
The path hierarchy analyzer will:
Take a path found in a field (for example picturepark\platform\manual) and split it at the delimiter into the individual hierarchy terms
Examples
picturepark\platform\manual = picturepark, picturepark\platform, picturepark\platform\manual
Products/Family/Industry = Products, Products/Family, Products/Family/Industry
You should only configure this analyzer if it is used via the API. The simple search in Picturepark escapes special characters, so you will not find assets when searching for some of the tokens generated by this analyzer.
An example can be found in the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
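As a rough illustration, the tokenization can be sketched as producing every ancestor prefix of a path. This minimal Python sketch assumes a configurable delimiter; the actual behaviour depends on the field and tokenizer configuration (the Elasticsearch tokenizer also supports options such as reverse and skip).
Python sketch
def path_hierarchy_tokens(value: str, delimiter: str = "/") -> list[str]:
    # Return every ancestor prefix of the path, shortest first.
    parts = value.split(delimiter)
    return [delimiter.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens("Products/Family/Industry"))
# ['Products', 'Products/Family', 'Products/Family/Industry']
print(path_hierarchy_tokens(r"picturepark\platform\manual", delimiter="\\"))
# ['picturepark', 'picturepark\\platform', 'picturepark\\platform\\manual']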
Edge NGram Analyzer
access in search queries: edgeNGram
This tokenizer is very similar to ngram but only keeps n-grams that start at the beginning of a token. Settings allow you to define the minimum and maximum gram lengths created on indexing, as well as token_chars, the character classes to keep in tokens; Elasticsearch splits on characters that do not belong to any of these classes.
Examples are in the Elasticsearch documentation:
Edge n-gram tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
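As a minimal sketch of the idea, the following Python function builds edge n-grams for a single token, assuming min_gram 1 and max_gram 5; the real tokenizer additionally filters characters by token_chars.
Python sketch
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 5) -> list[str]:
    # Keep only the n-grams anchored at the start of the token (its prefixes).
    longest = min(max_gram, len(token))
    return [token[:length] for length in range(min_gram, longest + 1)]

print(edge_ngrams("Raven"))  # ['R', 'Ra', 'Rav', 'Rave', 'Raven']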
Ngram Analyzer
access in search queries: ngram
The starting point for exact substring matches is ngram tokenizing, which indexes all substrings up to length n. The drawback of ngram tokenizing is the large amount of disk space used.
Best practice:
Use ngram only if required - use carefully and not for every string
Settings allow you to define the minimum and maximum gram lengths created on indexing, as well as token_chars, the character classes to keep in tokens; Elasticsearch splits on characters that do not belong to any of these classes.
Example: Search "Raven"
NGrams (splits term into tokens with one character):
Rav
Rave
Raven
ave
aven
Ven
...
Example: Search "Pegasus"
NGrams (splits term into tokens with one character):
Pegasus
Degas
Examples are in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
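The following minimal Python sketch builds the n-grams of a single token, assuming min_gram 3 and max_gram 5. It also shows why a search for "Pegasus" can match "Degas": both terms produce shared grams such as "egas".
Python sketch
def ngrams(token: str, min_gram: int = 3, max_gram: int = 5) -> set[str]:
    # Every substring of the token whose length is between min_gram and max_gram.
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams

print(sorted(ngrams("Raven")))
# ['Rav', 'Rave', 'Raven', 'ave', 'aven', 'ven']
print(sorted(ngrams("pegasus") & ngrams("degas")))
# ['ega', 'egas', 'gas']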
No Diacritics Analyzer
access in search queries: no-diacritics
The no diacritics analyzer:
only works for text fields
strips diacritic characters, so when the text value is "Kovačić Mateo" you can search for "Kovačić Mateo" or "Kovacic Mateo".
An example can be found in the Elasticsearch documentation: ASCII folding token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html
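A rough approximation of the folding in Python, using Unicode decomposition; the Elasticsearch asciifolding filter covers considerably more characters than this sketch.
Python sketch
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (e.g. "č" -> "c" + combining caron) and drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Kovačić Mateo"))  # Kovacic Mateo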
For advanced search queries on analyzed fields, the query can be adjusted to consider the analyzer.