ngram Full-Text Parser


Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, [MySQL's] FULLTEXT parser cannot determine where words begin and end in these and other such languages.

MariaDB does not support the ngram and MeCab full-text parser plugins.

1. InnoDB full-text index (for Chinese search)


Before MySQL 5.7.6, full-text indexing only supported full-text indexing in English, but not full-text indexing in Chinese. You need to use a tokenizer to split the Chinese paragraphs into words and then store them in the database.


Starting with MySQL 5.7.6, MySQL has built-in ngram Full-text parser, used to support Chinese, Japanese, Korean word segmentation, and supports Myisam and InnoDB.

2. ngram full text parser

ngram is a sequence of n consecutive words in a piece of text. The ngram full-text parser can segment the text, and each word is a continuous sequence of n words. or example, you can tokenize “abcd” for different values of n using the ngram full-text parser.


n=1: 'a', 'b', 'c', 'd'
n=2: 'ab', 'bc', 'cd'
n=3: 'abc', 'bcd'
n=4: 'abcd'	

The global variable ngram_token_size is used in MySQL to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of words to be queried.


If you need to search for a word, set ngram_token_size to 1. In the case of the default value of 2, the search word can not get any results. Because the Chinese words are at least two Chinese characters, the default value of 2 is recommended. But the example here sets 1.


Two ways to set the global variable ngram_token_size:


1. Modify the MySQL configuration file (after setting, you need to restart the Mysql service and rebuild the index)


[mysqld] 
ngram_token_size=1
ft_min_word_len=1	

2. Set when starting Mysql

mysqld --ngram_token_size=2	

After setting, restart the mysql service,

DROP INDEX index_name ON table_name;
ALTER TABLE table_name ADD FULLTEXT INDEX ft_index(title) WITH PARSER ngram;

3. Create a full-text index

1. Create a full-text index while creating the table (note to add WITH PARSER ngram)

CREATE TABLE articles (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    title VARCHAR (200),
    body TEXT,
    FULLTEXT (title, body) WITH PARSER ngram
) ENGINE = INNODB;

2. Add by alter table (note to add WITH PARSER ngram)

ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title, body) WITH PARSER ngram;

3. Directly through the create index method (note to add WITH PARSER ngram)

CREATE FULLTEXT INDEX ft_index ON articles (title, body) WITH PARSER ngram;

4. Full-text search mode


There are two commonly used full-text search modes:


  1. Natural language mode (NATURAL LANGUAGE MODE)
    Natural language mode is MySQL's default full-text search mode. Natural language mode cannot use operators, and cannot specify complex queries such as keywords must appear or must not appear.
  2. BOOLEAN MODE
    The BOOLEAN mode can use operators, which can support complex queries that specify whether the keyword must appear or must not appear, or whether the weight of the keyword is high or low.

4.1. Examples

// Natural language mode
SELECT * FROM articles
WHERE MATCH (title, body)
 AGAINST ('Open source Wiki engine' IN NATURAL LANGUAGE MODE);
 
 // Do not specify the mode, use the natural language mode by default
SELECT * FROM articles
WHERE MATCH (title, body)
 AGAINST ('Open source Wiki engine');
 
 // Specify Boolean mode, display all the way, not display a zone
SELECT * FROM articles
WHERE MATCH (title,body)
 AGAINST ('+ powerful and extensible -WYTIWYG editor' IN BOOLEAN MODE);

Note

  • You can only create full-text indexes on fields of type CHAR, VARCHAR, or TEXT.
  • Full-text indexing only supports InnoDB and MyISAM engines.
  • MATCH (columnName) AGAINST ('keywords')
  • The field name used by the MATCH () function must be the same as the field name specified when creating the full-text index.
  • As in the example above, MATCH (title,body) Field names used and full-text index ft_articles(title,body). The defined field names are consistent. If you want to query the title or body fields separately, you need to create a new full-text index on the title and body fields.
  • The field names used by the MATCH () function can only be the fields of the same table, because full-text indexes cannot be retrieved across multiple tables.
  • If you want to import a large data set, using the method of importing data and then creating a full-text index on the table is much faster than creating a full-text index on the table and then importing the data, so the full-text index affects TPS.
  • When adding an index, remember to add WITH PARSER ngram, otherwise it will not take effect.

5. Resources