It is a common bias that SQL cannot handle unstructured data like text. However, this is not true. In this article, we explain how to tokenize text, build the vocabulary, normalize the word distribution, and compute pair-wise similarities between the documents, all in SQL.
In the next article, we will explain how to extend SQL syntax with SQLFlow to support latent topic modeling, a machine learning technique to learn semantics.
Import Text Data
Suppose that we have a text file that contains three sentences, each on a line.
fresh carnation flower
mother day
mother teresa
Let us create a table and import the text file into it. As we want to automatically assign each document a unique ID, we create the table with ID as an auto-incremental integer.
CREATE DATABASE IF NOT EXISTS play;USE play;DROP TABLE IF EXISTS docs;CREATE TABLE IF NOT EXISTS docs (
id INT NOT NULL AUTO_INCREMENT,
doc TEXT,
PRIMARY KEY (id));
Before we import the file, we need to move it to a special location where MySQL could access it. To find this location, please run the following command.
mysql> SHOW VARIABLES LIKE "secure_file_priv";
+------------------+-----------------------+
| Variable_name | Value |
+------------------+-----------------------+
| secure_file_priv | /var/lib/mysql-files/ |
+------------------+-----------------------+
After copying the text file to /var/lib/mysql-files, we can run the following command to import it.
LOAD DATA INFILE “/var/lib/mysql-files/a.txt”
INTO TABLE docs (doc);
Now, we can check the result.
mysql> SELECT * FROM docs;
+----+------------------------+
| id | doc |
+----+------------------------+
| 1 | fresh carnation flower |
| 2 | mother day |
| 3 | mother teresa |
+----+------------------------+
Tokenization
Tokenization is the most tricky part of this article. Some RDBMS user-defined functions (UDFs) that implement some tokenization algorithms. For generality, we explain a simple solution that separates string by whitespaces.
MySQL provides a function SUBSTRING_INDEX(string, delim, index), which, from the signature, looks like it separates string by delim into an array and returns the index-th element of this array. However, it is out of our imagination and returns the first index elements. To get the i-th substring separated by whitespaces, we need to call SUBSTRING_INDEX twice — the first call returns the first i substrings, and the second one returns the last substring of the result.
DROP TABLE IF EXISTS incr;
DROP PROCEDURE IF EXISTS generate_sequence;CREATE TABLE IF NOT EXISTS incr (
n INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (n));DELIMITER //
CREATE PROCEDURE generate_sequence()
BEGIN
DECLARE i int DEFAULT 0;
WHILE i < 5 DO
INSERT INTO incr () VALUES ();
SET i = i + 1;
END WHILE;
END
//
DELIMITER ;CALL generate_sequence;
The above code defines and calls a procedure generate_sequence(), which inserts 5 values to the table incr. We can check incr.n contains successive integers.
mysql> select * from incr;
+----+
| n |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
Given the sequence, we can do tokenization by converting each document string into at most 5 words.
CREATE TABLE doc_words
SELECT
docs.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(docs.doc, ' ', incr.n), ' ', -1) word
FROM
incr INNER JOIN docs
ON CHAR_LENGTH(docs.doc)
-CHAR_LENGTH(REPLACE(docs.doc, ' ', ''))>=incr.n-1
ORDER BY
id, n;
The join operation in the above FROM clause duplicates each document by n+1 times, where n is the number of whitespaces. The SELECT part transforms each duplicate into a word. The CREATE TABLE part writes the result into a table doc_words.
mysql> select * from doc_words;
+----+-----------+
| id | word |
+----+-----------+
| 1 | fresh |
| 1 | carnation |
| 1 | flower |
| 2 | mother |
| 2 | day |
| 3 | mother |
| 3 | teresa |
+----+-----------+
Stopwords
Suppose that we have a table of stopwords and we want to remove stopwords from the above result using the NOT IN predicate. The following example assumes that the stopword table contains only one row — fresh.
mysql> SELECT * FROM doc_words WHERE word NOT IN (SELECT 'fresh');
+----+-----------+
| id | word |
+----+-----------+
| 1 | carnation |
| 1 | flower |
| 2 | mother |
| 2 | day |
| 3 | mother |
| 3 | teresa |
+----+-----------+
We can check the vocabulary after stopword removal using the DISTINCT keyword.
mysql> SELECT DISTINCT word FROM doc_words WHERE word NOT IN (SELECT 'fresh');
+-----------+
| word |
+-----------+
| carnation |
| flower |
| mother |
| day |
| teresa |
+-----------+
Word Vectors
To represent each document by a vector of word counts, or, a word histogram, the following statement counts the occurrences of each word in a document and saves the result into table doc_word_count.
CREATE TABLE doc_word_count
SELECT id, word, count(word) as count
FROM doc_words GROUP BY id, word;
We can check the result.
mysql> SELECT * FROM doc_word_count;
+----+-----------+-------+
| id | word | count |
+----+-----------+-------+
| 1 | carnation | 1 |
| 1 | flower | 1 |
| 1 | fresh | 1 |
| 2 | day | 1 |
| 2 | mother | 1 |
| 3 | mother | 1 |
| 3 | teresa | 1 |
+----+-----------+-------+
Word Distributions
To normalize a histogram into distribution, we need the length of each document.
mysql> SELECT id, sum(count) as len FROM doc_word_count GROUP BY id;
+----+------+
| id | len |
+----+------+
| 1 | 3 |
| 2 | 2 |
| 3 | 2 |
+----+------+
The following command divides each word count by the corresponding document length and writes the result into table doc_word_list.
CREATE TABLE doc_word_dist
SELECT doc_word_count.id, word, count/len AS prob
FROM doc_word_count,
(SELECT id, sum(count) as len FROM doc_word_count GROUP BY id) s
WHERE doc_word_count.id = s.id;
Let’s check the result.
mysql> SELECT * FROM doc_word_dist;
+----+-----------+--------+
| id | word | prob |
+----+-----------+--------+
| 1 | carnation | 0.3333 |
| 1 | flower | 0.3333 |
| 1 | fresh | 0.3333 |
| 2 | day | 0.5000 |
| 2 | mother | 0.5000 |
| 3 | mother | 0.5000 |
| 3 | teresa | 0.5000 |
+----+-----------+--------+
Pairwise Comparison
The following command computes the dot-product similarity of every pair of documents.
SELECT x.id, y.id, sum(x.prob*y.prob)
FROM doc_word_dist x, doc_word_dist y
WHERE x.id > y.id AND x.word = y.word
GROUP BY x.id, y.id;
In our case, the result would contain only one line, because only doc 2 and 3 share the common word “mother”; all other pairs have 0 similarities.
+----+----+--------------------+
| id | id | sum(x.prob*y.prob) |
+----+----+--------------------+
| 3 | 2 | 0.25000000 |
+----+----+--------------------+