NLP in SQL: Word Vectors

It is a common misconception that SQL cannot handle unstructured data like text. However, this is not true. In this article, we explain how to tokenize text, build a vocabulary, normalize word distributions, and compute pairwise similarities between documents, all in SQL.

In the next article, we will explain how to extend SQL syntax with SQLFlow to support latent topic modeling, a machine learning technique to learn semantics.

Import Text Data

Suppose that we have a text file that contains three sentences, each on a line.
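The file itself is not shown here; a hypothetical example consistent with the later steps (a stopword "fresh" appears, and documents 2 and 3 share the word "mother") might look like this:

```text
fresh juice tastes good
my mother cooks rice
his mother sings well
```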

Let us create a table and import the text file into it. As we want to automatically assign each document a unique ID, we create the table with ID as an auto-incremental integer.
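A minimal sketch of such a table; the table name docs and the column names id and doc are assumptions:

```sql
CREATE TABLE docs (
  id  INT NOT NULL AUTO_INCREMENT,  -- unique document ID, assigned automatically
  doc TEXT,                         -- one sentence per row
  PRIMARY KEY (id)
);
```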

Before we import the file, we need to move it to a special location where MySQL can access it. To find this location, run the following command.
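With MySQL's default settings, that location is given by the secure_file_priv system variable:

```sql
SHOW VARIABLES LIKE 'secure_file_priv';
-- On many Linux installations this shows /var/lib/mysql-files/
```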

After copying the text file to /var/lib/mysql-files, we can run the following command to import it.
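A sketch of the import, assuming the file is named text.txt and the table is docs(id, doc) as above:

```sql
LOAD DATA INFILE '/var/lib/mysql-files/text.txt'
INTO TABLE docs
LINES TERMINATED BY '\n'
(doc);  -- fill only the doc column; id auto-increments
```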

Now, we can check the result.
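For example, assuming the table name docs:

```sql
SELECT * FROM docs;
```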

Tokenization

Tokenization is the trickiest part of this article. Some RDBMSes provide user-defined functions (UDFs) that implement tokenization algorithms. For generality, we explain a simple solution that splits strings on whitespace.

MySQL provides a function SUBSTRING_INDEX(string, delim, index), which, from its signature, looks like it splits string by delim into an array and returns the index-th element of that array. However, it actually returns the first index elements joined together. To get the i-th whitespace-separated substring, we need to call SUBSTRING_INDEX twice — the first call returns the first i substrings, and the second, with index -1, returns the last substring of that result.
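For example, to extract the second whitespace-separated token of a string:

```sql
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('my mother cooks rice', ' ', 2), ' ', -1);
-- The inner call returns 'my mother'; the outer call with index -1
-- returns the last token of that prefix, i.e. 'mother'.
```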

We define and call a procedure generate_sequence(), which inserts 5 values into the table incr. We can then check that incr.n contains the successive integers 1 through 5.
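A sketch of such a procedure, assuming a one-column table incr(n) and the at-most-5-words limit used below:

```sql
CREATE TABLE incr (n INT);

DELIMITER //
CREATE PROCEDURE generate_sequence()
BEGIN
  DECLARE i INT DEFAULT 1;
  WHILE i <= 5 DO                   -- one row per possible token position
    INSERT INTO incr VALUES (i);
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

CALL generate_sequence();
SELECT n FROM incr;  -- 1, 2, 3, 4, 5
```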

Given the sequence, we can do tokenization by converting each document string into at most 5 words.
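A sketch of this query, assuming the tables docs(id, doc) and incr(n) from above:

```sql
CREATE TABLE doc_words AS
SELECT id,
       SUBSTRING_INDEX(SUBSTRING_INDEX(doc, ' ', n), ' ', -1) AS word
FROM docs
JOIN incr
  -- keep n up to the number of words, i.e. the number of whitespaces + 1
  ON n <= LENGTH(doc) - LENGTH(REPLACE(doc, ' ', '')) + 1;
```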

The join in the FROM clause duplicates each document n+1 times, where n is the number of whitespaces in it. The SELECT clause transforms each duplicate into one word, and the CREATE TABLE clause writes the result into the table doc_words.

Stopwords

Suppose that we have a table of stopwords and we want to remove stopwords from the above result using the NOT IN predicate. The following example assumes that the stopword table contains only one row — fresh.
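A sketch of this step; the table names stopwords and doc_words_clean are assumptions:

```sql
CREATE TABLE stopwords (word VARCHAR(64));
INSERT INTO stopwords VALUES ('fresh');

CREATE TABLE doc_words_clean AS
SELECT id, word
FROM doc_words
WHERE word NOT IN (SELECT word FROM stopwords);
```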

We can check the vocabulary after stopword removal using the DISTINCT keyword.
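For example, against the assumed doc_words_clean table:

```sql
SELECT DISTINCT word FROM doc_words_clean;
```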

Word Vectors

To represent each document by a vector of word counts, or a word histogram, the following statement counts the occurrences of each word in each document and saves the result into the table doc_word_count.
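A sketch, assuming the stopword-filtered tokens live in a table doc_words_clean(id, word):

```sql
CREATE TABLE doc_word_count AS
SELECT id, word, COUNT(*) AS count
FROM doc_words_clean
GROUP BY id, word;
```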

We can check the result.
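For example:

```sql
SELECT * FROM doc_word_count ORDER BY id, word;
```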

Word Distributions

To normalize a histogram into a distribution, we need the length of each document.
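One way to compute it, assuming the doc_word_count table from above; the table name doc_len is an assumption:

```sql
CREATE TABLE doc_len AS
SELECT id, SUM(count) AS len   -- total number of (non-stopword) tokens
FROM doc_word_count
GROUP BY id;
```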

The following command divides each word count by the corresponding document length and writes the result into table doc_word_list.
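A sketch of that division, assuming the doc_word_count and doc_len tables above; the column name prob is an assumption:

```sql
CREATE TABLE doc_word_list AS
SELECT c.id, c.word, c.count / l.len AS prob
FROM doc_word_count c
JOIN doc_len l ON c.id = l.id;
```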

Let’s check the result.
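For example:

```sql
SELECT * FROM doc_word_list ORDER BY id, word;
```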

Pairwise Comparison

The following command computes the dot-product similarity of every pair of documents.
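A sketch, assuming the doc_word_list(id, word, prob) table from the previous step. The self-join matches shared words, and the condition a.id < b.id keeps each pair of documents only once:

```sql
SELECT a.id AS doc1, b.id AS doc2,
       SUM(a.prob * b.prob) AS similarity
FROM doc_word_list a
JOIN doc_word_list b
  ON a.word = b.word AND a.id < b.id
GROUP BY a.id, b.id;
```

Note that pairs with no shared words produce no joined rows at all, so they simply do not appear in the output.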

In our case, the result contains only one row, because only documents 2 and 3 share a common word, “mother”; all other pairs have zero similarity.
