In late 2016, I was appointed tech lead of PaddlePaddle, Baidu's open-source deep learning system. The team moved the technology from the Caffe1 generation toward something closer to a deep learning language and named it PaddlePaddle Fluid. By 2019, the Fluid version was overwhelmingly used in Baidu products. A lot of good things happened during the journey, but this article is all about regrets.

At a time when TensorFlow had built a large community and PyTorch had not yet been released, upgrading PaddlePaddle from the graph-based autodiff approach to a newer generation of technology basically came down to two choices: (1) imperative programming, also known as the dynamic network, and (2) autodiff-by-the-compiler. I aggressively chose the latter and named it PaddlePaddle Fluid, which, however, took the team two years to deliver as a stable and usable system. …

It is a common misconception that SQL cannot handle unstructured data like text. In this article, we explain how to tokenize text, build the vocabulary, normalize the word distribution, and compute pairwise similarities between documents, all in SQL.

In the next article, we will explain how to extend SQL syntax with SQLFlow to support latent topic modeling, a machine learning technique to learn semantics.

Import Text Data

Suppose that we have a text file that contains three sentences, one per line.

fresh carnation flower
mother day
mother teresa

Let us create a table and import the text file into it. Because we want each document to be assigned a unique ID automatically, we create the table with the ID as an auto-incrementing integer. …
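
The import step can be sketched as follows. This is a minimal illustration, not the article's own statements: it assumes SQLite driven from Python's standard sqlite3 module, and the table name docs, the column names id and doc, and the helper load_docs are hypothetical. Other databases spell the auto-incrementing column differently.

```python
import sqlite3

# The three example sentences from the text file, one document per line.
SENTENCES = ["fresh carnation flower", "mother day", "mother teresa"]


def load_docs(lines):
    """Create a hypothetical `docs` table and import one row per line."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        """
        CREATE TABLE docs (
            id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned by the database
            doc TEXT NOT NULL                       -- the raw sentence
        )
        """
    )
    # Insert only the text; the ID column is filled in automatically.
    conn.executemany("INSERT INTO docs (doc) VALUES (?)", [(s,) for s in lines])
    conn.commit()
    return conn


conn = load_docs(SENTENCES)
rows = conn.execute("SELECT id, doc FROM docs ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Because the insert statement names only the doc column, each sentence receives the next free integer as its ID, starting from 1 in a fresh table.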

Yi Wang
