Deep Learning and the Evolution of Useful Information
These days I spend most weekends in front of a computer, teaching remote learners how to extract information from data. That sounds vague and ambitious, and in a sense it is. Day to day, though, it is more mundane. Most of the time, I teach statistics, the field I am trained in. But increasingly I am asked to teach machine learning and its cousin, artificial intelligence (AI). Initially, this baffled me. Why is a statistician (and one whose side qualifications are in mathematics and economics) considered suitable to teach what are essentially computational technologies? I submit that the answer lies in how these technologies use data in order to be successful. More pointedly, it lies in how they have redefined the notions of data and information in ways that are useful and crucial for their tasks. Let me explain.
Logistic Regression: keeping things simple
We begin with regression, specifically logistic regression. This has long been a staple algorithm for statisticians. A typical application begins with data on two kinds of entities, say, patients with benign tumors and malignant tumors. For both these kinds of patients we have data on the same set of variables, say, patient age, size of tumor, and presence of specific genes. The variables are expressed numerically (this will be important later), possibly with yes/no-type encoding. These variables are combined in a linear fashion—i.e., multiplying by coefficients and adding up. This linear combination is then converted to a score by passing it through a very special kind of function which (to be consistent with what’s to come) we will call an activation function. If the linear combination gets a low score when passed through the activation function, we predict the patient’s tumor is benign. If the score is high, our prediction is that the tumor is malignant.
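To make the scoring step concrete, here is a minimal sketch in Python. The coefficient values, the variable names (age_years, tumor_size_cm, has_gene), and the 0.5 cutoff are all made up for illustration; they are assumptions, not estimates from any real data.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation: maps any real number to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration (not estimated from data).
INTERCEPT = -6.0
COEFFICIENTS = {"age_years": 0.03, "tumor_size_cm": 1.2, "has_gene": 1.5}

def malignancy_score(patient: dict) -> float:
    """Combine the variables linearly, then pass the sum through the activation."""
    linear = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return sigmoid(linear)

def predict(patient: dict, threshold: float = 0.5) -> str:
    """Low score: predict benign. High score: predict malignant."""
    return "malignant" if malignancy_score(patient) >= threshold else "benign"

# Two made-up patients; has_gene uses yes/no encoding (1 or 0).
small_tumor = {"age_years": 40, "tumor_size_cm": 1.0, "has_gene": 0}
large_tumor = {"age_years": 65, "tumor_size_cm": 4.0, "has_gene": 1}

for patient in (small_tumor, large_tumor):
    print(predict(patient), round(malignancy_score(patient), 3))
```

The first patient gets a low score and the second a high one, purely because of how the illustrative coefficients weight tumor size and the gene indicator.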
Of course, the success of such an algorithm is determined by whether the prediction matches observed reality. The way to best ensure this is to keep tweaking (“estimating,” in the lingo) the coefficients that combine the variables until we get the best agreement between prediction and observation. We do this with historical data, where we know the true state of the tumor in our example. Clinicians will then want to validate this estimated logistic regression model with the prediction of future cases which the model has not “seen.”
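The tweaking and validating described above can be sketched as follows. This is one simple way to do it (plain gradient descent on the log loss), not the only one. The data are simulated, and the sample sizes, learning rate, and "true" coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "historical" data: 400 patients, 3 numeric variables (assumed standardized).
n_patients, n_vars = 400, 3
X = rng.normal(size=(n_patients, n_vars))
true_coef = np.array([0.5, 2.0, 1.0])  # used only to simulate outcomes
p_true = 1.0 / (1.0 + np.exp(-(X @ true_coef - 0.25)))
y = (rng.random(n_patients) < p_true).astype(float)  # 1 = malignant, 0 = benign

# Hold out the last 100 patients as cases the model has not "seen".
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Estimate the coefficients by repeatedly nudging them to reduce the log loss.
coef = np.zeros(n_vars)
intercept = 0.0
learning_rate = 0.1
for _ in range(2000):
    scores = sigmoid(X_train @ coef + intercept)
    error = scores - y_train  # gradient of the log loss with respect to the linear term
    coef -= learning_rate * (X_train.T @ error) / len(y_train)
    intercept -= learning_rate * error.mean()

def accuracy(X_part, y_part):
    """Share of cases where the thresholded score matches the observed outcome."""
    predicted = sigmoid(X_part @ coef + intercept) >= 0.5
    return float((predicted == (y_part == 1.0)).mean())

train_accuracy = accuracy(X_train, y_train)
test_accuracy = accuracy(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}, held-out accuracy: {test_accuracy:.2f}")
```

The held-out accuracy is the number a clinician would care about, since it reflects cases that played no part in the estimation.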
Neural Networks: more complicated
In parallel to statisticians, computer scientists developed their version of logistic regression. Except that version was much more ambitious. They motivated activation functions by thinking of them as approximations of how a neuron in our brain takes inputs and fires an output yes/no. Once thought of this way, the logical extension was to think of neurons connecting with other neurons. In other words, one neuron’s output becomes another neuron’s input. We can arrange the activation functions in such a neural network. The neural network takes the same kind of input as the logistic regression and generates the same kind of binary prediction.
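A small sketch may help show what "one neuron's output becomes another neuron's input" means in practice. The layer widths and the random weights below are assumptions for illustration only; nothing here has been trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass inputs through each layer: a linear combination followed by an activation.
    The outputs of one layer are the inputs of the next."""
    activation = x
    for weights, bias in layers:
        activation = sigmoid(weights @ activation + bias)
    return activation

rng = np.random.default_rng(1)
x = np.array([0.2, -1.0, 0.7])  # three numeric inputs, as in logistic regression

# Random, untrained weights: 3 inputs -> 4 hidden neurons -> 1 output neuron.
hidden_layer = (rng.normal(size=(4, 3)), np.zeros(4))
output_layer = (rng.normal(size=(1, 4)), np.zeros(1))
network = [hidden_layer, output_layer]

score = forward(x, network)[0]
prediction = "high" if score >= 0.5 else "low"
print(f"network score: {score:.3f} ({prediction})")

# With no hidden layer, the same function reduces to logistic regression.
logistic_only = [(rng.normal(size=(1, 3)), np.zeros(1))]
logistic_score = forward(x, logistic_only)[0]
print(f"single-layer score: {logistic_score:.3f}")
```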
Neural networks can be far better than logistic regression at predicting correctly. So, why were they not universally adopted and why did logistic regression (“a neural network with no hidden layers”) persist? The answer lies in the amount of data needed. The complexity of neural networks with their many layers of activation functions comes with a price. There are many more coefficients to estimate (here the lingo would be “weights to learn”) and so a lot more data is needed. Moreover, substantially more computational time and power are needed. Advances in technology eased the computational limitation, but the data limitation remained. Simply put, neural networks did not have enough information.
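A quick count shows how fast the number of weights grows. The sketch below assumes fully connected layers with one bias per neuron, and the hidden-layer widths are arbitrary choices for illustration.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Logistic regression: 3 inputs feeding 1 output, with no hidden layers.
logistic = count_parameters([3, 1])

# A small network: the same 3 inputs, two hidden layers of 16 neurons, 1 output.
small_net = count_parameters([3, 16, 16, 1])

print(logistic, small_net)  # 4 353
```

Even this modest network has nearly a hundred times as many quantities to learn as the logistic regression on the same three variables.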
Such was the state of affairs in the “AI winter” of the 1990s. The historic AI Lab at MIT was on the way to being eventually combined with the Laboratory for Computer Science, and the Biostatistics department at neighboring Harvard was several times the size of the university’s Statistics department. Logistic regression was flourishing; neural networks were not. That was about to change.
Deep Neural Networks: getting serious with images and text
Let’s go back to our tumor-prediction example. Suppose we also have CT scans (each a combination of X-ray images) available for each patient. These scans are in standard diagnostic use and they can aid in making better predictions. Can this new information be incorporated into logistic regressions and neural networks? Somewhat trivially, yes. All we need to do is extract pixel values from the images and treat the pixel values as numerical inputs. But there is a catch. The dimension of the input data now becomes very large. Conservatively, we will have hundreds of thousands of pixel values. A standard clinical trial may have a few hundred patients. Conventional statistical estimation of logistic regression coefficients breaks down (there are clever recent fixes to this). Neural networks also have this problem. The number of weights to learn is large and the extra dimensions from the image data actually help in learning, but still the problem of the small number of examples to learn from persists.
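The mismatch between inputs and examples is easy to see with numbers. The sketch below assumes a single 512 x 512 grayscale slice per patient and 300 patients. Both figures are illustrative assumptions, and the "scan" is random noise standing in for a real image.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random stand-in for one grayscale scan slice (assumed size: 512 x 512 pixels).
scan = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)

# Treat every pixel as one numeric input, scaled to the range 0 to 1.
pixel_inputs = scan.reshape(-1).astype(float) / 255.0
n_inputs = pixel_inputs.size

n_patients = 300  # assumed size of a typical clinical data set
coefficients_per_patient = (n_inputs + 1) / n_patients  # +1 for the intercept

print(f"{n_inputs} pixel inputs for {n_patients} patients")
print(f"about {coefficients_per_patient:.0f} coefficients to estimate per patient")
```

With hundreds of coefficients per patient, there is simply not enough historical data to pin each coefficient down.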
Welcome to the world of Big Data. Here, images come in large volumes, many varieties, and at high velocity. As I write this, https://image-net.org/ has over 14 million images. This is a gold mine for neural networks to learn from, and many neural networks have indeed learned from it. Finally, neural networks had the information, and the resulting data, to leverage their adaptive power to predict.
But an image is more than just a collection of pixels. The pixels describe objects and scenes. They have non-numeric context. The conversion of pixels to raw numbers loses this context. Images are informative in ways that are more than the sum of their pixels. The next innovation in neural networks got at the heart of this.
The 2018 Turing Award (the “Nobel Prize of Computing”) was given to Bengio, Hinton, and LeCun for their work on deep neural networks. Over the last two decades or so, they and other computer scientists developed ways to provide inputs to neural networks that capture objects in images, phrases in text, and other such informative features.
To better understand how deep neural networks extract this kind of information, consider a specific influential variant: convolutional neural networks (CNNs). A convolution is an operation done on a small set of neighboring pixels. It identifies whether the set of pixels together represents something informative, for example, an edge or border where shade or color changes sharply. This operation is, loosely speaking, a form of differencing. If there is an edge present in a 3 x 3 set of neighboring pixels, then there will be a difference in pixel value between top and bottom or between left and right. A suitable convolution picks that up and passes the value to the next layer of the neural network. The next layer may be another set of convolutions, or it can be a so-called subsampling or pooling layer, where the results of the convolutions are sampled or averaged. This keeps the number of values being passed from layer to layer in check.
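Here is a minimal sketch of an edge-detecting convolution followed by a pooling step. The 6 x 6 toy image, the hand-picked kernel, and the use of max pooling (rather than averaging) are assumptions for illustration; real CNN libraries do the same thing far more efficiently.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products.
    Strictly this is cross-correlation, which deep learning practice calls convolution."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy image: dark left half (0) and bright right half (1), so a vertical edge in the middle.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Differencing kernel: right column minus left column of each 3 x 3 neighborhood.
vertical_edge_kernel = np.array([[-1.0, 0.0, 1.0],
                                 [-1.0, 0.0, 1.0],
                                 [-1.0, 0.0, 1.0]])

edges = convolve2d_valid(image, vertical_edge_kernel)  # shape (4, 4)
pooled = max_pool(edges)                               # shape (2, 2)

print(edges[0])      # [0. 3. 3. 0.]
print(pooled.shape)  # (2, 2)
```

The response is zero where the neighborhood is uniform and large where it straddles the edge, and pooling shrinks the 4 x 4 result to 2 x 2 before it is passed on.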
There is a crucial idea to note here. Which particular convolution to use is not pre-decided by the programmer. This is learned by the neural net itself using variants of the weight adjustment or “back propagation” process. So, deep convolutional neural networks learn not just the best way to combine inputs, but also the best transformation of the inputs that associates with the outcome that needs to be predicted. In a sense, a CNN is a feature selector as well. We may not be able to understand the selected features as humans do, but many examples show that the prediction or classification problem is well-solved for images.
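The idea that the convolution itself is learned can be illustrated with a deliberately simplified sketch. Here a 3 x 3 kernel starts at zero and is adjusted by gradient descent until its responses match those of a target edge kernel. This is an assumption-laden toy: a real CNN learns its kernels from class labels through back propagation across many layers, not from a known target kernel, and the random image and learning rate are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def extract_patches(image, size=3):
    """Return every size x size neighborhood as a row of length size * size."""
    h, w = image.shape
    rows = [image[i:i + size, j:j + size].ravel()
            for i in range(h - size + 1)
            for j in range(w - size + 1)]
    return np.array(rows)

# Random stand-in image and the edge kernel whose behavior we want to recover.
image = rng.normal(size=(32, 32))
target_kernel = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

patches = extract_patches(image)                    # shape (900, 9)
target_response = patches @ target_kernel.ravel()   # what the "right" convolution outputs

# Start from an all-zero kernel and adjust its nine weights to shrink the squared error.
kernel = np.zeros(9)
learning_rate = 0.5
for _ in range(300):
    residual = patches @ kernel - target_response
    gradient = patches.T @ residual / len(patches)
    kernel -= learning_rate * gradient

learned_kernel = kernel.reshape(3, 3)
print(np.round(learned_kernel, 3))
```

Nobody typed the edge kernel into the learner; it emerged from the weight adjustments, which is the point of the paragraph above.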
[Figure: Convolutional Neural Network]
A whimsical example may illustrate. One of the standard test cases used for image classification concerns the ability to correctly classify an image as that of a dog or of a cat. A CNN is trained on a large collection of labelled images where the algorithm knows what a cat image looks like and what a dog image looks like. Suppose that cats often have sleek coats and that many dogs are, well, shaggy. If this is indeed the case, then the edge convolutions we described above can identify the nature of the boundary between animal and background by tracing the outlines of the shape. In turn, this feature can be selected or retained by the various layers of the CNN as it is correctly (one presumes) associated with sleek cats or shaggy dogs.
Other types of deep neural networks, such as recurrent neural networks (RNNs), can similarly learn features from other kinds of data. An RNN processes words in a passage of text, one at a time, so the sequence of words in a phrase or a sentence can be associated with an outcome, such as positive or negative sentiment.
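The word-by-word processing can be sketched with a bare-bones recurrent cell. Everything specific here is an assumption for illustration: the six-word vocabulary, the hidden size, and above all the random weights, which are untrained, so the scores carry no real sentiment meaning. The sketch only shows the mechanics and why word order matters.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy vocabulary (hypothetical); each word becomes a one-hot vector.
vocabulary = ["the", "movie", "was", "not", "good", "bad"]
word_index = {word: i for i, word in enumerate(vocabulary)}

hidden_size = 5
# Random, untrained weights: a real RNN would learn these from labeled text.
W_input = rng.normal(scale=0.5, size=(hidden_size, len(vocabulary)))
W_hidden = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
w_output = rng.normal(scale=0.5, size=hidden_size)

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[word_index[word]] = 1.0
    return vector

def sentiment_score(words):
    """Read the words one at a time, carrying a hidden state forward,
    then squash the final state into a score between 0 and 1."""
    hidden = np.zeros(hidden_size)
    for word in words:
        hidden = np.tanh(W_input @ one_hot(word) + W_hidden @ hidden)
    return float(1.0 / (1.0 + np.exp(-(w_output @ hidden))))

score_a = sentiment_score(["not", "good"])
score_b = sentiment_score(["good", "not"])
print(f"{score_a:.3f} {score_b:.3f}")
```

The two phrases contain the same words, yet the scores differ because the hidden state depends on the order in which the words arrive.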
The reasons for the great and growing success of deep neural networks are many, including innovative computational steps in back propagation and computer memory usage. But a primary reason is the one we have emphasized here—the extension of learning from numbers to learning from other kinds of information, such as images and text. The human brain and senses do this kind of information extraction routinely as we see, speak, and hear. Now computers are catching up. AI is no longer in winter.
We note in passing that deep learning now encompasses more than deep neural networks. It includes generative models (as opposed to the discriminative models of the type above). For example, these models create new images (distinct from predicting the class of a new image). Given deep learning’s ability to extract features, these generated “fake” images can be scarily close to the real thing.
Geometric Deep Learning: moving on to networks and molecules
Progress continues. One particularly promising direction is the notion of geometric deep learning. A recent description of this is machine learning from “grids, groups, graphs, geodesics, and gauges.” Very roughly speaking, these refer to various notions of distance. In an image, we have a natural notion of how far apart objects are. We can calculate things using Euclidean distance—the usual length of a straight line between objects. Now consider an image representation of a 3-dimensional object. When seeing a movie, we have no difficulty in recreating a depth perception in our brain despite seeing a 2-dimensional screen. The brain does this by processing a notion of depth which essentially distorts the notion of distance from what we “see” in 2-d to a 3-d reality implied by the scene. This distorted distance can be referred to as a geodesic, the name deriving from what we might calculate to be the distance between two points on a globe based on what we see on a map. Geometric deep learning extends deep learning by using such alternate non-Euclidean notions of distance. Applications to 3-dimensional graphics follow this route.
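The map-versus-globe comparison can be made concrete. The sketch below assumes a spherical Earth of radius 6371 km and a flat map on which one degree of latitude or longitude is drawn the same length everywhere. Both are simplifying assumptions, and the haversine formula is one standard way to compute the geodesic on a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a perfectly spherical Earth

def great_circle_km(lat1, lon1, lat2, lon2):
    """Geodesic (shortest path along the sphere) via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def flat_map_km(lat1, lon1, lat2, lon2):
    """Euclidean straight line on a flat map where every degree has the same length."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)

# Two points at latitude 60 degrees north, a quarter of the way around the globe apart.
on_map = flat_map_km(60.0, 0.0, 60.0, 90.0)
on_globe = great_circle_km(60.0, 0.0, 60.0, 90.0)
print(f"flat map: {on_map:.0f} km, globe: {on_globe:.0f} km")
```

Far from the equator the flat map badly overstates the distance, which is exactly the kind of distortion a non-Euclidean notion of distance corrects.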
Distances can be more imaginatively defined. A graph is a set of points connected by edges. (Not all points need be connected to each other.) Distance here refers to the number of edges that must be traversed to get from one point to another. A social network can be modeled as such a graph with edges being connections between people. A different kind of example is a molecule, where the points can be atoms or functional groups of atoms. The edges are bonds or other chemical affinities. By modeling molecules as graphs, pioneers in geometric deep learning are making contributions to drug design and using machine learning to find effective new medicines for cancer. The cycle of diagnosis to therapy by AI is nearing completion.
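Graph distance is simple to compute with a breadth-first search. The sketch below uses ethanol as the example molecule, with atom labels such as "C1" and "H_O" invented for illustration. It shows only the notion of distance, not a geometric deep learning model.

```python
from collections import deque

def graph_distance(edges, start, goal):
    """Number of edges on the shortest path from start to goal (breadth-first search).
    Returns None if the two points are not connected."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distances[node]
        for nxt in neighbors.get(node, ()):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return None

# Ethanol (CH3-CH2-OH) as a graph: atoms are points, bonds are edges.
# The atom labels are hypothetical names chosen for this example.
ethanol_bonds = [
    ("C1", "H1a"), ("C1", "H1b"), ("C1", "H1c"),
    ("C1", "C2"),
    ("C2", "H2a"), ("C2", "H2b"),
    ("C2", "O"),
    ("O", "H_O"),
]

print(graph_distance(ethanol_bonds, "C1", "O"))     # 2
print(graph_distance(ethanol_bonds, "H1a", "H_O"))  # 4
```

The same function works unchanged on a social network, with people as points and connections as edges.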
Deep learning has changed what we mean by data. Information that is analog in origin, like words, pictures, and molecules, can now be directly interpreted by computers and learned from. Claude Shannon taught us that information is digital, and it is true that our current algorithms convert information to digital data. But we seem to be moving towards a more direct use of information that bypasses obvious digitization. One of my literary heroes growing up, Isaac Asimov, imagined a future managed by a computer named Multivac, where the last “ac” stood for analog computer. Perhaps Asimov’s prediction of the future of computing and its nature of information processing may have some merit after all. We shall know, soon.
If you want to dig deeper:
Logistic regression, neural networks, and deep neural networks are part of a set of methods laid out in the book at https://www.statlearning.com/ (now in its second edition), which has had great influence on students and adult learners.
The founders of deep learning wrote about their understanding of the emerging field and their hopes for it in https://www.nature.com/articles/nature14539. This is a highly cited review paper written for a broader scientific audience.
The mathematics needed to understand the unifications proposed in geometric deep learning can be daunting. But seeing things helps. The video by Bronstein at www.youtube.com/watch?v=w6Pw4MOzMuo does this very well.
Cite this article in APA as: Sarkar, A. (2021, December 20). Deep learning and the evolution of useful information. Information Matters, Vol. 1, Issue 12. https://r7q.22f.myftpupload.com/2021/12/deep-learning-and-the-evolution-of-useful-information/