Computational Finance Journal

Wednesday, September 22, 2004

Dirichlet priors used by Borodin et al.

On the Dirichlet Prior and Bayesian Regularization:

The Problem: To understand how Bayesian regularization using a Dirichlet prior over the model parameters affects the learned model structure.

Motivation & Previous Work: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. Regularization is essential when learning from finite data sets. It provides not only smoother estimates of the model parameters compared to maximum likelihood but also guides the selection of model structures. In the Bayesian approach, regularization is achieved by specifying a prior distribution over the parameters and subsequently averaging over the posterior distribution. In domains comprising discrete variables with a multinomial distribution, the Dirichlet distribution is the most commonly used prior over the parameters, for two reasons: first, the Dirichlet distribution is the conjugate prior of the multinomial distribution and hence permits analytical calculations; second, the Dirichlet prior is intimately tied to the desirable likelihood-equivalence property of network structures [1, 3]. The so-called equivalent sample size measures the strength of the prior belief. In [3], it was pointed out that a very strong prior belief can degrade the predictive accuracy of the learned model due to severe regularization of the parameter estimates. In contrast, the dependence of the learned network structure on the prior strength has not received much attention in the literature, despite its relevance for recovering the true network structure underlying the data.
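
For concreteness, here is a minimal sketch (notation assumed, not taken from the abstract) of why the Dirichlet prior permits analytical calculations: it is conjugate to the multinomial distribution, so the posterior is again a Dirichlet whose hyper-parameters are the prior hyper-parameters plus the observed counts, and the sum of the prior hyper-parameters is the equivalent sample size.

# Dirichlet-multinomial conjugacy: the posterior hyper-parameters are the
# prior hyper-parameters plus the observed counts.
def dirichlet_posterior(prior_hyperparams, counts):
    return [a + n for a, n in zip(prior_hyperparams, counts)]

ess = 2.0                      # equivalent sample size (prior strength)
r = 3                          # number of states of the variable
prior = [ess / r] * r          # symmetric Dirichlet prior
counts = [7, 2, 1]             # hypothetical observed counts
print(dirichlet_posterior(prior, counts))   # [7.666..., 2.666..., 1.666...]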

Approach: Our work focuses on the effects of prior strength on the regularization of the learned network structure; in particular, we consider the class of Bayesian network (belief network) models. Surprisingly, it turns out that a weak prior in the sense of a small equivalent sample size leads to a strong regularization of the model structure (a sparse graph) given a sufficiently large data set. In particular, the empty graph is obtained in the limit of vanishing prior strength, independent of any dependencies implied by the (sufficiently large) data set. This is diametrically opposed to what one might expect in this limit, namely the complete graph that an (unregularized) maximum-likelihood approach would yield.
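
The following sketch illustrates this effect for two dependent binary variables (hypothetical counts; the scoring function is the standard BDeu marginal likelihood, not code from the paper): as the equivalent sample size shrinks toward zero, the Bayesian score comes to prefer the empty graph over the graph with the edge X -> Y, even though the data clearly exhibit the dependence.

import math

def family_log_score(counts, ess):
    # Log BDeu marginal likelihood of one node; `counts` holds, for each
    # parent configuration, the list of counts of the node's states.  The
    # equivalent sample size `ess` is spread uniformly over all cells.
    q, r = len(counts), len(counts[0])
    a_jk = ess / (q * r)               # hyper-parameter per cell
    a_j = ess / q                      # hyper-parameter per parent config
    logp = 0.0
    for row in counts:
        logp += math.lgamma(a_j) - math.lgamma(a_j + sum(row))
        for n_jk in row:
            logp += math.lgamma(a_jk + n_jk) - math.lgamma(a_jk)
    return logp

# Hypothetical counts for 1000 samples of two dependent binary variables.
counts_Y_given_X = [[280, 220],    # X = 0: counts of (Y=0, Y=1)
                    [220, 280]]    # X = 1: counts of (Y=0, Y=1)
counts_Y_alone   = [[500, 500]]    # Y ignoring X (empty graph)

# The score of node X is identical in both graphs, so the structures are
# compared by the score of Y alone.  A negative log Bayes factor means the
# empty graph is preferred.
for ess in [0.001, 0.01, 0.1, 1.0, 10.0]:
    log_bf = (family_log_score(counts_Y_given_X, ess)
              - family_log_score(counts_Y_alone, ess))
    print(f"ESS = {ess:6.3f}   log Bayes factor (edge vs. empty) = {log_bf:+.2f}")

With these particular counts the preference flips to the empty graph once the equivalent sample size drops to a few hundredths; the exact crossover depends on the counts, but the direction of the trend is what matters here.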

This surprising effect is a consequence of the Dirichlet prior distribution. In the limit of a vanishing prior strength, the Dirichlet prior converges to a discrete distribution over the parameter simplex in the sense that the probability mass concentrates on the corners of the simplex. This is due to the vanishing hyper-parameters of the Dirichlet prior.
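
A quick numerical check of this concentration effect (a sketch assuming NumPy, not part of the abstract): drawing from a symmetric Dirichlet with ever smaller hyper-parameters, the fraction of draws that lie essentially at a corner of the simplex approaches one.

import numpy as np

rng = np.random.default_rng(0)
for alpha in [10.0, 1.0, 0.1, 0.01]:       # hyper-parameter per component
    samples = rng.dirichlet([alpha] * 3, size=10_000)
    # A draw is "at a corner" if one component carries nearly all the mass.
    near_corner = np.mean(samples.max(axis=1) > 0.99)
    print(f"alpha = {alpha:5.2f}   fraction of draws near a corner = {near_corner:.3f}")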

In the other extreme case, where the prior strength is very large, a very dense graph structure is typically obtained. Between these two extremes, there is a gradual transition from sparser to denser graph structures as the prior strength increases. This implies that the regularization of the network structure diminishes with growing prior strength. Surprisingly, this runs in the opposite direction to the regularization of the parameters, which behaves as expected: parameter regularization increases with growing prior strength. Hence, the strength of the prior belief balances the trade-off between regularizing the parameters and regularizing the structure of the Bayesian network model.

When learning Bayesian networks from data, a careful choice of prior strength is hence necessary in order to achieve a (close to) optimal trade-off. The extreme cases do not provide useful insight into the statistical dependencies among the variables in a domain: the limit of vanishing prior strength entails that, given a sufficiently large data set, the parameters pertaining to each individual variable are estimated in a maximum-likelihood manner, independently of all the other variables (empty graph); in the other extreme case, while a very strong prior belief can entail the complete graph, the estimated parameters are so severely smoothed that the resulting model predicts a (close to) uniform distribution over all the variables in the domain (assuming an uninformative prior over the parameters).
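
As a small illustration of these two extremes (again a sketch with hypothetical counts, not the authors' code): under a symmetric Dirichlet prior with equivalent sample size ess over r states, the posterior-mean estimate of each parameter is (N_k + ess/r) / (N + ess), which recovers the relative frequencies as ess approaches zero and the uniform distribution as ess grows large.

def posterior_mean(counts, ess):
    # Posterior mean of the multinomial parameters under a symmetric
    # Dirichlet prior with equivalent sample size `ess`.
    r, n = len(counts), sum(counts)
    return [(c + ess / r) / (n + ess) for c in counts]

counts = [70, 20, 10]                  # hypothetical counts, N = 100
print(posterior_mean(counts, 1e-6))    # ~ maximum likelihood: 0.7, 0.2, 0.1
print(posterior_mean(counts, 1e6))     # ~ uniform: 0.333, 0.333, 0.333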

Impact: Our work shows that the prior strength does not determine the degree of regularization of the model as a whole; instead, it determines the trade-off between regularizing the parameters and regularizing the structure of the model. Not only does this surprising finding enhance the theoretical understanding of Bayesian regularization using a Dirichlet prior, but it also has a major impact on practical applications of learning Bayesian network models in domains with discrete variables: the prior strength has to be chosen with great care in order to achieve an optimal trade-off, enabling one to recover the true network structure underlying the data.

This was derived from http://www.csail.mit.edu/research/abstracts/abstracts03/machine-learning/17steck.pdf
