Rozmiar: 8938 bajtów


Principle of Maximum Entropy



#REDIRECT Principle_of_maximum_entropy

Principle of maximum entropy



The principle of maximum entropy is a method for analyzing the available information in order to determine a unique Bayesian probability distribution. Claude E. Shannon, the originator of information theory, defined a measure of uncertainty for a probability distribution (''H''(p) = - Σ ''pi'' log ''pi'') which he called information entropy. In his work, information entropy was determined from (i.e. was a function of) a given probability distribution. The principle of maximum entropy tells us that the converse is also possible: a probability distribution can be determined using the information entropy concept. It states the probability distribution that uniquely represents or encodes our state of information is the one that maximizes the uncertainty measure ''H''(p) while remaining consistent with our information. Naturally, this rule is meaningless to those who espouse the relative frequency, for whom probabilities are relative frequencies rather than degrees of belief in uncertain propositions, conditional upon a state of information. The principle is controversial even among those who endorse the epistemic approach to probability. == Testable information == The principle of maximum entropy is only useful when all of our information is of a class called testable information. A piece of information is testable if we can determine whether or not a given distribution is consistent with it. For example, the statements :"The expectation of the variable ''x'' is 2.87" and :"''p2''+''p3'' > 0.9" are statements of testable information. Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers. Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy probability distribution is the uniform distribution, :p_i=\frac{1}{n}\ {\rm for\ all}\ i\in\{\,1,\dots,n\,\}. The principle of maximum entropy can thus be seen as a generalization of the classical principle of indifference, also known as the principle of insufficient reason. ==General solution for the maximum entropy distribution with linear constraints == === Discrete case === We have some testable information ''I'' about a quantity ''x'' ∈ {''x1'', ''x2'',..., ''xn''}. We express this information as ''m'' constraints on the expectations of the functions ''fk'', i.e. we require our epistemic probability distribution to satisfy :\sum_{i=1}^n \Pr(x_i|I)f_k(x_i) = F_k \qquad k = 1, \cdots,m Furthermore, the probabilities must sum to one, giving the constraint :\sum_{i=1}^n \Pr(x_i|I) = 1 The probability distribution with maximum information entropy subject to these constraints is :\Pr(x_i|I) = \frac{1}{Z(\lambda_1,\cdots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right] with normalization constant determined by : Z(\lambda_1,\cdots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right] (Interestingly, the Pitman-Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficiency (statistics) is that it have the general form of a maximum entropy distribution.) The λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to :F_k = -\frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\cdots, \lambda_m) These ''m'' simultaneous equations do not generally possess a closed form solution, and are usually solved by numerical methods. === Continuous case === For continuous distributions, (E.T. Jaynes, 1963, 1968, 2003) finds that the limiting form of the entropy expression as the distribution approaches a continuous distribution is :H_c=-\int p(x)\log\frac{p(x)}{m(x)}\,dx where ''m''(''x''), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that it is known; we will discuss it further after the solution equations are given. We have some testable information ''I'' about a quantity ''x'' which takes values in some interval (mathematics) of the real numbers (all integrals below are over this interval). We express this information as ''m'' constraints on the expectations of the functions ''fk'', i.e. we require our epistemic probability density function to satisfy :\int p(x|I)f_k(x)dx = F_k \qquad k = 1, \cdots,m And of course, the probability density must integrate to one, giving the constraint :\int p(x|I)dx = 1 The probability density function with maximum ''H_c'' subject to these constraints is :p(x|I) = \frac{1}{Z(\lambda_1,\cdots, \lambda_m)} m(x)\exp\left[\lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)\right] with normalization constant determined by : Z(\lambda_1,\cdots, \lambda_m) = \int m(x)\exp\left[\lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)\right]dx As in the discrete case, the values of the λk parameters are determined by the constraints according to :F_k = -\frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\cdots, \lambda_m) The invariant measure function ''m''(''x'') can be best understood by supposing that ''x'' is known to take values only in the bounded interval (''a'', ''b''), and that no other information is given. Then the maximum entropy probability density function is : p(x|I) = A \cdot m(x), \qquad a < x < b where ''A'' is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory. === Examples === For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions. == Justifications for the principle of maximum entropy == Proponents of the principle of maximum entropy justify its use in assigning epistemic probabilities in several ways, including the following two arguments. These arguments take the use of epistemic probability as given, and thus have no force if the concept of epistemic probability is itself under question. === Information entropy as a measure of 'uninformativeness' === Consider a discrete epistemic ''probability distribution'' among ''m'' mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, log ''m''. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is from zero (completely informative) to log ''m'' (completely uninformative). By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we ''do'' possess. Thus the maximum entropy distribution is the only reasonable distribution. === The Wallis derivation === The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962 (Jaynes, 2003). It is essentially the same mathematical argument used for the derivation of the partition function in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed ''a priori'', but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way. Suppose an individual wishes to make an epistemic probability assignment among ''m'' mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute ''N'' quanta of epistemic probability (each worth 1/''N'') at random among the ''m'' possibilities. (One might imagine that she will throw ''N'' balls into ''m'' buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be :p_i = \frac{n_i}{N} where ''ni'' is the number of quanta that were assigned to the ''i''th proposition. Now, in order to reduce the 'graininess' of the epistemic probability assignment, it will be necessary to use quite a large number of quanta of epistemic probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, our protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution, :Pr(\mathbf{p}) = W \cdot m^{-N} where :W = \frac{N!}{n_1 !n_2 !...n_m!} is sometimes known as the multiplicity of the outcome. The most probable result is the one which maximizes the multiplicity ''W''. Rather than maximizing ''W'' directly, our protagonist could equivalently maximize any monotonic increasing function of ''W''. She decides to maximize :\begin{matrix}\frac{1}{N}\log W &=& \frac{1}{N}\log \frac{N!}{n_1 !n_2 !...n_m!}\qquad\qquad\qquad\qquad\qquad \\ \\ \ &=& \frac{1}{N}\log \frac{N!}{Np_1 !Np_2 !...Np_m!} \qquad\qquad\qquad\qquad\\ \\ \ &=& \frac{1}{N}\left( \log N! - \sum_{i=1}^m \log Np_i! \right) \qquad\qquad\end{matrix} At this point, in order simplify the expression, our protagonist takes the limit as ''N'' → ∞, i.e. as the epistemic probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, she finds :\begin{matrix}\lim_{N \to \infty}\left(\frac{1}{N}\log W\right) &=& \frac{1}{N}\left( N\log N - \sum_{i=1}^m Np_i\log Np_i \right)\qquad\qquad\qquad\qquad \\ \\ \ &=& \log N - \sum_{i=1}^m p_i\log Np_i \qquad\qquad\qquad\qquad\qquad\qquad \\ \\ \ &=& \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \qquad\qquad\qquad \\ \\ \ &=& \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \qquad\qquad\qquad \\ \\ \ &=& - \sum_{i=1}^m p_i\log p_i \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad \\ \\ \ &=& H(\mathbf{p}) \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad \end{matrix} All that remains for our protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all "fair" random epistemic distributions, in the limit as the probability levels go from discrete to continuous. == References == *Jaynes, E. T., 1963, 'Information Theory and Statistical Mechanics', in Statistical Physics, K. Ford (ed.), Benjamin, New York, p. 181. Available [http://bayes.wustl.edu/etj/node1.html here]. *Jaynes, E. T., 1968, 'Prior Probabilities', IEEE Trans. on Systems Science and Cybernetics, SSC-4, 227. Available [http://bayes.wustl.edu/etj/node1.html here]. *Jaynes, E. T., 2003, 'Probability Theory: The Logic of Science', Cambridge University Press. ==External links == * Another very useful, easy-to-read tutorial which complements this article is :Ratnaparkhi A. "A simple introduction to maximum entropy models for natural language processing" Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, 1997. Available [http://citeseer.ist.psu.edu/128751.html here]. * [http://homepages.inf.ed.ac.uk/s0450736/maxent.html Maximum Entropy Modeling] :This page contains pointers to various papers and software implementations of Maximum Entropy Model on the net. information theory

Principle of maximum entropy



I find this discussion very doctrinaire and probably incomprehensible to most mathematicians for lack of context; maybe I'll do some more substantive editing later. User:Michael Hardy 17:46 Mar 30, 2003 Doctrinaire - a person inflexibly attached to a practice or theory without regard to its practicality. [http://dictionary.reference.com/search?q=doctrinaire Online dictionary definition]. Hey, I just tried to describe what it is - whether or not it's valid is an issue that requires its own subsection. Since most of what I've read on the subject of the validity of PME was written by its proponents, I have the information to give only one side of the story (from as N a POV as I can manage).User:Cyan 07:37 Apr 1, 2003 (UTC) I don't claim to be a mathematician, and yet with a few terms of calculus and discrete math under my belt I find this presentation to be very accessible. I don't see how it can be made any more accessible without sacrificing content. I learned a few new things from this page (ie proving that ME solution is also ML solution) that I haven't come across when browsing papers on maxent. I have some knowledge on how the algorithms that approximate maximum entropy solution work (the GIS and the IIS), if there's demand for it, perhaps I should post some info? User:yaroslavvb Jun 3, 2003 (PST) :Absolutely. But the PME page is already rather long. I suggest you create a new page (or pages) for these algorithms, and provide links to and from the PME page. User:Cyan 04:35 5 Jun 2003 (UTC) See also the second rule 25 on [http://meta.wikipedia.org/wiki/Wikipedia_Anti-rules Wikipedia Anti-Rules]. User:Cyan 21:53 Apr 3, 2003 (UTC) What I meant by "doctrinaire" is that it imitates closely the language of Edwin Jaynes and may be incomprehensible to those unfamiliar with Jaynes' writings. One of these days I'll edit this article, but for the Time Being I have other obligations. User:Michael Hardy 01:36 Apr 4, 2003 (UTC)


See other meanings of words starting from letter:

P

PA | PB | PC | PD | PE | PF | PG | PH | PI | PJ | PK | PL | PM | PN | PO | PR | PS | PT | PU | PW | PX | PY | PZ |

Words begining with Principle_of_maximum_entropy:

Principle_of_Maximum_Entropy
Principle_of_maximum_entropy
Principle_of_maximum_entropy


These materials are based on Wikipedia and licensed under the GNU FDL



YouTube.com videos better site than Turbo Tax 2007
encyklopedia online