Maximum likelihood estimation (MLE) is a popular statistical method used to make inferences about the parameters of the probability distribution underlying a given data set.

The method was pioneered by the geneticist and statistician Ronald Fisher between 1912 and 1922 (see the external resources below for more information on the history of MLE).

==Prerequisites==
The following discussion assumes that the reader is familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expected value. It also assumes familiarity with standard techniques for maximising continuous real-valued functions, such as using differentiation to find a stationary point.

==The philosophy of MLE==
Given a discrete probability distribution <math>D</math> with known probability mass function <math>f_D</math> and distributional parameter <math>\theta</math>, we may draw a sample <math>X_1, X_2, \ldots, X_n</math> of <math>n</math> values from this distribution and then, using the mass function, compute the probability associated with our observed data:

:<math>P(x_1, x_2, \ldots, x_n) = f_D(x_1, \ldots, x_n \mid \theta)</math>

However, it may be that we don't know the value of the parameter <math>\theta</math> despite knowing (or believing) that our data comes from the distribution <math>D</math>. How should we estimate <math>\theta</math>? It is a sensible idea to draw a sample of <math>n</math> values <math>X_1, X_2, \ldots, X_n</math> and use this data to help us make an estimate.

Once we have our sample <math>X_1, X_2, \ldots, X_n</math>, we may seek an estimate of the value of <math>\theta</math> from that sample. MLE seeks the most likely value of the parameter <math>\theta</math> (i.e. we maximise the ''likelihood'' of the observed data set over all possible values of <math>\theta</math>). This is in contrast to seeking other estimators, such as an unbiased estimator of <math>\theta</math>, which may not necessarily yield the most likely value of <math>\theta</math> but which will yield a value that (on average) will neither over-estimate nor under-estimate the true value of <math>\theta</math>.
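To implement the MLE method mathematically, we define the likelihood as the mass function evaluated at the observed data and viewed as a function of the parameter <math>\theta</math>:

:<math>\mbox{lik}(\theta) = f_D(x_1, \ldots, x_n \mid \theta)</math>

and maximise this function over all possible values of the parameter <math>\theta</math>. The value <math>\hat{\theta}</math> which maximises the likelihood is known as the maximum likelihood estimator (MLE) for <math>\theta</math>.

===Notes===
*The likelihood is a function of <math>\theta</math> for fixed values of <math>x_1, x_2, \ldots, x_n</math>.
*The maximum likelihood estimator may not be unique, or indeed may not even exist.
*The above definition also applies to continuous probability distributions if we think of <math>f_D</math> as the probability density function instead of the probability mass function.

==Examples==
===Discrete distribution, discrete and finite parameter space===
Consider tossing an unfair coin 80 times (i.e. we sample something like <math>x_1 = \mbox{H}</math>, <math>x_2 = \mbox{T}</math>, <math>\ldots</math>, <math>x_{80} = \mbox{T}</math> and count the number of HEADS observed). Call the probability of tossing a HEAD <math>p</math>, and the probability of tossing TAILS <math>1 - p</math> (so here <math>p</math> is the parameter which we referred to as <math>\theta</math> above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability <math>p = 1/3</math>, one which gives HEADS with probability <math>p = 1/2</math> and another which gives HEADS with probability <math>p = 2/3</math>. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin it was most likely to have been, given the data that we observed. The likelihood function (defined above) takes one of three values:

::<math>P(\mbox{49 HEADS} \mid p = 1/3) = \binom{80}{49} (1/3)^{49} (1 - 1/3)^{31} \approx 0.000</math>
::<math>P(\mbox{49 HEADS} \mid p = 1/2) = \binom{80}{49} (1/2)^{49} (1 - 1/2)^{31} \approx 0.012</math>
::<math>P(\mbox{49 HEADS} \mid p = 2/3) = \binom{80}{49} (2/3)^{49} (1 - 2/3)^{31} \approx 0.054</math>

We see that the likelihood is maximised by the parameter value <math>\hat{p} = 2/3</math>, and so this is our ''maximum likelihood estimate'' for <math>p</math>.

===Discrete distribution, continuous parameter space===
Now suppose our special box of coins from the previous example contains an infinite number of coins: one for every possible value <math>0 \le p \le 1</math>. We must maximise the likelihood function:

:<math>\mbox{lik}(p) = f_D(\mbox{49 HEADS} \mid p) = \binom{80}{49} p^{49} (1 - p)^{31}</math>

over all possible values <math>0 \le p \le 1</math>.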
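Both coin examples can be checked numerically before any calculus is done. The following is a minimal sketch rather than a definitive implementation: the function name <code>binomial_likelihood</code>, the grid step of 1/8000 and the three candidate head probabilities 1/3, 1/2 and 2/3 are assumptions made for illustration. It evaluates the binomial likelihood of 49 HEADS in 80 tosses, first over the three candidate coins and then over a fine grid standing in for the continuous range of <math>p</math>.

```python
from math import comb

def binomial_likelihood(p, heads=49, tosses=80):
    """Probability of observing `heads` HEADS in `tosses` tosses when P(HEAD) = p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# Finite parameter space: three unlabelled coins (candidate values assumed here).
candidates = [1/3, 1/2, 2/3]
likelihoods = {p: binomial_likelihood(p) for p in candidates}
best_coin = max(likelihoods, key=likelihoods.get)

# Continuous parameter space, approximated by a grid on [0, 1].
# The step 1/8000 is chosen so that 49/80 lies exactly on the grid.
grid = [k / 8000 for k in range(8001)]
best_grid = max(grid, key=binomial_likelihood)

print(best_coin, best_grid)
```

Among the three candidates the likelihood is largest for the coin with <math>p = 2/3</math>, while the grid search settles on <math>p = 0.6125 = 49/80</math>, the observed proportion of HEADS. A grid search only approximates the maximiser in general; it is exact here because the grid was built to contain 49/80.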
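One may maximise the likelihood function by differentiating with respect to <math>p</math> and setting the derivative to zero. After dividing out the constant binomial coefficient this gives:

:<math>0 = 49 p^{48} (1 - p)^{31} - 31 p^{49} (1 - p)^{30} = p^{48} (1 - p)^{30} \left[ 49 (1 - p) - 31 p \right]</math>

[[image:BinominalLikelihoodGraph.png|thumb|200px|Likelihood of different proportion parameter values for a binomial process with ''t'' = 3 and ''n'' = 10; the ML estimator occurs at the mode (statistics) with the peak (maximum) of the curve.]]

which has solutions <math>p = 0</math>, <math>p = 1</math> and <math>p = 49/80</math>. The solution which maximises the likelihood is clearly <math>p = 49/80</math> (since <math>p = 0</math> and <math>p = 1</math> result in a likelihood of zero). Thus we say the ''maximum likelihood estimator'' for <math>p</math> is <math>\hat{p} = 49/80</math>.

This result is easily generalised by substituting a letter such as <math>t</math> in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as <math>n</math> in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the ''maximum likelihood estimator'':

:<math>\hat{p} = \frac{t}{n}</math>

for any sequence of <math>n</math> Bernoulli trials resulting in <math>t</math> 'successes'.

===Continuous distribution, continuous parameter space===
One of the most common continuous probability distributions is the Normal distribution, which has probability density function:

:<math>f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)</math>

The corresponding density function for a sample of <math>n</math> independent identically distributed normal random variables is:

:<math>f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} \right)</math>

or more conveniently, writing <math>\bar{x}</math> for the sample mean:

:<math>f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2} \right)</math>

This distribution has two parameters: <math>\mu</math> and <math>\sigma^2</math>. This may be alarming to some, given that in the discussion above we only talked about maximising over a single parameter. However there is no need for alarm: we maximise the likelihood over both parameters, here by differentiating with respect to each in turn, which is more work but no more complicated. In the above notation we would write <math>\theta = (\mu, \sigma^2)</math>.

When maximising the likelihood, we may equivalently maximise the natural logarithm of the likelihood, since the logarithm is a continuous, strictly increasing function over the range of the likelihood. [Note: the log-likelihood is closely related to information entropy and Fisher information.]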
Taking logarithms often simplifies the algebra somewhat, and indeed does so in this case. Writing <math>\bar{x}</math> for the sample mean, we differentiate the log-likelihood with respect to <math>\mu</math> and set the result to zero:

:<math>0 = \frac{\partial}{\partial \mu} \left[ \frac{n}{2} \log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2} \right] = 0 - \frac{-2n (\bar{x} - \mu)}{2\sigma^2} = \frac{n (\bar{x} - \mu)}{\sigma^2}</math>

which is solved by <math>\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i</math>. This is indeed the maximum of the function, since it is the only turning point in <math>\mu</math> and the second derivative is strictly less than zero.

Similarly we differentiate with respect to <math>\sigma</math> and equate to zero to obtain the maximum of the likelihood at <math>\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2</math>. This is left as an exercise to the reader.

Formally we say that the maximum likelihood estimator for <math>\theta = (\mu, \sigma^2)</math> is:

:<math>\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2) = \left( \bar{x}, \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)</math>.

==Properties==
===Functional invariance===
If <math>\hat{\theta}</math> is the maximum likelihood estimator (MLE) for <math>\theta</math>, then the MLE for <math>\alpha = g(\theta)</math> is <math>\hat{\alpha} = g(\hat{\theta})</math> (provided the function <math>g</math> is one-to-one).

===Asymptotic behaviour===
Maximum likelihood estimators achieve minimum variance (as given by the Cramér-Rao lower bound) in the limit as the sample size tends to infinity. When the MLE is unbiased, we may equivalently say that it has minimum mean squared error in the limit.

===Bias===
The bias of maximum-likelihood estimators can be substantial. Consider a case where ''n'' tickets numbered from 1 to ''n'' are placed in a box and one is selected at random (''see uniform distribution''). If ''n'' is unknown, then the maximum-likelihood estimator of ''n'' is the value on the drawn ticket, even though the expectation of that value is only <math>(n + 1)/2</math>. In estimating the highest number ''n'', we can only be certain that it is greater than or equal to the drawn ticket number.

==See also==
* The mean squared error is a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
* The article on the Rao-Blackwell theorem, for a discussion on finding the best possible unbiased estimator (in the sense of having minimal mean squared error) by a process called Rao-Blackwellisation. The MLE is often a good starting place for the process.
* The reader may be intrigued to learn that the MLE (if it exists) will always be a function of a sufficient statistic for the parameter in question.
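As a numerical companion to the normal-distribution example, the closed-form estimates (the sample mean, and the average squared deviation from it) can be compared against the log-likelihood directly. The sketch below is illustrative only: the function names <code>normal_mle</code> and <code>normal_log_likelihood</code>, the simulated data, the seed and the true values <math>\mu = 5</math>, <math>\sigma = 2</math> are all assumptions, not part of the derivation.

```python
import random
from math import log, pi
from statistics import fmean

def normal_mle(sample):
    """Closed-form ML estimates (mu_hat, sigma2_hat) for i.i.d. normal data."""
    n = len(sample)
    mu_hat = fmean(sample)
    # The ML variance estimate divides by n, not by n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

def normal_log_likelihood(sample, mu, sigma2):
    """Log of the joint normal density of the sample at parameters (mu, sigma2)."""
    n = len(sample)
    squared_error = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * log(2 * pi * sigma2) - squared_error / (2 * sigma2)

# Simulated data with assumed true parameters mu = 5, sigma = 2.
random.seed(0)
sample = [random.gauss(5.0, 2.0) for _ in range(10_000)]

mu_hat, sigma2_hat = normal_mle(sample)
at_mle = normal_log_likelihood(sample, mu_hat, sigma2_hat)
nearby = normal_log_likelihood(sample, mu_hat + 0.1, sigma2_hat * 1.1)
print(mu_hat, sigma2_hat, at_mle > nearby)
```

Moving either parameter away from the closed-form estimates lowers the log-likelihood, which is what the derivative argument predicts. Dividing by ''n'' rather than ''n'' − 1 is deliberate: it is the maximum likelihood estimate of the variance, not the unbiased one.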
==External resources==
* [http://projecteuclid.org/Dienst/UI/1.0/Summarize/euclid.ss/1030037906 A paper detailing the history of maximum likelihood, written by John Aldrich]

I removed this from the article, until it can be made more NPOV and more encyclopedic. It currently reads more like a list of observations than true encyclopedic content and needs more explanation. --User:Lexor|User talk:Lexor 07:25, 5 Aug 2004 (UTC)

:''Maximum likelihood is one of the main methods used by frequentist (i.e. non-Bayesian) statisticians. Bayesian arguments against the ML and other point estimation methods are that''
:* ''all the information contained in the data is in the likelihood function, so why use just the maximum? Bayesian methods use ALL of the likelihood function and this is why they are optimal.''
:* ''ML methods have good asymptotic properties (consistency and attainment of the Cramer-Rao lower bound) but there is nothing to recommend them for analysis of small samples''
:* ''the method doesn't work so well with distributions that have many modes or unusual shapes. Apart from the practical difficulties of getting stuck in local modes, there is the difficulty of interpreting the output, which consists of a point estimate plus standard error. Suppose you have a distribution for a quantity that can only take positive values, and your ML estimate for the mean comes out at 1.0 with a standard error of 3? Bayesian methods give you the entire posterior distribution as output, so you can make sense of it and then decide what summaries are appropriate.''

This whole article reads like an 80's highschool textbook. As a matter of fact, a lot of Wikipedia's articles on difficult-to-understand subjects read like they're out of an 80's highschool textbook, making them only useful to people who already know the subject back to front, making the entire wikipedia project a failure.
:Clarity issues in a statistics article (a subject that is less than clear) should not be used to make the inference that "the entire wikipedia project a failure". User:Rschulz 23:56, 1 Mar 2005 (UTC)

:This article was never intended to be accessible to secondary school students generally (although certainly there are those among them who would understand it). I would consider this article successful if mathematicians who know probability theory but do not know statistics can understand it. And I think by that standard it is fairly successful, although more examples and more theory could certainly be added. If someone can make this particular topic comprehensible to most people who know high-school mathematics, through first-year calculus or perhaps through the prerequisites to first-year calculus, I would consider that a substantially greater achievement. But that would take more work. User:Michael Hardy 00:40, 10 Jan 2005 (UTC)

::As a university student learning statistics, I think this article needs improvement. It would be good if a graph of the likelihood of different parameter values for p was added (with the maximum pointed out) to the example. This addition would require adding some specific data to the example. Also, the example should be separated from the discussion about MLE, to make sure people understand that the binomial distribution is only used for this case. The reasons why it is good to take the log of the likelihood are not discussed. Further, the discussion about what makes a good estimator (and how MLE is related to other estimators) could be expanded.
::User:Rschulz 23:56, 1 Mar 2005 (UTC)

:a user in CA: Good article, don't be so hard on the author(s). Of course it could be better, but most of us have day jobs. I would change the notation, though, as this was confusing: ''"The value (lower-case) x/n observed in a particular case is an estimate; the random variable (Capital) X/n is an estimator."'' seems to conflict with the excellent example at the end for finding the maximum likelihood x/n in a binomial distribution of x voters in a sample of n (without replacement). Now, next question: can anybody explain the Viterbi algorithm to a high-schooler? 01 March 2004

:I don't see the conflict. The lower-case ''x'' in the example at the end is not a random variable, but an observed realization. User:Michael Hardy 22:50, 2 Mar 2005 (UTC)
These materials are based on Wikipedia and licensed under the GNU FDL