Learning maximum likelihood estimation by fitting logistic regression ‘by hand’ (sort of)

Julia

Learning Out Loud

Maximum Likelihood

Logistic Regression

Published

September 28, 2022

In a previous post, I did some “learning out loud” by practicing estimating a few models via maximum likelihood by hand. In this short blog, I figured I could extend this learning by applying what I learned previously to logistic regression.

As a reminder, the point of these “learning out loud” posts is to give myself a medium to work through concepts. Hopefully these metacognitive exercises will benefits others, too. The concepts I’m covering here are things that I’m either learning anew or brushing back up on after not using for a while. But either way, I’m not trying to portray myself as an expert. If you are an expert and you notice I’m doing something wrong, I’d love to hear from you!

Stating the Problem

So, what I want to do here is get point estimates for the coefficients in a logistic regression model “by hand” (or mostly by hand). I’m going to be doing this in Julia, because I’m also interested in getting better at Julia stuff, but obviously the concepts are the same across any programming language.

Setup

First, we’ll load the libraries we’re using here and set a seed:

usingGLM #to check my work againstusingDistributions #for the Bernoulli distributionusingRandom #to set a seedusingOptim #to do the acutal optimizingusingStatistics #mean and stdusingRDatasets #to get dataRandom.seed!(0408)

TaskLocalRNG()

Load and Preprocess Data

Next, we’ll load in some data and do some light preprocessing. We’ll use the Default data from the RDatasets package, which presents features describing a given person as well as a binary indicator of whether they defaulted on a credit card payment.

After loading the data, we’ll pull out the default variable, dummy code it, and then assign it to a vector called y. We’ll also select just the “balance” and “income” columns of the data and assign those to X. There are other columns we could use as predictors, but that’s not really the point here.

data = RDatasets.dataset("ISLR", "Default")y = [r.Default =="Yes" ? 1:0 for r ineachrow(data)]X = data[:, [:Balance, :Income]]

10,000 rows × 2 columns

Balance

Income

Float64

Float64

1

729.526

44361.6

2

817.18

12106.1

3

1073.55

31767.1

4

529.251

35704.5

5

785.656

38463.5

6

919.589

7491.56

7

825.513

24905.2

8

808.668

17600.5

9

1161.06

37468.5

10

0.0

29275.3

11

0.0

21871.1

12

1220.58

13268.6

13

237.045

28251.7

14

606.742

44994.6

15

1112.97

23810.2

16

286.233

45042.4

17

0.0

50265.3

18

527.54

17636.5

19

485.937

61566.1

20

1095.07

26464.6

21

228.953

50500.2

22

954.262

32457.5

23

1055.96

51317.9

24

641.984

30466.1

25

773.212

34353.3

26

855.009

25211.3

27

643.0

41473.5

28

1454.86

32189.1

29

615.704

39376.4

30

1119.57

16556.1

⋮

⋮

⋮

Next, we’ll z_score the predictor variables, convert them to a matrix, and append a column vector of ones to the matrix (so we can estimate the intercept). The mapcols() function from DataFrames.jl will apply the z_score function to all of the columns in X, which is actually only 2 in this case.

First we’ll define a z-score function

functionz_score(x) u =mean(x) s =std(x) res =Float64[]for i in1:lastindex(x) tmp = (x[i] - u) / spush!(res, tmp)endreturn resend

Next, we’ll write a logistic function that will implement the logistic transformation. This is built into the StatsFuns.jl package, but I want to write it out by hand to reinforce what it is. We’ll use this to predict y values with a given input (which will actually be X*\(\beta\))

my_logistic(x) =exp(x) / (1+exp(x))

my_logistic (generic function with 1 method)

Define a Maximum Likelihood Estimator

Now that we have some data, we can write a function that uses maximum likelihood estimation to give us the best \(\beta\) parameters for our given X and y. If you want to brush up on maximum likelihood, you can read my previous “learning out loud” post, or you can probably find materials written by someone who knows way more than I do. Either way, I’m not going to recap what MLE is here.

Let’s define our function that we’ll use to estimate \(\beta\). The important thing to keep in mind is that the return value of this function isn’t the \(\beta\) values, but rather the negative log likelihood, since this is what we we want to optimize.

functionml_logreg(x, y, b) ŷ =my_logistic.(x*b) res =Float64[]for i in1:lastindex(y)push!(res, logpdf(Bernoulli(ŷ[i]), y[i]))end ret =-sum(res)return retend

ml_logreg (generic function with 1 method)

So what’s going on in this code?

We’re getting \(ŷ\) estimates for a given x and b by running them through the my_logistic() function. This will give us a 10000x1 vector

We’re instantiating an empty vector that will (eventually) contain Float64 values.

For each index in \(ŷ\) (i.e. 1 through 10000), we’re getting the log-likelihood of the true outcome (y[i]) given a Bernoulli distribution parameterized by success rate \(ŷ\)[i].

I think this is the trickiest part of the whole problem, so I want to put it into words to make sure I understand it. In our problem, our y values are either 0 or 1. And the output of the my_logistic() function is going to be, for each y, a predicted probability that \(y = 1\), i.e. a predicted success rate. Since a Bernoulli distribution is parameterized by a given success rate and models the outcome of a single yes/no (1/0) trial, it makes sense to use this to generate the likelihoods we want to maximize.

More concretely, the likelihoods we get will be dependent on:

the provided success rate p, and

the actual outcome

Where values of p that are closer to the actual outcome will be larger:

logpdf(Bernoulli(.5), 1)

-0.6931471805599453

#will be larger than the previouslogpdf(Bernoulli(.8), 1)

-0.2231435513142097

#will be even largerlogpdf(Bernoulli(.99), 1)

-0.01005033585350145

And inversely, you can imagine that if the outcome were 0, we’d want our predicted success rate to be very low.

Returning to our ml_logreg() function, what we’re doing then is applying this logic to all of our \(ŷ\) and corresponding y values (i.e. we’re getting the likelihood of y for a given \(ŷ\)), and then we’re creating a vector with all of these likelihoods – that’s what the push!(...) notation is doing – pushing these likelihoods to the empty float vector we created.

Finally, we’re summing all of our likelihoods and then multiplying the result by negative one, since the optimizer we’re using actually wants to minimize a loss function rather than maximize a loss function.

We can run this function by providing any X, y, and \(\beta\), and it’ll give us back a negative loglikelihood – the negative sum of all of the individual likelihoods.

#just some arbitrary numbers to test the function withstart_vals = [.1, .1, .1]ml_logreg(Xz, y, start_vals)

7372.506385031871

Optimize \(\beta\)

So the above gives us the likelihood for a starting value of \(\beta\), but we want to find the best values of \(\beta\). To do that, we can optimize the function. Like I said in my previous post, the optimizers are written by people much smarter than I am, so I’m just going to use that package rather than futz around with doing any, like, calculus by hand – although maybe that’s a topic for a later learning out loud post.

res =optimize(b ->ml_logreg(Xz, y, b), start_vals)

* Status: success
* Candidate solution
Final objective value: 7.894831e+02
* Found with
Algorithm: Nelder-Mead
* Convergence measures
√(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 103
f(x) calls: 190

And then we can get the \(\beta\) coefficients that minimize the loss function (i.e. that maximize the likelihood)