MLE Learning Out Loud 2: Logistic Regression

Learning maximum likelihood estimation by fitting logistic regression ‘by hand’ (sort of)

Categories: Julia, Learning Out Loud, Maximum Likelihood, Logistic Regression

Published: September 28, 2022

In a previous post, I did some “learning out loud” by practicing estimating a few models via maximum likelihood by hand. In this short blog, I figured I could extend this learning by applying what I learned previously to logistic regression.

As a reminder, the point of these “learning out loud” posts is to give myself a medium to work through concepts. Hopefully these metacognitive exercises will benefit others, too. The concepts I’m covering here are things that I’m either learning anew or brushing back up on after not using them for a while. Either way, I’m not trying to portray myself as an expert. If you are an expert and you notice I’m doing something wrong, I’d love to hear from you!

Stating the Problem

So, what I want to do here is get point estimates for the coefficients in a logistic regression model “by hand” (or mostly by hand). I’m going to be doing this in Julia, because I’m also interested in getting better at Julia stuff, but obviously the concepts are the same across any programming language.
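Just to pin down what we’re estimating: for each person \(i\), we observe a binary outcome \(y_i \in \{0, 1\}\) and a row of predictors \(x_i\), and logistic regression models the probability of a 1 as

\[
P(y_i = 1 \mid x_i) = \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}}
\]

Maximum likelihood estimation then amounts to finding the \(\beta\) that makes the 0s and 1s we actually observed as probable as possible under this model.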

Setup

First, we’ll load the libraries we’re using here and set a seed:

using GLM #to check my work against
using Distributions #for the Bernoulli distribution
using Random #to set a seed
using Optim #to do the actual optimizing
using Statistics #mean and std
using RDatasets #to get data

Random.seed!(0408)
TaskLocalRNG()

Load and Preprocess Data

Next, we’ll load in some data and do some light preprocessing. We’ll use the Default data from the RDatasets package, which presents features describing a given person as well as a binary indicator of whether they defaulted on a credit card payment.

After loading the data, we’ll pull out the default variable, dummy code it, and then assign it to a vector called y. We’ll also select just the “balance” and “income” columns of the data and assign those to X. There are other columns we could use as predictors, but that’s not really the point here.

data = RDatasets.dataset("ISLR", "Default")

y = [r.Default == "Yes" ? 1 : 0 for r in eachrow(data)]

X = data[:, [:Balance, :Income]]

10,000 rows × 2 columns

 Row │ Balance   Income
     │ Float64   Float64
─────┼─────────────────────
   1 │  729.526  44361.6
   2 │  817.18   12106.1
   3 │ 1073.55   31767.1
   4 │  529.251  35704.5
   5 │  785.656  38463.5
   6 │  919.589   7491.56
   7 │  825.513  24905.2
   8 │  808.668  17600.5
   9 │ 1161.06   37468.5
  10 │    0.0    29275.3
  11 │    0.0    21871.1
  12 │ 1220.58   13268.6
  13 │  237.045  28251.7
  14 │  606.742  44994.6
  15 │ 1112.97   23810.2
  16 │  286.233  45042.4
  17 │    0.0    50265.3
  18 │  527.54   17636.5
  19 │  485.937  61566.1
  20 │ 1095.07   26464.6
  21 │  228.953  50500.2
  22 │  954.262  32457.5
  23 │ 1055.96   51317.9
  24 │  641.984  30466.1
  25 │  773.212  34353.3
  26 │  855.009  25211.3
  27 │  643.0    41473.5
  28 │ 1454.86   32189.1
  29 │  615.704  39376.4
  30 │ 1119.57   16556.1
   ⋮ │    ⋮         ⋮

Next, we’ll z-score the predictor variables, convert the result to a matrix, and append a column of ones to that matrix (so we can estimate the intercept). The mapcols() function from DataFrames.jl will apply our z_score function to every column of X (just two columns, in this case).

First, we’ll define a z-score function:

function z_score(x)
    #compute the mean and standard deviation of the input vector
    u = mean(x)
    s = std(x)

    #standardize each element: (value - mean) / sd
    res = Float64[]
    for i in 1:lastindex(x)
        tmp = (x[i] - u) / s
        push!(res, tmp)
    end

    return res
end
z_score (generic function with 1 method)
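(As an aside, broadcasting would let us write the same thing in one line. Here’s a sketch – z_score_v is just a name I’m making up, and it should return the same vector as the loop version above:)

z_score_v(x) = (x .- mean(x)) ./ std(x)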

And then we’ll actually apply it.

Xz = hcat(ones(length(y)), Matrix(mapcols(z_score, X)))
10000×3 Matrix{Float64}:
 1.0  -0.218824    0.813147
 1.0  -0.037614   -1.60542
 1.0   0.492386   -0.131206
 1.0  -0.632861    0.164023
 1.0  -0.102786    0.370897
 1.0   0.174098   -1.95142
 1.0  -0.0203871  -0.645722
 1.0  -0.0552131  -1.19344
 1.0   0.673295    0.296293
 1.0  -1.727      -0.31805
 1.0  -1.727      -0.873227
 1.0   0.796355   -1.51825
 1.0  -1.23695    -0.394799
 ⋮                
 1.0  -1.727       0.616625
 1.0   0.338849   -1.01252
 1.0  -0.957166   -0.610505
 1.0  -0.36504     1.59599
 1.0   0.571147    0.897805
 1.0   0.213889    1.73331
 1.0  -1.37056    -1.39173
 1.0  -0.255977    1.46029
 1.0  -0.160036   -1.03896
 1.0   0.02075     1.88347
 1.0   1.51667     0.236351
 1.0  -1.31163    -1.24874

Define a Logistic Function

Next, we’ll write a function that implements the logistic transformation. This is built into the StatsFuns.jl package, but I want to write it out by hand to reinforce what it is. We’ll use this to predict y values for a given input (which will actually be X*\(\beta\)).
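In symbols, the transformation is

\[
\text{logistic}(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}
\]

which squashes any real number into the (0, 1) interval, so we can treat the output as a probability.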

my_logistic(x) = exp(x) / (1 + exp(x))
my_logistic (generic function with 1 method)

Define a Maximum Likelihood Estimator

Now that we have some data, we can write a function that uses maximum likelihood estimation to give us the best \(\beta\) parameters for our given X and y. If you want to brush up on maximum likelihood, you can read my previous “learning out loud” post, or you can probably find materials written by someone who knows way more than I do. Either way, I’m not going to recap what MLE is here.

Let’s define the function we’ll use to estimate \(\beta\). The important thing to keep in mind is that the return value of this function isn’t the \(\beta\) values, but rather the negative log-likelihood, since that’s the quantity we want to optimize (well, minimize).

function ml_logreg(x, y, b)

    #predicted probabilities for the current coefficient vector b
    ŷ = my_logistic.(x*b)
    res = Float64[]

    #log-likelihood of each observed y given its predicted probability
    for i in 1:lastindex(y)
        push!(res, logpdf(Bernoulli(ŷ[i]), y[i]))
    end

    #negative log-likelihood: the quantity the optimizer will minimize
    ret = -sum(res)

    return ret
end
ml_logreg (generic function with 1 method)

So what’s going on in this code?

  1. We’re getting \(ŷ\) estimates for a given x and b by running them through the my_logistic() function. This gives us a 10,000-element vector of predicted probabilities.
  2. We’re instantiating an empty vector that will (eventually) contain Float64 values.
  3. For each index in \(ŷ\) (i.e. 1 through 10,000), we’re getting the log-likelihood of the true outcome (y[i]) given a Bernoulli distribution parameterized by success rate \(ŷ\)[i].

I think this is the trickiest part of the whole problem, so I want to put it into words to make sure I understand it. In our problem, our y values are either 0 or 1. And the output of the my_logistic() function is going to be, for each y, a predicted probability that \(y = 1\), i.e. a predicted success rate. Since a Bernoulli distribution is parameterized by a given success rate and models the outcome of a single yes/no (1/0) trial, it makes sense to use this to generate the likelihoods we want to maximize.
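In symbols, the quantity that logpdf(Bernoulli(ŷ[i]), y[i]) computes is

\[
\log f(y_i \mid \hat{y}_i) = y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)
\]

which reduces to \(\log(\hat{y}_i)\) when \(y_i = 1\) and to \(\log(1 - \hat{y}_i)\) when \(y_i = 0\).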

More concretely, the likelihoods we get will be dependent on:

  1. the provided success rate p, and
  2. the actual outcome

where the log-likelihood will be larger when p is closer to the actual outcome:

logpdf(Bernoulli(.5), 1)
-0.6931471805599453
#will be larger than the previous
logpdf(Bernoulli(.8), 1)
-0.2231435513142097
#will be even larger
logpdf(Bernoulli(.99), 1)
-0.01005033585350145

And inversely, you can imagine that if the outcome were 0, we’d want our predicted success rate to be very low.
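For instance, mirroring the examples above but with the outcome flipped to 0 (the values in the comments are just log(0.1) and log(0.9)):

#a high predicted success rate is a poor fit for an observed 0
logpdf(Bernoulli(.9), 0) #log(0.1) ≈ -2.303
#a low predicted success rate fits much better
logpdf(Bernoulli(.1), 0) #log(0.9) ≈ -0.105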

Returning to our ml_logreg() function, what we’re doing then is applying this logic to all of our \(ŷ\) and corresponding y values (i.e. we’re getting the log-likelihood of each y given its \(ŷ\)), and then we’re collecting all of these log-likelihoods in a vector – that’s what the push!(...) call is doing – pushing each log-likelihood onto the empty float vector we created.

Finally, we’re summing all of our log-likelihoods and then multiplying the result by negative one, since the optimizer we’re using wants to minimize an objective rather than maximize one – and minimizing the negative log-likelihood is the same as maximizing the log-likelihood.

We can run this function by providing any X, y, and \(\beta\), and it’ll give us back a negative log-likelihood – the negative sum of all of the individual log-likelihoods.

#just some arbitrary numbers to test the function with
start_vals = [.1, .1, .1]

ml_logreg(Xz, y, start_vals)
7372.506385031871

Optimize \(\beta\)

So the above gives us the negative log-likelihood for a starting value of \(\beta\), but we want to find the best values of \(\beta\). To do that, we can optimize the function. Like I said in my previous post, the optimizers are written by people much smarter than I am, so I’m just going to use Optim.jl rather than futz around with doing any, like, calculus by hand – although maybe that’s a topic for a later learning out loud post.

res = optimize(b -> ml_logreg(Xz, y, b), start_vals)
 * Status: success

 * Candidate solution
    Final objective value:     7.894831e+02

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    103
    f(x) calls:    190
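(As an aside, since we didn’t tell optimize() anything about gradients, Optim fell back to Nelder-Mead, which is derivative-free. If I’m reading the Optim.jl docs right, we could also request a gradient-based method like L-BFGS and let Optim approximate the gradient numerically – a sketch I haven’t benchmarked here:)

res_lbfgs = optimize(b -> ml_logreg(Xz, y, b), start_vals, LBFGS())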

And then we can get the \(\beta\) coefficients that minimize the negative log-likelihood (i.e. that maximize the likelihood):

Optim.minimizer(res)
3-element Vector{Float64}:
 -6.125561839584853
  2.731586594783831
  0.2775242967112382
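Before comparing against GLM, here’s a quick sketch of how we could use these estimates to get a predicted probability of default for each person in the data (β̂ and p̂ are just names I’m choosing here):

β̂ = Optim.minimizer(res)
p̂ = my_logistic.(Xz * β̂) #one predicted probability per row of Xz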

And just to confirm that we did this correctly, we can check our point estimates against what we’d get if we fit the model using the GLM package.

logreg_res = glm(Xz, y, Binomial())
GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
        Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1  -6.12557    0.187562   -32.66    <1e-99  -6.49318   -5.75795
x2   2.73159    0.109984    24.84    <1e-99   2.51602    2.94715
x3   0.277522   0.0664854    4.17    <1e-04   0.147213   0.407831
─────────────────────────────────────────────────────────────────

Cool beans!
