Using Julia to generate a dataset with a given correlation

Julia, Tutorial, Brief

Author: EE

Published: September 8, 2022

This is going to be a short one. I recently saw a comment on Twitter about an interview question where someone was asked to generate a dataset with variables X and Y that are correlated at *r* = .8, so I figured I’d write out some code that does this as a way to practice Julia a little bit more.

First we load our packages:

```
using Statistics
using Distributions
using CairoMakie #for plotting
using Random #to set a seed
Random.seed!(0408)
```

`TaskLocalRNG()`

The approach here is going to be to define a covariance (correlation) matrix and a vector of means, then define a multivariate normal distribution parameterized by these things. We’ll then use this distribution to generate our data.

First we’ll define \(\Sigma\), which is our covariance matrix. Since we’re generating a dataset with only 2 variables, this will be a 2x2 matrix, where the diagonal elements are 1 and the off-diagonal elements are .8, which is the correlation we want between X and Y. Because both variances are 1, the covariance between X and Y is the same as their correlation.
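The code for this step isn't shown here, so as a minimal sketch (the variable name `Σ` is an assumption), the matrix described above could be written like this:

```julia
# Assumed name: Σ. The diagonal holds the variances (1) and the
# off-diagonal holds the covariance, which equals the target correlation (.8).
Σ = [1.0 0.8;
     0.8 1.0]
```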

Then we’ll define a mean vector. This will be a 2-element vector (one for each variable), but we don’t actually care what the values are here, so let’s just make them 0.
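A one-line sketch of the mean vector, assuming the name `μ`:

```julia
# Assumed name: μ. One mean per variable; the values are arbitrary here, so use 0.
μ = zeros(2)
```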

```
2-element Vector{Float64}:
0.0
0.0
```

Now we can define a distribution given \(\Sigma\) and \(\mu\).
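One way to do this, sketched with `MvNormal` from Distributions.jl (the names `Σ`, `μ`, and `d` are assumptions, repeated here so the snippet runs on its own):

```julia
using Distributions

Σ = [1.0 0.8; 0.8 1.0]  # covariance matrix described earlier
μ = zeros(2)            # mean vector described earlier

# Bivariate normal with mean μ and covariance Σ (assumed name: d)
d = MvNormal(μ, Σ)
```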

And then we can draw a sample from this distribution.
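A sketch of the sampling step, assuming 200 draws (which matches the 2×200 shape of the output) and the assumed name `s`. The exact numbers depend on the seed and the Julia version, so they may not reproduce the values shown here:

```julia
using Distributions
using Random

Random.seed!(0408)  # same seed as in the setup code

d = MvNormal(zeros(2), [1.0 0.8; 0.8 1.0])

# Each column is one observation: row 1 holds X, row 2 holds Y.
s = rand(d, 200)
```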

```
2×200 Matrix{Float64}:
 -1.40556   0.469524  -1.19092  -0.40408   …  -0.244792  0.874835  -0.719764
 -0.595655  1.01141   -1.84189  -0.550097      0.250661  1.72269   -0.862095
```

To confirm this works as expected, we can plot the sample.
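The plotting code and figure aren't reproduced here. A scatter plot along these lines, using CairoMakie, would be one way to draw it (the axis labels and the names `s`, `fig`, and `ax` are assumptions):

```julia
using Distributions
using CairoMakie
using Random

Random.seed!(0408)
s = rand(MvNormal(zeros(2), [1.0 0.8; 0.8 1.0]), 200)

# Scatter X (row 1) against Y (row 2)
fig = Figure()
ax = Axis(fig[1, 1]; xlabel = "X", ylabel = "Y")
scatter!(ax, s[1, :], s[2, :])
fig
```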

It looks like a .8 correlation to me. But to do a final check, we can get the correlation matrix of our sample.
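As a sketch of that check (the name `r` is an assumption), `cor` from the Statistics standard library computes the correlation matrix. Because the variables sit in the rows of the sample, the call needs `dims = 2`:

```julia
using Statistics
using Distributions
using Random

Random.seed!(0408)
s = rand(MvNormal(zeros(2), [1.0 0.8; 0.8 1.0]), 200)

# Variables are stored in rows, so correlate along dims = 2
# (equivalent to calling cor on the transposed matrix, cor(s')).
r = cor(s; dims = 2)
```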

```
2×2 Matrix{Float64}:
1.0 0.769654
0.769654 1.0
```

Close enough. Our correlation won’t be *exactly* equal to .8 using this approach since we’re sampling from a distribution, but there’s really no difference (imo) between a .77 correlation and a .80 correlation.

BibTeX citation:

```
@online{ekholm2022,
author = {Eric Ekholm and EE},
title = {Generating {Data} with a {Given} {Correlation}},
date = {2022/09/08},
url = {https://www.ericekholm.com/posts/cor-generate-data},
langid = {en}
}
```

For attribution, please cite this work as:

Eric Ekholm, and EE. 2022. “Generating Data with a Given
Correlation.” September 8, 2022. https://www.ericekholm.com/posts/cor-generate-data.