A pattern to create flexible analysis workflows.

The point of this blog post is to walk through a pattern I’ve started using in some of my analyses that combines `do.call()`

, `purrr::pmap()`

, and some wrapper functions to customize how a given analysis gets run. I’ll start by demonstrating `do.call()`

and `pmap()`

separately, then showing how you can use them together to do some cool things. I’m not going to go super in-depth on either `do.call()`

or `pmap()`

, so it might be worthwhile to look into some of the documentation for those functions separately.

Also – I’m going to use the `{palmerpenguins}`

data here to illustrate this workflow. And, like, as is typically the case with toy data, the point here isn’t to run a suite of analyses that answer meaningful questions about this data, but rather to demonstrate how to combine these functions in a way that could help you answer meaningful questions for your own data.

With all of that said, onward and upward!

To start, let’s load the packages we’ll need.

```
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(palmerpenguins)
```

```
glimpse(penguins)
```

```
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Ad~
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193~
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 46~
$ sex <fct> male, female, female, NA, female, male, fe~
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, ~
```

Cool cool. Now, let’s assume we want to analyze this penguins data. Let’s say we want to estimate a mean, a correlation coefficient, and fit a linear regression, and that this is our workflow (n.b. again that this probably shouldn’t be your *actual* workflow when you analyze data).

```
mean(penguins$body_mass_g, na.rm = TRUE)
```

`[1] 4201.754`

Another way we can do the exact same thing is with `do.call()`

. `do.call()`

has a “what” argument, to which you provide the function you want to call (or the character string name of the function), and an “args” argument, where you list the arguments to pass to “what”. It has some other arguments, too, but I’m going to ignore those here. So, the call below does the exact same thing we did previously:

The nice thing about do.call is that it’s very flexible. Say we wanted to run a correlation between body mass and bill depth. We can do this by directly calling the `cor()`

function:

```
# option 1:
cor(
x = penguins$body_mass_g,
y = penguins$bill_depth_mm,
use = "pairwise.complete.obs"
)
```

`[1] -0.4719156`

Or we can do the exact same thing via `do.call()`

:

```
# option 2
do.call("cor",
args = list(
x = penguins$body_mass_g,
y = penguins$bill_depth_mm,
use = "pairwise.complete.obs"
)
)
```

`[1] -0.4719156`

Or say we wanted to run a linear regression with body mass regressed on bill depth and sex. Again, we can call `lm()`

directly:

```
# option 1:
res1 <- lm(body_mass_g ~ bill_depth_mm + sex, data = penguins, na.action = "na.omit")
broom::glance(res1)
```

```
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
```

Or via `do.call()`

:

```
#option 2
res2 <- do.call("lm", args = list(
formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
))
broom::glance(res2)
```

```
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
```

Just based on the above, `do.call()`

isn’t really doing anything useful for us. It’s just a slightly more verbose way to call a function. But where `do.call()`

really shines is when you pair it with some iteration – which we’ll do now, via `purrr::pmap()`

– and/or some conditional logic (which we’ll add later via a wrapper function). Basically it shines with you program with it, is what I’m trying to say.

For those that don’t know, `purrr::pmap()`

extends `purrr::map()`

to allow for an arbitrary number of arguments to map over in parallel. If you’re not familiar with `purrr::map()`

, Hadley’s R for Data Science book has a good chapter on it. But anyway, let’s illustrate `pmap()`

by running a handful of correlations on some sample data

This gives us a list of x and y values, where the first element of `x`

is `a`

, the first element of `y`

is `b`

, etc etc. We can run a bunch of correlations – `x[[1]]`

with `y[[1]]`

, `x[[2]]`

with `y[[2]]`

etc – by using `pmap()`

and `cor()`

:

```
pmap(sample_args, ~cor(..1, ..2, use = "pairwise.complete.obs"))
```

```
[[1]]
[1] -0.08180587
[[2]]
[1] 0.01490455
[[3]]
[1] 0.1326938
```

Which can be a helpful pattern.

What’s potentially more interesting, though, is that we can also use `pmap()`

in conjunction with `do.call()`

to not only iterate through arguments passed to a given function (like we do with `cor()`

above), but to also iterate over various functions:

```
#create a vector of function names
funs <- c("mean", "cor", "lm")
#create a list of function arguments, where each element of the list is a list of args
fun_args <- list(
list(penguins$body_mass_g, na.rm = TRUE),
list(
penguins$body_mass_g,
penguins$bill_depth_mm,
use = "pairwise.complete.obs"
),
list(
formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
)
)
#combine the function names and args into a tibble
fun_iterator <- tibble(
f = funs,
fa = fun_args
)
#take a look at the tibble
glimpse(fun_iterator)
```

```
Rows: 3
Columns: 2
$ f <chr> "mean", "cor", "lm"
$ fa <list> [<3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 425~
```

What we’re doing in the above code is:

- creating a list of function names;
- creating a list of function arguments (where each element of the list is a list of args);
- binding these lists together in a tibble.

Then, we can then execute all of these functions with their corresponding arguments with `do.call()`

:

```
res <- pmap(fun_iterator, ~ do.call(..1, ..2))
```

Within `do.call()`

, we’re passing the first column of our `fun_iterator`

table to the first argument of `do.call()`

(as denoted by ..1), and the second column of the tibble to the second argument of `do.call()`

(as denoted by ..2). This will give us a list, `res`

, where each element is the result of the function/argument combination in our `fun_iterator`

tibble.

To prove it worked, let’s look at the results:

```
#mean
res[[1]]
```

`[1] 4201.754`

```
#cor
res[[2]]
```

`[1] -0.4719156`

```
#lm
broom::glance(res[[3]])
```

```
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
```

In theory, you can specify an entire set of analyses ahead of time and then execute them using `pmap()`

+ `do.call()`

if you wanted to. So let’s at one way we might do that via a wrapper function.

The real power of this is to write a function that wraps all of these components and allows you to run just a subset of them. And this is how I actually use this pattern in my own work. But I’ll touch on some real-world applications after we go through the code below.

Let’s start by writing a wrapper function that has 1 argument, `include`

, where `include`

is a character vector of function names.

```
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
#some code here
}
```

Then let’s drop all of the code that we just ran into the function:

```
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
#we already ran all of this
funs <- c("mean", "cor", "lm")
fun_args <- list(
list(penguins$body_mass_g, na.rm = TRUE),
list(
penguins$body_mass_g,
penguins$bill_depth_mm,
use = "pairwise.complete.obs"
),
list(
formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
)
)
fun_iterator <- tibble(
f = funs,
fa = fun_args
)
}
```

And then we subset the `fun_iterator`

tibble to only include the functions we include in the `include`

argument of our wrapper function, and executed only those functions via `pmap()`

+ `do.call()`

:

```
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
#this is all the same as previously
funs <- c("mean", "cor", "lm")
fun_args <- list(
list(penguins$body_mass_g, na.rm = TRUE),
list(
penguins$body_mass_g,
penguins$bill_depth_mm,
use = "pairwise.complete.obs"
),
list(
formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
)
)
fun_iterator <- tibble(
f = funs,
fa = fun_args
)
# filter to only a subset of these functions that we've asked for in the wrapper args
fun_iterator <- fun_iterator[fun_iterator$f %in% include, ]
#execute these functions
pmap(fun_iterator, ~do.call(..1, ..2))
}
```

So, say we just wanted the mean:

```
analyze_penguins("mean")
```

```
[[1]]
[1] 4201.754
```

Or just the mean and the correlation:

```
analyze_penguins(c("mean", "cor"))
```

```
[[1]]
[1] 4201.754
[[2]]
[1] -0.4719156
```

Or just the linear model:

```
broom::glance(analyze_penguins("lm")[[1]])
```

```
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
```

I really like this pattern for data cleaning. I have a handful of demographic variables that I regularly work with that need to be cleaned and/or recoded, and I have some helper functions I’ve written to clean/recode each of them individually. But I also have a “meta” `recode_demographics()`

function that can execute any combination of my helper functions depending on what I need for a given project. You can obviously also write your wrapper function to give you more control over the arguments to each constituent function (like by allowing you to pass in a formula to `lm()`

, for instance, rather than hardcoding your formula), which can make this whole approach very flexible! It can be a bit time-consuming to write a wrapper that gives you the right level of flexibility, but if you have a set of related tasks you do frequently, I think it’s worth the time to figure out.

For attribution, please cite this work as

EE (2022, March 15). Eric Ekholm: Combining pmap and do.call. Retrieved from https://www.ericekholm.com/posts/2022-03-15-combining-pmap-and-docall/

BibTeX citation

@misc{ee2022combining, author = {EE, }, title = {Eric Ekholm: Combining pmap and do.call}, url = {https://www.ericekholm.com/posts/2022-03-15-combining-pmap-and-docall/}, year = {2022} }