A matsindf_apply primer
Matthew Kuperus Heun
2024-02-01
Source:vignettes/midf_apply_primer.Rmd
Introduction
matsindf_apply()
is a powerful and versatile function
that enables analysis with lists and data frames by applying
FUN
in helpful ways. The function is called
matsindf_apply()
, because it can be used to apply
FUN
to a matsindf
data frame, a data frame
that contains matrices as individual entries. (A
matsindf
data frame can be created by calling
collapse_to_matrices()
, as demonstrated below.)
But matsindf_apply()
can apply FUN
across
much more: data frames of single numbers, lists of matrices, lists of
single numbers, and individual numbers. This vignette demonstrates
matsindf_apply()
, starting with simple examples and
proceeding toward sophisticated analyses.
The basics
The basis of all analyses conducted with
matsindf_apply()
is a function (FUN
) to be
applied across data supplied in .dat
or ...
.
FUN
must return a named list of variables as its result.
Here is an example function that both adds and subtracts its arguments,
a
and b
, and returns a list containing its
results, c
and d
.
example_fun <- function(a, b) {
  return(list(c = matsbyname::sum_byname(a, b),
              d = matsbyname::difference_byname(a, b)))
}
Similar to lapply()
and its siblings, additional
argument(s) to matsindf_apply()
include the data over which
FUN
is to be applied. These arguments can, in the first
instance, be supplied as named arguments to the ...
argument of matsindf_apply()
. All arguments in
...
must be named. The ...
arguments to
matsindf_apply()
are passed to FUN
according
to their names. In this case, the output of
matsindf_apply()
is the named list returned by
FUN
.
matsindf_apply(FUN = example_fun, a = 2, b = 1)
#> $c
#> [1] 3
#>
#> $d
#> [1] 1
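The name-based matching described above can be illustrated with a small base-R sketch. This is not how matsindf_apply() is implemented; it assumes plain `+` and `-` as stand-ins for sum_byname() and difference_byname(), and both `example_fun_base()` and `call_by_name()` are hypothetical names introduced only for this illustration.

```r
# Stand-in for example_fun, using base arithmetic instead of matsbyname.
example_fun_base <- function(a, b) {
  list(c = a + b, d = a - b)
}

# Hypothetical helper: hand named values to FUN by name, as do.call() does.
call_by_name <- function(FUN, ...) {
  args <- list(...)
  # Every argument must be named so it can be matched to FUN's signature.
  stopifnot(!is.null(names(args)), all(names(args) != ""))
  do.call(FUN, args)
}

# The order of a and b does not matter, because matching is by name.
res <- call_by_name(example_fun_base, b = 1, a = 2)
str(res)
```

Because the values are matched by name, swapping the order of `a` and `b` in the call leaves the result unchanged.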
Passing an additional argument (z = 2
) causes an unused
argument error, because example_fun
does not have a
z
argument.
tryCatch(
matsindf_apply(FUN = example_fun, a = 2, b = 1, z = 2),
error = function(e){e}
)
#> <simpleError in matsindf_apply_types(.dat, FUN, ...): In matsindf::matsindf_apply(), the following unused arguments appeared in ...: z>
Failing to pass a needed argument (b
) causes an error
that indicates the missing argument.
tryCatch(
matsindf_apply(FUN = example_fun, a = 2),
error = function(e){e}
)
#> Warning in matsindf_apply_types(.dat, FUN, ...): In matsindf::matsindf_apply(),
#> the following named arguments to FUN were not found in any of .dat, ..., or
#> defaults to FUN: b. Set .warn_missing_FUN_args = FALSE to suppress this warning
#> if you know what you are doing.
#> <simpleError in (function (a, b) { return(list(c = matsbyname::sum_byname(a, b), d = matsbyname::difference_byname(a, b)))})(a = 2): argument "b" is missing, with no default>
Alternatively, arguments to FUN
can be given in a named
list to .dat
, the first argument of
matsindf_apply()
. When a value is assigned to
.dat
, the return value from matsindf_apply()
contains all named variables in .dat
(in this case both
a
and b
) in addition to the results provided
by FUN
(in this case both c
and
d
).
matsindf_apply(list(a = 2, b = 1), FUN = example_fun)
#> $a
#> [1] 2
#>
#> $b
#> [1] 1
#>
#> $c
#> [1] 3
#>
#> $d
#> [1] 1
Extra variables are tolerated in .dat
, because
.dat
is considered to be a store of data from which
variables can be drawn as needed.
matsindf_apply(list(a = 2, b = 1, z = 42), FUN = example_fun)
#> $a
#> [1] 2
#>
#> $b
#> [1] 1
#>
#> $c
#> [1] 3
#>
#> $d
#> [1] 1
In contrast, arguments to ...
are named explicitly by
the user, so including an extra argument in ...
is
considered an error, as shown above.
Some details
If a named argument is supplied by both .dat
and
...
, the argument in ...
takes precedence,
overriding the argument in .dat
.
matsindf_apply(list(a = 2, b = 1), FUN = example_fun, a = 10)
#> $a
#> [1] 10
#>
#> $b
#> [1] 1
#>
#> $c
#> [1] 11
#>
#> $d
#> [1] 9
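One way to picture this precedence rule is as a merge in which later values overwrite earlier ones. The base-R sketch below assumes that picture and uses utils::modifyList() for the merge; it is an analogy rather than the package's actual code, and `example_fun_base()` is a hypothetical stand-in that uses plain arithmetic.

```r
# Stand-in for example_fun, using base arithmetic instead of matsbyname.
example_fun_base <- function(a, b) {
  list(c = a + b, d = a - b)
}

dat  <- list(a = 2, b = 1)   # plays the role of .dat
dots <- list(a = 10)         # plays the role of ...

# Items in dots overwrite items with the same name in dat.
merged <- utils::modifyList(dat, dots)

out <- do.call(example_fun_base, merged)
str(c(merged, out))
```

The merged inputs are `a = 10` and `b = 1`, which reproduces the values `c = 11` and `d = 9` shown above.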
When supplying both .dat
and
...
, ...
can contain named strings of length
1
which are interpreted as mappings from named items in
.dat
to arguments in the signature of FUN
. In
the example below, a = "z"
indicates that argument
a
to FUN
should be supplied by item
z
in .dat
.
matsindf_apply(list(a = 2, b = 1, z = 42),
FUN = example_fun, a = "z")
#> $a
#> [1] 2
#>
#> $b
#> [1] 1
#>
#> $z
#> [1] 42
#>
#> $c
#> [1] 43
#>
#> $d
#> [1] 41
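The mapping idea can be sketched in base R as a lookup table from argument names to item names. The helper `resolve_args()` below is hypothetical and is only meant to show the concept under the assumption that unmapped arguments fall back to items of the same name; it does not reproduce the package's internals.

```r
# Stand-in for example_fun, using base arithmetic instead of matsbyname.
example_fun_base <- function(a, b) {
  list(c = a + b, d = a - b)
}

# Hypothetical helper: build the argument list for FUN from dat,
# honoring a named character vector that maps argument names to item names.
resolve_args <- function(dat, FUN, mapping = character()) {
  arg_names <- names(formals(FUN))
  # By default, each argument is supplied by the item with the same name.
  source_names <- stats::setNames(arg_names, arg_names)
  # Mapped arguments are redirected to the named item instead.
  source_names[names(mapping)] <- mapping
  lapply(source_names, function(nm) dat[[nm]])
}

dat  <- list(a = 2, b = 1, z = 42)
args <- resolve_args(dat, example_fun_base, mapping = c(a = "z"))
out  <- do.call(example_fun_base, args)
str(out)
```

With `a` redirected to item `z`, the function receives `a = 42` and `b = 1`, matching the `c = 43` and `d = 41` in the output above.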
If a named argument appears in both .dat
and the output
of FUN
, a name collision occurs in the output of
matsindf_apply()
, and a warning is issued.
tryCatch(
matsindf_apply(list(a = 2, b = 1, c = 42), FUN = example_fun),
warning = function(w){w}
)
#> <simpleWarning in matsindf_apply(list(a = 2, b = 1, c = 42), FUN = example_fun): Name collision in matsindf::matsindf_apply(). The following arguments appear both in .dat and in the output of `FUN`: c>
FUN
can accept more than just numerics.
example_fun_with_string()
accepts a character string and a
numeric. However, because a ...
argument that is a character
string of length 1
has special meaning (namely mapping
variables in .dat
to arguments of FUN
),
passing a character string of length 1
can cause an error.
To get around the problem, wrap the single string in a list, as shown
below.
example_fun_with_string <- function(str_a, b) {
a <- as.numeric(str_a)
list(added = matsbyname::sum_byname(a, b), subtracted = matsbyname::difference_byname(a, b))
}
# Causes an error
tryCatch(
matsindf_apply(FUN = example_fun_with_string, str_a = "1", b = 2),
error = function(e){e}
)
#> Warning in matsindf_apply_types(.dat, FUN, ...): In matsindf::matsindf_apply(),
#> the following named arguments to FUN were not found in any of .dat, ..., or
#> defaults to FUN: str_a. Set .warn_missing_FUN_args = FALSE to suppress this
#> warning if you know what you are doing.
#> <simpleError in (function (str_a, b) { a <- as.numeric(str_a) list(added = matsbyname::sum_byname(a, b), subtracted = matsbyname::difference_byname(a, b))})(b = 2): argument "str_a" is missing, with no default>
# To solve the problem, wrap "1" in list().
matsindf_apply(FUN = example_fun_with_string, str_a = list("1"), b = 2)
#> $added
#> [1] 3
#>
#> $subtracted
#> [1] -1
matsindf_apply(FUN = example_fun_with_string, str_a = list("1"), b = list(2))
#> $added
#> [1] 3
#>
#> $subtracted
#> [1] -1
matsindf_apply(FUN = example_fun_with_string,
str_a = list("1", "3"),
b = list(2, 4))
#> $added
#> $added[[1]]
#> [1] 3
#>
#> $added[[2]]
#> [1] 7
#>
#>
#> $subtracted
#> $subtracted[[1]]
#> [1] -1
#>
#> $subtracted[[2]]
#> [1] -1
matsindf_apply(.dat = list(str_a = list("1"), b = list(2)), FUN = example_fun_with_string)
#> $str_a
#> [1] "1"
#>
#> $b
#> [1] 2
#>
#> $added
#> [1] 3
#>
#> $subtracted
#> [1] -1
matsindf_apply(.dat = list(m = list("1"), n = list(2)), FUN = example_fun_with_string,
str_a = "m", b = "n")
#> $m
#> [1] "1"
#>
#> $n
#> [1] 2
#>
#> $added
#> [1] 3
#>
#> $subtracted
#> [1] -1
matsindf_apply()
and data frames
.dat
can also be a data frame (or tibble), both of
which are fancy lists. When .dat
is a data frame or tibble,
the output of matsindf_apply()
is a tibble, and
FUN
acts like a specialized dplyr::mutate()
,
adding new columns at the right of .dat
.
matsindf_apply(.dat = data.frame(str_a = c("1", "3"), b = c(2, 4)),
FUN = example_fun_with_string)
#> # A tibble: 2 × 4
#> str_a b added subtracted
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 2 3 -1
#> 2 3 4 7 -1
matsindf_apply(.dat = data.frame(str_a = c("1", "3"), b = c(2, 4)),
FUN = example_fun_with_string,
str_a = "str_a", b = "b")
#> # A tibble: 2 × 4
#> str_a b added subtracted
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 2 3 -1
#> 2 3 4 7 -1
matsindf_apply(.dat = data.frame(m = c("1", "3"), n = c(2, 4)),
FUN = example_fun_with_string,
str_a = "m", b = "n")
#> # A tibble: 2 × 4
#> m n added subtracted
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 2 3 -1
#> 2 3 4 7 -1
Additional niceties are available when .dat
is a data
frame or a tibble. matsindf_apply()
works when the data
frame is filled with single numeric values, as is typical.
df <- data.frame(a = 2:4, b = 1:3)
matsindf_apply(df, FUN = example_fun)
#> # A tibble: 3 × 4
#> a b c d
#> <int> <int> <int> <int>
#> 1 2 1 3 1
#> 2 3 2 5 1
#> 3 4 3 7 1
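Conceptually, the call above applies the function once per row and collects the results as new columns. A rough base-R equivalent is sketched below, assuming plain arithmetic in place of the matsbyname functions; `example_fun_base()` is a hypothetical stand-in, and the result is an ordinary data frame rather than a tibble.

```r
# Stand-in for example_fun, using base arithmetic instead of matsbyname.
example_fun_base <- function(a, b) {
  list(c = a + b, d = a - b)
}

df <- data.frame(a = 2:4, b = 1:3)

# Apply the function to each row: one named list of results per row.
rows <- Map(example_fun_base, a = df$a, b = df$b)

# Stack the per-row results and append them to the right of df.
new_cols <- do.call(rbind, lapply(rows, as.data.frame))
out <- cbind(df, new_cols)
out
```

The sketch yields the same `c` and `d` values as the tibble above, with the new columns added at the right.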
But matsindf_apply()
also works with
matsindf
data frames, data frames in which each cell of the
data frame is filled with a single matrix. To demonstrate use of
matsindf_apply()
with a matsindf
data frame,
we’ll construct a simple matsindf
data frame
(midf
) using functions in this package.
# Create a tidy data frame containing data for matrices
tidy <- tibble::tibble(Year = rep(c(rep(2017, 4), rep(2018, 4)), 2),
                       matnames = c(rep("U", 8), rep("V", 8)),
                       matvals = c(1:4, 11:14, 21:24, 31:34),
                       rownames = c(rep(c(rep("p1", 2), rep("p2", 2)), 2),
                                    rep(c(rep("i1", 2), rep("i2", 2)), 2)),
                       colnames = c(rep(c("i1", "i2"), 4),
                                    rep(c("p1", "p2"), 4))) |>
  dplyr::mutate(
    rowtypes = dplyr::case_when(
      matnames == "U" ~ "Product",
      matnames == "V" ~ "Industry",
      TRUE ~ NA_character_
    ),
    coltypes = dplyr::case_when(
      matnames == "U" ~ "Industry",
      matnames == "V" ~ "Product",
      TRUE ~ NA_character_
    )
  )
tidy
#> # A tibble: 16 × 7
#> Year matnames matvals rownames colnames rowtypes coltypes
#> <dbl> <chr> <int> <chr> <chr> <chr> <chr>
#> 1 2017 U 1 p1 i1 Product Industry
#> 2 2017 U 2 p1 i2 Product Industry
#> 3 2017 U 3 p2 i1 Product Industry
#> 4 2017 U 4 p2 i2 Product Industry
#> 5 2018 U 11 p1 i1 Product Industry
#> 6 2018 U 12 p1 i2 Product Industry
#> 7 2018 U 13 p2 i1 Product Industry
#> 8 2018 U 14 p2 i2 Product Industry
#> 9 2017 V 21 i1 p1 Industry Product
#> 10 2017 V 22 i1 p2 Industry Product
#> 11 2017 V 23 i2 p1 Industry Product
#> 12 2017 V 24 i2 p2 Industry Product
#> 13 2018 V 31 i1 p1 Industry Product
#> 14 2018 V 32 i1 p2 Industry Product
#> 15 2018 V 33 i2 p1 Industry Product
#> 16 2018 V 34 i2 p2 Industry Product
# Convert to a matsindf data frame
midf <- tidy |>
dplyr::group_by(Year, matnames) |>
collapse_to_matrices(rowtypes = "rowtypes", coltypes = "coltypes") |>
tidyr::pivot_wider(names_from = "matnames", values_from = "matvals")
# Take a look at the midf data frame and some of the matrices it contains.
midf
#> # A tibble: 2 × 3
#> Year U V
#> <dbl> <list> <list>
#> 1 2017 <dbl [2 × 2]> <dbl [2 × 2]>
#> 2 2018 <dbl [2 × 2]> <dbl [2 × 2]>
midf$U[[1]]
#> i1 i2
#> p1 1 2
#> p2 3 4
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
midf$V[[1]]
#> p1 p2
#> i1 21 22
#> i2 23 24
#> attr(,"rowtype")
#> [1] "Industry"
#> attr(,"coltype")
#> [1] "Product"
With midf
in hand, we can demonstrate use of tidyverse
-style
functional programming to perform matrix algebra within a data frame.
The functions of the matsbyname
package (such as
difference_byname()
below) can be used for this
purpose.
result <- midf |>
  dplyr::mutate(
    W = matsbyname::difference_byname(matsbyname::transpose_byname(V), U)
  )
result
#> # A tibble: 2 × 4
#> Year U V W
#> <dbl> <list> <list> <list>
#> 1 2017 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
#> 2 2018 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
result$W[[1]]
#> i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
result$W[[2]]
#> i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
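The values in W can be checked by hand for the 2017 row. The sketch below rebuilds the 2017 U and V matrices in base R and computes the transposed difference with `t()` and `-`. It assumes the row and column names already line up, which is the case here; the matsbyname functions additionally align operands by name and track row and column types, which base arithmetic does not do.

```r
# Rebuild the 2017 matrices shown above (filled row by row).
U <- matrix(1:4, nrow = 2, byrow = TRUE,
            dimnames = list(c("p1", "p2"), c("i1", "i2")))
V <- matrix(21:24, nrow = 2, byrow = TRUE,
            dimnames = list(c("i1", "i2"), c("p1", "p2")))

# W = transpose(V) - U, element by element.
W <- t(V) - U
W
```

For example, the p1, i2 entry is 23 - 2 = 21, in agreement with result$W[[1]]. The 2018 matrices are the 2017 matrices shifted by 10 in both U and V, so their difference is the same, which is why result$W[[2]] equals result$W[[1]].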
This way of performing matrix calculations works equally well within
a 2-row matsindf
data frame (as shown above) or within a
1000-row matsindf
data frame.
Programming with matsindf_apply()
Users can write their own functions using
matsindf_apply()
. A flexible calc_W()
function
can be written as follows.
calc_W <- function(.DF = NULL, U = "U", V = "V", W = "W") {
  # The inner function does all the work.
  W_func <- function(U_mat, V_mat) {
    # When we get here, U_mat and V_mat will be single matrices or single numbers,
    # not a column in a data frame or an item in a list.
    if (length(U_mat) == 0 && length(V_mat) == 0) {
      # Tolerate zero-length arguments by returning a zero-length
      # item in a list with the correct name and return type.
      return(list(numeric()) |> magrittr::set_names(W))
    }
    # Calculate W_mat from the inputs U_mat and V_mat.
    W_mat <- matsbyname::difference_byname(
      matsbyname::transpose_byname(V_mat),
      U_mat)
    # Return a named list.
    list(W_mat) |> magrittr::set_names(W)
  }
  # The body of the main function consists of a call to matsindf_apply
  # that specifies the inner function in the FUN argument.
  matsindf_apply(.DF, FUN = W_func, U_mat = U, V_mat = V)
}
This style of writing matsindf_apply()
functions is
incredibly versatile, leveraging the capabilities of both the
matsindf
and matsbyname
packages. (Indeed, the
Recca
package uses matsindf_apply()
heavily
and is built upon the functions in the matsindf
and
matsbyname
packages.)
Functions written like calc_W()
can operate in ways
similar to matsindf_apply()
itself. To demonstrate, we’ll
use calc_W()
in all the ways that
matsindf_apply()
can be used, going in the reverse order to
our demonstration of the capabilities of matsindf_apply()
above.
calc_W()
can be used as a specialized
mutate
function that operates on matsindf
data
frames.
midf |> calc_W()
#> # A tibble: 2 × 4
#> Year U V W
#> <dbl> <list> <list> <list>
#> 1 2017 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
#> 2 2018 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
The added column could be given a different name from the default
(“W
”) using the W
argument.
midf |> calc_W(W = "W_prime")
#> # A tibble: 2 × 4
#> Year U V W_prime
#> <dbl> <list> <list> <list>
#> 1 2017 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
#> 2 2018 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
As with matsindf_apply()
, column names in
midf
can be mapped to the arguments of
calc_W()
by the arguments to calc_W()
.
midf |>
dplyr::rename(X = U, Y = V) |>
calc_W(U = "X", V = "Y")
#> # A tibble: 2 × 4
#> Year X Y W
#> <dbl> <list> <list> <list>
#> 1 2017 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
#> 2 2018 <dbl [2 × 2]> <dbl [2 × 2]> <dbl [2 × 2]>
calc_W()
can operate on lists of single matrices, too.
This approach works, because the default values for the U
and V
arguments to calc_W()
are “U” and “V”,
respectively. The input list members (in this case
midf$U[[1]]
and midf$V[[1]]
) are returned with
the output, because list(U = midf$U[[1]], V = midf$V[[1]])
is passed to the .dat
argument of
matsindf_apply()
.
calc_W(list(U = midf$U[[1]], V = midf$V[[1]]))
#> $U
#> i1 i2
#> p1 1 2
#> p2 3 4
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
#>
#> $V
#> p1 p2
#> i1 21 22
#> i2 23 24
#> attr(,"rowtype")
#> [1] "Industry"
#> attr(,"coltype")
#> [1] "Product"
#>
#> $W
#> i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
It may be clearer to name the arguments as required by the
calc_W()
function without wrapping in a list first, as
shown below. But in this approach, the input matrices are not returned
with the output, because arguments U
and V
are
passed to the ...
argument of
matsindf_apply()
, not the .dat
argument of
matsindf_apply()
.
calc_W(U = midf$U[[1]], V = midf$V[[1]])
#> $W
#> i1 i2
#> p1 20 21
#> p2 19 20
#> attr(,"rowtype")
#> [1] "Product"
#> attr(,"coltype")
#> [1] "Industry"
calc_W()
can operate on data frames containing single
numbers.
data.frame(U = c(1, 2), V = c(3, 4)) |> calc_W()
#> # A tibble: 2 × 3
#> U V W
#> <dbl> <dbl> <dbl>
#> 1 1 3 2
#> 2 2 4 2
Finally, calc_W()
can be applied to single numbers, and
the result is a single number.
calc_W(U = 2, V = 3)
#> $W
#> [1] 1
It is good practice to write internal functions that tolerate
zero-length inputs, as calc_W()
does. Doing so enables
results from different calculations to be bound together by row,
as with dplyr::bind_rows() below.
calc_W(U = numeric(), V = numeric())
#> $W
#> numeric(0)
calc_W(list(U = numeric(), V = numeric()))
#> $U
#> numeric(0)
#>
#> $V
#> numeric(0)
#>
#> $W
#> numeric(0)
res <- calc_W(list(U = c(2, 3, 4, 5), V = c(3, 4, 5, 6)))
res0 <- calc_W(list(U = numeric(), V = numeric()))
dplyr::bind_rows(res, res0)
#> # A tibble: 4 × 3
#> U V W
#> <dbl> <dbl> <dbl>
#> 1 2 3 1
#> 2 3 4 1
#> 3 4 5 1
#> 4 5 6 1
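The benefit of a correctly named, correctly typed zero-length result can also be seen with base R alone. The sketch below is an analogy under stated assumptions: it uses ordinary data frames and rbind() rather than calc_W() and dplyr::bind_rows(), and it computes W as V - U with plain arithmetic.

```r
# A non-empty result with columns U, V, and W.
res <- data.frame(U = c(2, 3, 4, 5), V = c(3, 4, 5, 6))
res$W <- res$V - res$U

# A zero-row result with the same column names and types.
res0 <- data.frame(U = numeric(), V = numeric(), W = numeric())

# Binding succeeds because the empty result has a matching structure.
combined <- rbind(res, res0)
combined
```

The empty piece contributes no rows, so the combined data frame has the same four rows as `res`.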
Conclusion
This vignette demonstrated use of the versatile
matsindf_apply()
function. Inputs to
matsindf_apply()
can be
- single numbers,
- matrices, or
- data frames with appropriately-named columns.
matsindf_apply()
can be used for programming, and
functions constructed as demonstrated above share characteristics with
matsindf_apply()
:
- they can be used as specialized
dplyr::mutate()
operators, and
- they can be applied to single numbers, matrices, or data frames with appropriately-named columns.