The abn
package is a comprehensive tool for Bayesian
network (BN) analysis, a form of probabilistic graphical model. It
derives a directed acyclic graph from empirical data, describing the
dependency structure between random variables. This package provides
routines for structure learning and parameter estimation of additive
Bayesian network (ABN) models.
BNs are a type of statistical model that leverages the principles of Bayesian statistics and graph theory to provide a framework for representing complex multivariate data. ABN models extend the concept of generalized linear models, typically used for predicting a single outcome, to scenarios with multiple dependent variables (e.g. Kratzer et al. (2023)). This makes them a powerful tool for understanding complex, multivariate datasets.
The need for a tool like abn
arises from the increasing
complexity of data in various fields. Researchers often face
multivariate, tabular data where the relationships between variables are
not straightforward. BN analysis becomes essential when traditional
statistical methods fall short in analyzing multivariate data with
intricate relationships, as it models these relationships graphically
for more straightforward data interpretation.
However, most existing implementations of BN models limit variable
types, often allowing discrete variables to have only discrete parent
variables, where a parent starts a directed edge in the graph. This
limitation can pose challenges when dealing with continuous or
mixed-type data (i.e., data that includes both continuous and discrete
variables) or when attempting to model complex relationships that do not
fit these restricted categories. Further details have been discussed in
the context of patient data in the study from Delucchi et al. (2022), particularly focusing on
the widely used bnlearn
package (Scutari 2010) and the abn
package.
The abn
package overcomes these limitations through its
additive model formulation, which generalizes the usual (Bayesian)
multivariable regression to accommodate multiple dependent variables.
Additionally, the abn
package offers a comprehensive suite
of features for model selection, structure learning, and parameter
estimation. It includes exact and greedy search algorithms for structure
learning and allows for integrating prior expert knowledge into the
model selection process by specifying structural constraints. Unlike
other software, abn
offers a Bayesian and
information-theoretic model scoring approach. Furthermore, it supports
mixed-effect models to control one-layer clustering, making it suitable,
e.g., for handling data from different sources.
Previous versions of the abn
package have been
successfully used in various fields, including epidemiology Kratzer and Furrer (2018) and health Delucchi et al. (2022). Despite its promise, the
abn
package encountered historical obstacles. Sporadic
maintenance and an incomplete codebase hindered its full potential.
Recognizing the need for enhancement, we undertook a substantial upgrade
and meticulously addressed legacy issues, revamped the codebase, and
introduced significant improvements. The latest version 3 of
abn
is now a robust and reliable tool for BN analysis.
Applying the latest standards for open-source software, we guarantee
active maintenance of abn
. Future updates are planned to
enhance its functionality and user experience further. We highly value
feedback from the user community, which will guide our ongoing
developments.
In summary, abn
sets itself apart by emphasizing ABNs
and its exhaustive features for model selection and structure learning.
Its unique contribution is the implementation of mixed-effect BN models,
thereby extending its applicability to a broader range of complex,
multivariate datasets of mixed, continuous and discrete data.
As outlined in Kratzer et al. (2023), the package’s comprehensive framework integrates the mixed-effects model for clustered data, considering data heterogeneity and grouping effects. However, this was confined to a Bayesian context. The implementation under the information-theoretic (“mle”) setting was notably missing in previous versions, an omission that has been rectified in the current version 3 onwards.
Analyzing hierarchical or grouped data, i.e., observations nested
within higher-level units, requires statistical models with
group-varying parameters (e.g., mixed-effect models). The
abn
package facilitates single-layer clustering, where
observations are grouped into a single layer of clusters. These clusters
are assumed to be independent, but intra-cluster observations may
exhibit correlation (e.g., students within schools, patient-specific
measurements over time, etc.). The ABN model is fitted independently as
a varying intercept model, where the intercept can vary while the slope
is assumed constant across all group levels.
Under the frequentist paradigm (method = "mle"
),
abn
employs the lme4
package (Bates et al. 2015) to fit generalized linear
mixed models for each of the Binomial, Poisson, and Gaussian distributed
variables. For multinomial distributed variables, abn
fits
a multinomial baseline category logit model with random effects using
the mclogit
package (Elff
2022). Currently, only single-layer clustering is supported
(e.g., for method = "mle"
, this corresponds to a random
intercept model).
With a Bayesian approach (method = "bayes"
),
abn
utilizes its own implementation of the Laplace
approximation as well as the INLA
package (Martins et al. 2013) to fit a single-level
hierarchical model for Binomial, Poisson, and Gaussian distributed
variables. Independent of the type of data, multinomial distributed
variables are not yet implemented with method ="bayes"
(details in the online manual).
Furthermore, the code base has been enhanced to be more efficient,
reliable, and user-friendly through code optimization, regular reviews
and continuous integration practices. We have adhered to the latest
open-source software standards, including active maintenance of
abn
. Future updates to augment its functionality are
planned via a flexible roadmap. User feedback is valued through open
communication channels, which will steer our ongoing developments.
Consequently, the latest version of abn
is now a robust and
reliable tool for BN analysis.
A comprehensive set of documented case studies has been published to
validate the abn
package (see the abn
website). The numerical
accuracy and quality assurance exercises were demonstrated in Kratzer et al. (2023). A rigorous testing
framework is implemented using the testthat
package (Wickham 2011), which is executed as part of an
extensive continuous integration pipeline designed explicitly for
non-standard R packages that rely on Rcpp
(Eddelbuettel et al. 2023) and JAGS
(Plummer 2003). Additional documentation
and resources are available on the abn
website for further
reference and guidance.
The abn
package is available on CRAN and can be
installed with the R command:
The development version of the abn
package is hosted on
GitHub and can be
installed using the devtools
package:
The development of the abn
package would not have been
possible without the significant contributions of the former developers
whose efforts have been instrumental in shaping this project. We
acknowledge the contributions of Fraser Iain Lewis, Marta Pittavino,
Gilles Kratzer, and Kalina Cherneva, in particular. We want to extend
our gratitude to the faculty staff at the Department of Mathematical Modeling and
Machine Learning from the University of Zurich and the Institute of Mathematics who
maintain the research and teaching infrastructure. Our appreciation also
goes to the UZH and the ZHAW for their financial support. We would like
to highlight the funding from the Digitalization Initiative of the
Zurich Higher Education Institutions (DIZH), which was instrumental in
the realization of this project, particularly within the context of the
“Modeling of multicentric and dynamic stroke health data” and “Stroke
DynamiX” projects. This work has been conducted as part of M.D.’s PhD
project, which is supervised by R.F.