1 Overview
The BOUTH (BOttom-Up Tree Hypothesis testing) package implements the bottom-up approach to testing hypotheses that have a branching tree dependence structure, with false discovery rate control [1]. Our motivating example comes from testing the association between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). Given p-values from association tests for each individual OTU or ASV, we would like to know if we can declare that a certain species, genus, or higher taxonomic grouping can be considered to be associated with the trait. If a large proportion of species from that genus influence the trait, we should conclude the genus influences the trait. Conversely if only a few of the species from a genus are non-null, then a better description of the microbes that influence occurrence of the trait is a list of associated species. Finding taxa that can be said to influence a trait in this sense is the first goal of our approach. The second goal is to locate the highest taxa in the tree for which we can conclude many taxa below, but not any ancestors above, influence risk; we refer to such taxa as driver taxa.
The main function of the BOUTH package is bouth
. It takes as input a user-specified tree structure anno.table
and p-values at leaf nodes pvalue.leaves
, and performs bottom-up tests. It currently implemented three tests: the one-stage weighted test, the one-stage unweighted test, and the two-stage weighted test. The anno.table
is a data.frame that specifies the annotation (i.e., grouping) of leaf nodes (row) at each level (column), which are ordered with the highest level (the root node) first and the lowest level (the leaf nodes) last. The current version of bouth
assumes only one parent for each node. The pvalue.leaves
is a vector of p-values at leaf nodes, the names of which should match the values in the last column of anno.table
. The BOUTH package also contains the function graphlan
using tree image for visualizing the results of bouth
.
2 Getting Started
The package can be downloaded and installed from GitHub.
> library(devtools)
> install_github("yli1992/BOUTH", force=T)
Load the package in R:
> library("BOUTH")
3 Analysis of the Inflammatory Bowel Disease (IBD) Data
3.1 IBD data
The IBD data included in this package are a subset of the data from a human gut microbiome study [2]. The dataset contains a taxonomic table tax.table
and a vector of p-values pvalue.otus
.
> data(IBD)
> dim(IBD$tax.table)
[1] 2360 8
> length(IBD$pvalue.otus)
[1] 2360
The tax.table
has 8 levels, which from the top to bottom are kingdom, phylum, class, order, family, genus, species, and operational taxonomic units (OTUs), and 2360 OTUs. The pvalue.otus
has p-values for testing the differential abundance of each of the 2360 OTUs between the ulcerative colitis (UC) and control groups. The first 3 records of tax.table
and pvalue.otus
are shown below.
> head(IBD$tax.table, 3)
kingdom phylum class order family genus species
1 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae 02d06 unknown
2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus unknown
3 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Coprococcus unknown
otu
1 181342
2 4467992
3 4474844
> head(IBD$pvalue.otus, 3)
181342 4467992 4474844
0.31294 0.82785 0.00743
Note that many OTUs have missing assignment at species and genus levels and take the value "unknown"
.
3.2 One-stage bottom-up test
We first perform the one-stage, weighted bottom-up test. The parameter far
(false assignment rate, the error rate considered in [1] in analogy with the false discovery rate concept) is set at 10% and na.symbol
is specified as "unknown"
.
> test.1 = bouth(anno.table = IBD$tax.table, pvalue.leaves = IBD$pvalue.otus,
na.symbol = "unknown", far = 0.1, is.weighted = TRUE)
The output contains three members. The results.by.level
data frame gives the number of detected nodes at each level along with information on the test for that level. Here the one-stage weighted bottom-up test detected 143 OTUs, 5 species, 9 genera, 7 families, 5 orders, 5 classes to be associated with UC.
> test.1$results.by.level
level level.name q_l pvalue.cutoff all.nodes detected.nodes
1 1 otu 9.05e-02 0.00584 2360 143
2 2 species 2.15e-03 0.00238 56 5
3 3 genus 3.33e-03 0.00332 87 9
4 4 family 1.84e-03 0.00362 48 7
5 5 order 1.11e-03 0.00376 29 5
6 6 class 6.90e-04 0.00501 18 5
7 7 phylum 3.83e-04 0.00606 10 0
8 8 kingdom 3.83e-05 0.00666 1 0
The results.by.node
data frame gives detailed results at each taxon (i.e., OTUs, species, genera, etc), including the (derived) p-value, the indicator of a detection or not, and the indicator of a driver taxon or not.
> head(test.1$results.by.node, 3)
kingdom phylum class order family genus species otu pvalue
1 Bacteria 0.350
2 Bacteria Actinobacteria 0.487
3 Bacteria Actinobacteria Actinobacteria_class 0.490
is.detected is.driver
1 FALSE FALSE
2 FALSE FALSE
3 FALSE FALSE
The list of detected nodes or detected driver nodes can be easily extracted from results.by.node
.
> detected.nodes.1 = test.1$results.by.node[test.1$results.by.node$is.detected,]
> head(detected.nodes.1, 3)
kingdom phylum class order family genus species
53 Bacteria Actinobacteria Coriobacteriia
54 Bacteria Actinobacteria Coriobacteriia Coriobacteriales
55 Bacteria Actinobacteria Coriobacteriia Coriobacteriales Coriobacteriaceae
otu pvalue is.detected is.driver
53 0.00167 TRUE TRUE
54 0.00167 TRUE FALSE
55 0.00167 TRUE FALSE
> driver.nodes.1 = test.1$results.by.node[test.1$results.by.node$is.driver,]
> head(driver.nodes.1, 3)
kingdom phylum class order family genus
53 Bacteria Actinobacteria Coriobacteriia
118 Bacteria Bacteroidetes Bacteroidia Bacteroidales [Paraprevotellaceae] [Prevotella]
132 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae
species otu pvalue is.detected is.driver
53 1.67e-03 TRUE TRUE
118 5.52e-05 TRUE TRUE
132 2.59e-03 TRUE TRUE
The output contains a third member tree
, which is used for the visualization function graphlan
. To use graphlan
, the user needs to first download and install the python library GraPhlAn [3] from https://bitbucket.org/nsegata/graphlan/overview. The graphlan
function requires the output from bouth
, the output directory (the default is the currently working directory) for storing the output image file, and the directory where the GraPhlAn package is located.
> graphlan(bouth.out = test.1, output.dir = getwd(), graphlan.dir = "graphlan_directory")
A plot.png
file can be found in output.dir
, which is shown in Figure 1.
3.3 Two-stage bottom-up test
In some situations we may want to have separate control of FAR at level 1 (leaf nodes) and all remaining higher levels (inner nodes). For example, we may wish to first determine which OTUs are detected while controlling FAR at some level; then we may wish to conduct a second, separate analysis of taxa starting at the species level and continuing up the phylogenetic tree, while controlling the FAR of the second analysis at some level.
For a two-stage test, the parameter far
should be specified as a vector of two FARs, one for level 1 test and one for the remaining nodes. In this example, we consider both FARs to be 5%.
> test.2 = bouth(anno.table = IBD$tax.table, pvalue.leaves = IBD$pvalue.otus,
na.symbol = "unknown", far = c(0.05, 0.05))
The two-stage bottom-up test detected 81 OTUs, 5 species, 5 genera, 5 families, 5 orders, 4 classes and 2 phyla to be associated with UC.
> test.2$results.by.level
level level.name q_l pvalue.cutoff all.nodes detected.nodes
1 1 otu 0.050000 0.001796 2360 81
2 2 species 0.011224 0.000409 56 5
3 3 genus 0.017551 0.000968 87 5
4 4 family 0.009796 0.001592 48 5
5 5 order 0.005714 0.002053 29 5
6 6 class 0.003469 0.003027 18 4
7 7 phylum 0.002041 0.005188 10 2
8 8 kingdom 0.000204 0.004672 1 0
4. References
- Li, Y., Satten, GA., Hu Y.-J. “A bottom-up approach to testing hypotheses that have a branching tree dependence structure, with false discovery rate control” XXX(2018)
- Halfvarson, J, et al. “Dynamics of the human gut microbiome in inflammatory bowel disease.” Nature microbiology 2.5 (2017): 17004.
- Asnicar, F, et al. “Compact graphical representation of phylogenetic data and metadata with GraPhlAn.” PeerJ 3 (2015): e1029.