The BOUTH Package

1 Overview

The BOUTH (BOttom-Up Tree Hypothesis testing) package implements the bottom-up approach to testing hypotheses that have a branching tree dependence structure, with false discovery rate control [1]. Our motivating example comes from testing the association between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). Given p-values from association tests for each individual OTU or ASV, we would like to know if we can declare that a certain species, genus, or higher taxonomic grouping can be considered to be associated with the trait. If a large proportion of species from that genus influence the trait, we should conclude the genus influences the trait. Conversely if only a few of the species from a genus are non-null, then a better description of the microbes that influence occurrence of the trait is a list of associated species. Finding taxa that can be said to influence a trait in this sense is the first goal of our approach. The second goal is to locate the highest taxa in the tree for which we can conclude many taxa below, but not any ancestors above, influence risk; we refer to such taxa as driver taxa.

The main function of the BOUTH package is bouth. It takes as input a user-specified tree structure anno.table and p-values at leaf nodes pvalue.leaves, and performs bottom-up tests. It currently implemented three tests: the one-stage weighted test, the one-stage unweighted test, and the two-stage weighted test. The anno.table is a data.frame that specifies the annotation (i.e., grouping) of leaf nodes (row) at each level (column), which are ordered with the highest level (the root node) first and the lowest level (the leaf nodes) last. The current version of bouth assumes only one parent for each node. The pvalue.leaves is a vector of p-values at leaf nodes, the names of which should match the values in the last column of anno.table. The BOUTH package also contains the function graphlan using tree image for visualizing the results of bouth.

2 Getting Started

The package can be downloaded and installed from GitHub.

> library(devtools)
> install_github("yli1992/BOUTH", force=T)

Load the package in R:

> library("BOUTH")

3 Analysis of the Inflammatory Bowel Disease (IBD) Data

3.1 IBD data

The IBD data included in this package are a subset of the data from a human gut microbiome study [2]. The dataset contains a taxonomic table tax.table and a vector of p-values pvalue.otus.

> data(IBD)

> dim(IBD$tax.table)
[1] 2360 8

> length(IBD$pvalue.otus)
[1] 2360

The tax.table has 8 levels, which from the top to bottom are kingdom, phylum, class, order, family, genus, species, and operational taxonomic units (OTUs), and 2360 OTUs. The pvalue.otus has p-values for testing the differential abundance of each of the 2360 OTUs between the ulcerative colitis (UC) and control groups. The first 3 records of tax.table and pvalue.otus are shown below.

> head(IBD$tax.table, 3)
   kingdom     phylum      class           order           family         genus species
1 Bacteria Firmicutes Clostridia   Clostridiales   Clostridiaceae         02d06 unknown
2 Bacteria Firmicutes    Bacilli Lactobacillales Streptococcaceae Streptococcus unknown
3 Bacteria Firmicutes Clostridia   Clostridiales  Lachnospiraceae   Coprococcus unknown
      otu
1  181342
2 4467992
3 4474844

> head(IBD$pvalue.otus, 3)
 181342 4467992 4474844 
0.31294 0.82785 0.00743

Note that many OTUs have missing assignment at species and genus levels and take the value "unknown".

3.2 One-stage bottom-up test

We first perform the one-stage, weighted bottom-up test. The parameter far (false assignment rate, the error rate considered in [1] in analogy with the false discovery rate concept) is set at 10% and na.symbol is specified as "unknown".

> test.1 = bouth(anno.table = IBD$tax.table, pvalue.leaves = IBD$pvalue.otus,
                na.symbol = "unknown", far = 0.1, is.weighted = TRUE)

The output contains three members. The results.by.level data frame gives the number of detected nodes at each level along with information on the test for that level. Here the one-stage weighted bottom-up test detected 143 OTUs, 5 species, 9 genera, 7 families, 5 orders, 5 classes to be associated with UC.

> test.1$results.by.level
  level level.name      q_l pvalue.cutoff all.nodes detected.nodes
1     1        otu 9.05e-02       0.00584      2360            143
2     2    species 2.15e-03       0.00238        56              5
3     3      genus 3.33e-03       0.00332        87              9
4     4     family 1.84e-03       0.00362        48              7
5     5      order 1.11e-03       0.00376        29              5
6     6      class 6.90e-04       0.00501        18              5
7     7     phylum 3.83e-04       0.00606        10              0
8     8    kingdom 3.83e-05       0.00666         1              0

The results.by.node data frame gives detailed results at each taxon (i.e., OTUs, species, genera, etc), including the (derived) p-value, the indicator of a detection or not, and the indicator of a driver taxon or not.

> head(test.1$results.by.node, 3)
   kingdom         phylum                class order family genus species otu pvalue
1 Bacteria                                                                     0.350
2 Bacteria Actinobacteria                                                      0.487
3 Bacteria Actinobacteria Actinobacteria_class                                 0.490
  is.detected is.driver
1       FALSE     FALSE
2       FALSE     FALSE
3       FALSE     FALSE

The list of detected nodes or detected driver nodes can be easily extracted from results.by.node.

> detected.nodes.1 = test.1$results.by.node[test.1$results.by.node$is.detected,]
> head(detected.nodes.1, 3)
    kingdom         phylum          class            order            family genus species
53 Bacteria Actinobacteria Coriobacteriia                                                 
54 Bacteria Actinobacteria Coriobacteriia Coriobacteriales                                
55 Bacteria Actinobacteria Coriobacteriia Coriobacteriales Coriobacteriaceae              
   otu  pvalue is.detected is.driver
53     0.00167        TRUE      TRUE
54     0.00167        TRUE     FALSE
55     0.00167        TRUE     FALSE

> driver.nodes.1 = test.1$results.by.node[test.1$results.by.node$is.driver,]
> head(driver.nodes.1, 3)
     kingdom         phylum          class         order               family        genus
53  Bacteria Actinobacteria Coriobacteriia                                                
118 Bacteria  Bacteroidetes    Bacteroidia Bacteroidales [Paraprevotellaceae] [Prevotella]
132 Bacteria  Bacteroidetes    Bacteroidia Bacteroidales       Bacteroidaceae             
    species otu   pvalue is.detected is.driver
53              1.67e-03        TRUE      TRUE
118             5.52e-05        TRUE      TRUE
132             2.59e-03        TRUE      TRUE

The output contains a third member tree, which is used for the visualization function graphlan. To use graphlan, the user needs to first download and install the python library GraPhlAn [3] from https://bitbucket.org/nsegata/graphlan/overview. The graphlan function requires the output from bouth, the output directory (the default is the currently working directory) for storing the output image file, and the directory where the GraPhlAn package is located.

> graphlan(bouth.out = test.1, output.dir = getwd(), graphlan.dir = "graphlan_directory")

A plot.png file can be found in output.dir, which is shown in Figure 1.

Figure 1. Taxa (marked in red) detected to be differentially abundant between the UC and control groups in the IBD data by the one-stage weighted bottom-up method

3.3 Two-stage bottom-up test

In some situations we may want to have separate control of FAR at level 1 (leaf nodes) and all remaining higher levels (inner nodes). For example, we may wish to first determine which OTUs are detected while controlling FAR at some level; then we may wish to conduct a second, separate analysis of taxa starting at the species level and continuing up the phylogenetic tree, while controlling the FAR of the second analysis at some level.

For a two-stage test, the parameter far should be specified as a vector of two FARs, one for level 1 test and one for the remaining nodes. In this example, we consider both FARs to be 5%.

> test.2 = bouth(anno.table = IBD$tax.table, pvalue.leaves = IBD$pvalue.otus,
                 na.symbol = "unknown", far = c(0.05, 0.05))

The two-stage bottom-up test detected 81 OTUs, 5 species, 5 genera, 5 families, 5 orders, 4 classes and 2 phyla to be associated with UC.

> test.2$results.by.level
  level level.name      q_l pvalue.cutoff all.nodes detected.nodes
1     1        otu 0.050000      0.001796      2360             81
2     2    species 0.011224      0.000409        56              5
3     3      genus 0.017551      0.000968        87              5
4     4     family 0.009796      0.001592        48              5
5     5      order 0.005714      0.002053        29              5
6     6      class 0.003469      0.003027        18              4
7     7     phylum 0.002041      0.005188        10              2
8     8    kingdom 0.000204      0.004672         1              0

4. References

Li, Y., Satten, GA., Hu Y.-J. “A bottom-up approach to testing hypotheses that have a branching tree dependence structure, with false discovery rate control” XXX(2018)
Halfvarson, J, et al. “Dynamics of the human gut microbiome in inflammatory bowel disease.” Nature microbiology 2.5 (2017): 17004.
Asnicar, F, et al. “Compact graphical representation of phylogenetic data and metadata with GraPhlAn.” PeerJ 3 (2015): e1029.