Relative Abundance Corrected Beta Diversity
This method was described in Brocklehurst, N., Day, M.O. & Fröbisch, J. 2018. Accounting for differences in species frequency distributions when calculating beta diversity in the fossil record. Methods in Ecology and Evolution (9): 1409-1420. See this paper for references and more detail on the concepts.
Back in 1960, Robert Whittaker coined the terms Alpha, Beta and Gamma diversity. Gamma diversity is the total species richness of an assemblage (or “landscape”, as he put it), and is determined by the species richness in each habitat (alpha diversity) and the amount of faunal differentiation between each habitat (beta diversity). In more recent years, beta diversity has been used to compare a wider range of spatial scales than just habitats, including localities, geological formations and basins (for palaeontologists), “bioregions” (sometimes somewhat arbitrarily defined) and continents. Usually, beta diversity is calculated as a set of pairwise taxonomic distances between the localities. There’s a huge variety of distance metrics, but they largely work on the same idea: comparing the number of species shared between localities to the total number of species in each. The metrics range between 0 (all species shared between the localities) and 1 (localities have completely different species).
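To make the idea of a pairwise taxonomic distance concrete, here is a minimal sketch in base R of one such metric, the Sørensen dissimilarity, for two localities. The species lists and the helper name sorensen_dissimilarity are invented for illustration and are not part of the functions provided with this tutorial.

```r
# Hypothetical species lists for two localities (invented for illustration)
loc1 <- c("sp_a", "sp_b", "sp_c", "sp_d")
loc2 <- c("sp_c", "sp_d", "sp_e")

# Sorensen dissimilarity:
# 1 - 2 * (shared species) / (richness of x + richness of y)
sorensen_dissimilarity <- function(x, y) {
  shared <- length(intersect(x, y))
  1 - (2 * shared) / (length(unique(x)) + length(unique(y)))
}

# Two shared species out of 4 + 3 recorded: an intermediate value
d12 <- sorensen_dissimilarity(loc1, loc2)

# Identical lists give the minimum; disjoint lists give the maximum
d_same <- sorensen_dissimilarity(loc1, loc1)
d_none <- sorensen_dissimilarity(loc1, c("sp_x", "sp_y"))
```

With more than two localities, the same calculation is repeated for every pair and the distances are then summarised.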
In any ecological study, but in palaeontology in particular, you have to account for the fact that your sample is incomplete. In the case of beta diversity, an incomplete sample inflates your beta diversity estimate: the worse your sampling, the less likely you are to find the species that the localities have in common.
This isn’t too bad (it’s an issue that can be dealt with by subsampling), but the problem is made worse by another issue: the evenness of the abundance distribution. Not every species is equally abundant, and the range of abundances will vary from habitat to habitat, assemblage to assemblage, time bin to time bin. The most obvious shifts in the shape of abundance distributions occur during times of extreme environmental stress and mass extinction, when you often get one or a few hyperabundant species dominating, e.g. Lystrosaurus, which following the end-Permian mass extinction made up almost half of all specimens found in its range. Shifts in abundance distribution will affect beta diversity estimates in a way not corrected for by subsampling (since subsampling is also affected by this issue). The problem is simple: all else being equal, you are more likely to sample the common species. So, let us say we have two localities in a homogeneous fauna (beta diversity of 0). If there is a very uneven abundance distribution with one hyperabundant species, it is easy to sample that species in both localities. If, however, the abundance distribution is more even, every species is harder to find, you are less likely to find the same set of species in each locality, and the beta diversity estimate will be raised.
As a solution to this problem, I came up with the Relative Abundance Corrected (RAC) Beta Diversity metric (no affiliation with the providers of breakdown cover). In short (see the paper for full details), a null beta diversity estimate is calculated by drawing at random the same number of species in each locality from the total abundance distribution from all localities. This effectively simulates an incompletely sampled homogeneous fauna, and tells you what your beta diversity estimate should be given the level of sampling and the evenness of the abundance distribution. You then subtract this null from the observed beta diversity estimate, and rescale so that the null is at 0 and the maximum is 1. This has all been wrapped up in an R function, and now I’m going to provide a tutorial on how to use it. Links to download the functions and example files are in the tutorial.
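The logic of the null and the rescaling can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code inside RACB.Diversity(): the abundance matrix is invented, the helper names are mine, the draw is assumed to be made specimen by specimen with each locality receiving as many specimens as were observed there, and the rescaling is assumed to be (observed - null) / (1 - null), which is the simplest formula that puts the null at 0 and the maximum at 1. See the paper for the actual procedure.

```r
set.seed(1)

# Hypothetical abundance matrix: taxa in rows, localities in columns
abund <- matrix(c(10, 0, 3,
                   4, 6, 0,
                   0, 5, 2,
                   1, 0, 7),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("sp", 1:4), c("locA", "locB", "locC")))

# Mean pairwise Sorensen dissimilarity on presence/absence
mean_sorensen <- function(m) {
  pa <- m > 0
  pairs <- combn(ncol(pa), 2)
  d <- apply(pairs, 2, function(p) {
    shared <- sum(pa[, p[1]] & pa[, p[2]])
    1 - 2 * shared / (sum(pa[, p[1]]) + sum(pa[, p[2]]))
  })
  mean(d)
}

# One simulated homogeneous fauna (ASSUMPTION: each locality draws its
# observed number of specimens from the pooled abundance distribution)
simulate_null <- function(m) {
  pooled <- rowSums(m) / sum(m)
  sim <- sapply(colSums(m), function(n) {
    tabulate(sample(nrow(m), n, replace = TRUE, prob = pooled),
             nbins = nrow(m))
  })
  dimnames(sim) <- dimnames(m)
  sim
}

observed  <- mean_sorensen(abund)
null_vals <- replicate(200, mean_sorensen(simulate_null(abund)))
null_mean <- mean(null_vals)

# ASSUMED rescaling: null maps to 0, the maximum distance (1) maps to 1
rac <- (observed - null_mean) / (1 - null_mean)
```

A value of rac near 0 would mean the observed differentiation is about what incomplete sampling of a single homogeneous fauna would produce; the real function additionally reports confidence intervals and can apply subsampling.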
1. If not yet installed, install the R package vegan.
2. Load the vegan package:
library(vegan)
3. Read in the RACB.Diversity() function and, if you intend to use John Alroy’s modification of the Forbes distance metric, the Alroy.Forbes() function (both can be downloaded via the links).
4. Download the example.csv file and place it in your working directory; you can check what your working directory is with the line:
getwd()
5. Load the data and store it as an object called “dataset”. The data need to be a matrix of abundances* with taxa in rows and localities in columns**. The example .csv file provided may be read into R using the following line:
dataset <- read.csv("example.csv", row.names = 1, header = TRUE)
*NOTE: actual abundances are required to calculate the null, even if the taxonomic distance metric you plan to use only needs presence/absence data.
**NOTE: row names and column names are required, and each must be unique.
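Before running the analysis it can be worth checking that your own table meets these requirements. The following is a minimal sketch; the helper check_abundance_matrix() and the small table are hypothetical and are not part of the downloadable functions. It assumes abundances are whole-number specimen counts.

```r
# Hypothetical abundance table in the required layout (invented counts):
# taxa in rows, localities in columns
dataset_check <- data.frame(
  locality_1 = c(12, 3, 0),
  locality_2 = c(5, 0, 7),
  row.names  = c("taxon_a", "taxon_b", "taxon_c")
)

# Stops with an error if any requirement is not met
check_abundance_matrix <- function(x) {
  m <- as.matrix(x)
  stopifnot(
    is.numeric(m),                 # abundances, not text
    !anyNA(m),                     # no missing cells
    all(m >= 0),                   # no negative counts
    all(m == round(m)),            # whole-number counts (assumption)
    !is.null(rownames(m)), !anyDuplicated(rownames(m)),
    !is.null(colnames(m)), !anyDuplicated(colnames(m))
  )
  invisible(TRUE)
}

ok <- check_abundance_matrix(dataset_check)
```

Once your own data are loaded, the same check can be run as check_abundance_matrix(dataset).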
6. Choose analysis parameters. The RACB.Diversity() function has six arguments:
· data: the abundance matrix (see step 5 for required format)
· metric: taxonomic distance metric. At the moment I’ve included four options (it shouldn’t be too difficult to add more if need be): "Forbes" (Alroy 2015); "Sorenson" (Sørensen 1948) [default]; "Lennon" (Lennon et al. 2001); "Bray" (Bray & Curtis 1957). Note that the option must be typed exactly as shown, including the spelling "Sorenson".
· sim.iter: a number greater than or equal to 100, representing how many simulated homogeneous datasets will be generated to calculate the null. Note that the function will not work with fewer than 100, due to the method used to calculate the 95% confidence intervals. Default is 1000.
· samp.stand: logical (TRUE or FALSE); whether to use coverage-based subsampling. Default is TRUE. The next two arguments are only needed if samp.stand=TRUE
· samp.stand.size: a number between 0 and 1; the coverage to which each locality will be subsampled. Note that if any localities have coverage lower than the value specified (measured by Good's u), the function will automatically remove them from the calculation of beta diversity. Default is 0.6.
· samp.stand.iter: a number greater than or equal to 1, representing the number of subsampling trials to be carried out (subsampling is carried out on both the raw data and the simulated homogeneous faunas used to calculate the null). Default is 1000.
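For reference, Good's u for a single locality is one minus the proportion of specimens that belong to species represented by only one specimen. The sketch below shows the calculation with invented counts; the helper name goods_u is mine, and the downloadable function may differ in detail in how it computes and applies coverage.

```r
# Good's u: 1 - (number of singleton species) / (total number of specimens)
goods_u <- function(counts) {
  counts <- counts[counts > 0]   # ignore taxa absent from this locality
  1 - sum(counts == 1) / sum(counts)
}

# Invented specimen counts per species for two localities
well_sampled   <- c(12, 8, 5, 3, 2)     # no singletons
poorly_sampled <- c(3, 1, 1, 1, 1, 1)   # five singletons among 8 specimens

u_well <- goods_u(well_sampled)
u_poor <- goods_u(poorly_sampled)

# Localities whose coverage falls below the chosen quorum (here the default
# of 0.6) would be dropped before beta diversity is calculated
keep <- c(well = u_well, poor = u_poor) >= 0.6
```

Here the second locality would be excluded at a quorum of 0.6, which is worth checking for your own data before choosing samp.stand.size.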
7. Run the following line of code to perform the analysis and store the output as the object “beta”. This example uses the Sørensen taxonomic distance metric, does not include subsampling, and uses 100 simulated datasets to calculate the null***.
beta <- RACB.Diversity(data = dataset, metric = "Sorenson", sim.iter = 100, samp.stand = FALSE)
***NOTE: these parameters were chosen to speed up the example; they are NOT a recommendation.
8. Examine output by typing beta into the R console. The function outputs a vector of length seven, with names:
· "Raw Beta Diversity": the mean pairwise taxonomic distance between each of the localities (no RAC correction applied). If samp.stand was set to TRUE, this will have incorporated sampling standardisation.
· "Null Expectation": the beta diversity that would be expected if a homogeneous fauna with the same abundance distribution as the observed fauna were sampled to the same extent as the observed fauna.
· "Null Upper": upper confidence interval of the null.
· "Null Lower": lower confidence interval of the null.
· "RAC Beta Diversity": the relative-abundance-corrected beta diversity estimate.
· "RAC Upper": upper confidence interval of the RAC beta diversity estimate.
· "RAC Lower": lower confidence interval of the RAC beta diversity estimate.
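Because the output is a named vector, individual values can be pulled out by name. The sketch below uses a stand-in vector with the same seven names so that it runs without the downloaded function; the numbers are placeholders, NOT results from any dataset, and the comparison against "Null Upper" is just one plausible way to read the output rather than a rule from the paper.

```r
# Stand-in for the object returned in step 7 (placeholder values only)
beta_example <- c(
  "Raw Beta Diversity" = 0.70,
  "Null Expectation"   = 0.50,
  "Null Upper"         = 0.55,
  "Null Lower"         = 0.45,
  "RAC Beta Diversity" = 0.40,
  "RAC Upper"          = 0.45,
  "RAC Lower"          = 0.33
)

# Extract a single value by name
rac_estimate <- beta_example[["RAC Beta Diversity"]]

# Does the raw estimate sit above the upper bound of the null?
exceeds_null <- beta_example[["Raw Beta Diversity"]] > beta_example[["Null Upper"]]

# Estimate with its confidence limits, in reporting order
rac_summary <- beta_example[c("RAC Lower", "RAC Beta Diversity", "RAC Upper")]
```

With your own analysis, replace beta_example with the beta object from step 7.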