Simulating a Biogeographic History
This method was described in Brocklehurst, N., Dunne, E.M., Cashmore, D.D., & Fröbisch, J. 2018. Physical and environmental drivers of Paleozoic tetrapod dispersal across Pangaea. Nature Communications (9): 5216. See this paper for references and more detail on the concepts
Simulations are useful for so many things. You make predictions about the effects of parameter changes, test if your methods work, produce example or null datasets…the possibilities are endless.
Here I’m giving you some code that simulates a biogeographic history of a clade. That is, it simulates how the geographic location of lineages have changed through time. The simulation takes a phylogeny and uses a continuous-time Markov process to position various biogeographic events over it. These are:
1. dispersal – a lineage jumping from one area to another, effectively a range expansion;
2. local extinction – a lineage present in more than one area disappears from one region, a range contraction
3.vicariance – a lineage present in two or more regions has a barrier inserted within its range, causing speciation and leading to two lineages, one covering one portion of the original range, a second in the other portion.
The first two of those happen along the branches of the phylogeny, but because vicariance coincides with a speciation events, they happen on a node.
Now a quick warning before we get started on the function; I created this for quite a specific reason. Basically, its part of a new method I designed for calculating dispersal rates through time; I needed a null simulation to show how many dispersal events might be expected to occur if the rate was constant but the number of species (and therefore opportunities for dispersal) changed. See the paper above for full details if you’re interested. As such, the function spits out the information that I found useful when doing this, but others might not find everything it produces interesting or would like it to produce extra info. Let me know if there’s anything you’d be interested in adding.
So, lets get started with the tutorial. Links to download functions and example files are within.
1. If not yet installed, install the R packages spuRs, paleotree, and BioGeoBEARS and all its dependencies*
*NOTE BioGeoBEARS is a bit of a faff to install. See its website for full instructions: http://phylo.wikidot.com/biogeobears
2. Read in the spuRs, paleotree and BioGeoBEARS package and its dependencies**
library(spuRs)
library(paleotree)
library(rexpokit)
library(cladoRcpp)
library(BioGeoBEARS)
**NOTE Normally it doesn’t matter what order you read in the packages, but apparently for BioGeoBEARS it does.
3. Read in the diva.sim() function (download via the link)
4. The function simulates a biogeographic history over a phylogeny, so obviously we need a phylogeny (and it needs to be time-calibrated; the length of the branches needs to represent time). For purposes of example, we’re just going to generate one at random here and store it as an object called “tree”.
tree<-rtree(20, br = runif(1,5,10))
5. The next thing we need is a two column matrix of time bins, where the first column is the beginning of the time bin, the second is the end of it. This is not going to be useful for most people, but I needed it for my calculations of rate through time, so it is required for the function to work. So here I’m just creating 10 equal sized bins covering 10 million years each, and storing it as an object called “time”.
time<- matrix(ncol=2,nrow=10)
time[,1]<-seq(from=100,to=10,by=-10)
time[,2]<-seq(from=90,to=0,by=-10)
time
6. The last thing you might need (its optional) is an area adjacency matrix. Geography is being treated as a set of discrete entities, and a lineage is either present or absent in each region. No information on distance or region size is included. You could simulate it so that a lineage can disperse from any area to any area, but that might not be biologically realistic; for example, a flightless animal could not disperse from South Africa to Europe without passing through northern Africa. The area adjacency matrix can tell the simulation what areas are next to eachother and limit dispersal to between adjacent areas. It needs to be a pairwise, symmetrical matrix comparing each area to each area. A 1 indicates that two areas are adjacent, a 0 that they are not. Here’s one I’ve just made up, stored as an object called “adjacent”.
adjacent<-matrix(nrow=4,ncol=4,data=c(1,1,0,0,1,1,1,0,0,1,1,1,0,0,1,1))
adjacent
7. So, onto the diva.sim() function. It has eight arguments
- test.tree: the phylogeny you want to simulate your biogeographic history over
- no.areas: the number of regions in the simulated geospacer. Each will be named with the letters of the alphabet by the function, so the
maximum possible is 26
- max.areas: maximum number of areas in which a taxon, lineage, or node may be present. Defaults to the same as the number of areas.
- time.bins: a two column matrix giving the beginning and end of the time bins in which biogeographic events will be counted
- disp.rate: dispersal rate (in practice the probability that a dispersal event will occur at any point in time along a branch)
- vic.rate - vicariance rate (in practice, the probability that a vicariance event will take place at a node)
- ext.rate - local extinction rate (in practice the probability that a local extinction event will occur at any point in time along a branch)
8. Now, finally (sorry, this has been a long intro) we’ll run our simulation. Its going to have four areas, the area adjancy matrix, and the default rates of dispersal, vicariance and local extinction: 0.03. We’ll store the output as an object called “sim”
sim<-diva.sim(test.tree=tree,no.areas=4,max.areas=4,time.bins=time,area.adj=adjacent,disp.rate=0.03,vic.rate=0.03,ext.rate=0.03
9. So, what have we got? The output is a list with four subsections
- $node.states is a matrix where each row represents a node (both tip and internal are included, in the order of the original phylo object) and each column represents an area in which it lives
sim$node.states
- $disp.counts is a vector, giving the number of dispersal events in each time bin (the time bins being the ones we defined before, in that order).
- $vic.counts is the same for vicariance events
- $ext.counts is the same for local extinction events
sim$disp.counts
sim$vic.counts
sim$ext.counts
Simulations are useful for so many things. You make predictions about the effects of parameter changes, test if your methods work, produce example or null datasets…the possibilities are endless.
Here I’m giving you some code that simulates a biogeographic history of a clade. That is, it simulates how the geographic location of lineages have changed through time. The simulation takes a phylogeny and uses a continuous-time Markov process to position various biogeographic events over it. These are:
1. dispersal – a lineage jumping from one area to another, effectively a range expansion;
2. local extinction – a lineage present in more than one area disappears from one region, a range contraction
3.vicariance – a lineage present in two or more regions has a barrier inserted within its range, causing speciation and leading to two lineages, one covering one portion of the original range, a second in the other portion.
The first two of those happen along the branches of the phylogeny, but because vicariance coincides with a speciation events, they happen on a node.
Now a quick warning before we get started on the function; I created this for quite a specific reason. Basically, its part of a new method I designed for calculating dispersal rates through time; I needed a null simulation to show how many dispersal events might be expected to occur if the rate was constant but the number of species (and therefore opportunities for dispersal) changed. See the paper above for full details if you’re interested. As such, the function spits out the information that I found useful when doing this, but others might not find everything it produces interesting or would like it to produce extra info. Let me know if there’s anything you’d be interested in adding.
So, lets get started with the tutorial. Links to download functions and example files are within.
1. If not yet installed, install the R packages spuRs, paleotree, and BioGeoBEARS and all its dependencies*
*NOTE BioGeoBEARS is a bit of a faff to install. See its website for full instructions: http://phylo.wikidot.com/biogeobears
2. Read in the spuRs, paleotree and BioGeoBEARS package and its dependencies**
library(spuRs)
library(paleotree)
library(rexpokit)
library(cladoRcpp)
library(BioGeoBEARS)
**NOTE Normally it doesn’t matter what order you read in the packages, but apparently for BioGeoBEARS it does.
3. Read in the diva.sim() function (download via the link)
4. The function simulates a biogeographic history over a phylogeny, so obviously we need a phylogeny (and it needs to be time-calibrated; the length of the branches needs to represent time). For purposes of example, we’re just going to generate one at random here and store it as an object called “tree”.
tree<-rtree(20, br = runif(1,5,10))
5. The next thing we need is a two column matrix of time bins, where the first column is the beginning of the time bin, the second is the end of it. This is not going to be useful for most people, but I needed it for my calculations of rate through time, so it is required for the function to work. So here I’m just creating 10 equal sized bins covering 10 million years each, and storing it as an object called “time”.
time<- matrix(ncol=2,nrow=10)
time[,1]<-seq(from=100,to=10,by=-10)
time[,2]<-seq(from=90,to=0,by=-10)
time
6. The last thing you might need (its optional) is an area adjacency matrix. Geography is being treated as a set of discrete entities, and a lineage is either present or absent in each region. No information on distance or region size is included. You could simulate it so that a lineage can disperse from any area to any area, but that might not be biologically realistic; for example, a flightless animal could not disperse from South Africa to Europe without passing through northern Africa. The area adjacency matrix can tell the simulation what areas are next to eachother and limit dispersal to between adjacent areas. It needs to be a pairwise, symmetrical matrix comparing each area to each area. A 1 indicates that two areas are adjacent, a 0 that they are not. Here’s one I’ve just made up, stored as an object called “adjacent”.
adjacent<-matrix(nrow=4,ncol=4,data=c(1,1,0,0,1,1,1,0,0,1,1,1,0,0,1,1))
adjacent
7. So, onto the diva.sim() function. It has eight arguments
- test.tree: the phylogeny you want to simulate your biogeographic history over
- no.areas: the number of regions in the simulated geospacer. Each will be named with the letters of the alphabet by the function, so the
maximum possible is 26
- max.areas: maximum number of areas in which a taxon, lineage, or node may be present. Defaults to the same as the number of areas.
- time.bins: a two column matrix giving the beginning and end of the time bins in which biogeographic events will be counted
- disp.rate: dispersal rate (in practice the probability that a dispersal event will occur at any point in time along a branch)
- vic.rate - vicariance rate (in practice, the probability that a vicariance event will take place at a node)
- ext.rate - local extinction rate (in practice the probability that a local extinction event will occur at any point in time along a branch)
8. Now, finally (sorry, this has been a long intro) we’ll run our simulation. Its going to have four areas, the area adjancy matrix, and the default rates of dispersal, vicariance and local extinction: 0.03. We’ll store the output as an object called “sim”
sim<-diva.sim(test.tree=tree,no.areas=4,max.areas=4,time.bins=time,area.adj=adjacent,disp.rate=0.03,vic.rate=0.03,ext.rate=0.03
9. So, what have we got? The output is a list with four subsections
- $node.states is a matrix where each row represents a node (both tip and internal are included, in the order of the original phylo object) and each column represents an area in which it lives
sim$node.states
- $disp.counts is a vector, giving the number of dispersal events in each time bin (the time bins being the ones we defined before, in that order).
- $vic.counts is the same for vicariance events
- $ext.counts is the same for local extinction events
sim$disp.counts
sim$vic.counts
sim$ext.counts