Chapter 5 SDM Data Frame
5.1 Set-up
This example will use the following libraries:
Load the shapefiles, Trillium data and variables raster stack created earlier.
5.2 Variable values for presences
For fitting an SDM, I need a data frame where each row is a grid cell (in my NHVT raster) where the species has been observed (so presence = 1) and I have the variable value for that cell in the other columns. So the idea is to get the variable values for each of these observations.
plot(nhvtshp, border = "blue", axes = TRUE)
plot(subset(trillium, species == "Trillium grandiflorum"), pch = 3,
cex = 1, add = TRUE)
plot(subset(trillium, species == "Trillium undulatum"), pch = 4,
cex = 1, col = "red", add = TRUE)
plot(hbshp, add = TRUE, border = "blue")
EXCEPT that the variable data is on a grid while the observation data is points with lat/lon values. What I need is a data frame where each row is a grid cell where Trillium was observed (either once or many times) and the variable values for that cell.
The following code gets you to that data frame.
5.2.1 Get the variable data
I can create a data frame with the values of the variables where Trillium locations are using the extract()
function. This will take a lat/lon value, figure out what cell that lat/lon pair is in, and return the variable values for that cell. The function takes a raster layer or stack of layers and the point location data (as a spatial points object) and returns the values for each layer in a column. I use cellnumbers=TRUE
to return what cell that lat/lon pair is in. This will allow me to eliminate duplicates (observations from the same cell).
5.2.2 Add on the species, pa, and lon/lat info
I add on columns for the species name and “presence-absence”.
trillVars$species <- trillium$species
trillVars$pa <- 1
trillVars$lon <- raster::xFromCell(r, trillVars$cells)
trillVars$lat <- raster::yFromCell(r, trillVars$cells)
These are for grid cells. For later plotting, I might want the center of the cells so I’ll add that on.
5.2.3 Check for duplicates
Duplicates means multiple occurrences assigned to the same cell. We can find this by seeing how many values in the cells
column are duplicates. There are many. Our raster cells are 0.5 x 0.5 degrees (so 1km or so?). Trillium occurrences in that same cell (lat/lon values that are in the same 0.5 square grid) will have the same cell number so will be “duplicates”, meaning an occurrence in a cell where there is already another occurrence.
# this is where both cell and species are the same
dups <- duplicated(cbind(trillVars$cells, trillVars$species))
sum(dups)
[1] 478
I get rid of them by saying take the non-duplicated cell values.
Now I have about half the number of lines of data.
[1] 750 41
5.3 Background points
SDMs for presence only data need a set of samples from the background area where it is possible that the species could have been observed. Technically, we’d want to weight this background by where searching was more likely or at least remove regions where it is impossible to observe the species. Trillium won’t be observed in urban areas and water bodies so we might want to create a mask of impossible areas not include background from there. But to keep things simple, for now I’ll just sample randomly from my NHVT bounding box.
I will add on 5000 random background points by sampling from all the cells in allVars
. I want to keep the lon/lat information (cell centers in this case).
I need to fix the first three column names since I will be appending this data frame to the one above and that one uses cells
as the cell column name. Also add on species name and presence-absence info.
To make it easier to create my training and testing datasets for the two species, I make two copies of the background with a different species value for each.
background <- rbind(background, background)
background$species <- c(rep("Trillium grandiflorum", 5000), rep("Trillium undulatum",
5000))
In order to bind this data frame to the presence one, I need the column names to also be in the same order. I need to fix that since sampleRandom()
put the lat/lon columns in columns 2 and 3. The first line of code is selecting columns from background
in the order shown after the ,
then I check that I did it right and the colnames are identical
background <- background[, colnames(trillVars)]
identical(colnames(background), colnames(trillVars))
[1] TRUE
5.4 Make final data frame
Now I bind this two data frames together for final data frame with occurrences, climatic data, and my background zeros.
The rownames are annoying so I will set to NULL.
Now I have my final data frame with the following column names.
[1] "cells" "mean.temp"
[3] "temp.diurnal.range" "isotherm"
[5] "temp.seasonality" "max.warm.temp"
[7] "min.cold.temp" "temp.annual.range"
[9] "mean.temp.wet.qtr" "mean.temp.dry.qtr"
[11] "mean.temp.warm.qtr" "mean.temp.cold.qtr"
[13] "total.precip" "precip.wet.month"
[15] "precip.dry.month" "precip.seasonality"
[17] "precip.wet.qtr" "precip.dry.qtr"
[19] "precip.warm.qtr" "precip.cold.qtr"
[21] "elevation" "slope"
[23] "aspect" "Mixed.Needleleaf.Trees"
[25] "Evergreen.Broadleaf.Trees" "Deciduous.Broadleaf.Trees"
[27] "Mixed.Other.Trees" "Shrubs"
[29] "Herbaceous" "Cultivated"
[31] "Flooded" "Urban"
[33] "Snow" "Barren"
[35] "Water" "Dominant.Land.Cover"
[37] "Tree.Cover" "species"
[39] "pa" "lon"
[41] "lat"
5.5 Save
I save the data frame. dat
is a non-descriptive name and a little dangerous to use since the user might have dat
already in their working environment. I will also make dat.und
and dat.grand
for the two Trillium species. There are some NAs in the slope aspect data which I will get rid of.
dat.und <- subset(dat, species == "Trillium undulatum")
dat.und <- na.omit(dat.und)
dat.grand <- subset(dat, species == "Trillium grandiflorum")
dat.grand <- na.omit(dat.grand)
Save.