
Select Spatially Thinned Occurrences Using Moran's I Autocorrelation
Source:R/flag_geo_moran.R
flag_geo_moran.RdThis function evaluates multiple geographically thinned datasets (produced using different thinning distances) and selects the one that best balances low spatial autocorrelation and number of retained records.
For each thinning distance provided in d, the function computes Moran's I
for the selected environmental variables and summarizes autocorrelation using
a chosen statistic (mean, median, minimum, or maximum). The best thinning
level is then selected according to criteria described in Details.
Usage
flag_geo_moran(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
d,
distance = "haversine",
moran_summary = "mean",
min_records = 10,
min_imoran = 0.1,
prioritary_column = NULL,
decreasing = TRUE,
env_layers,
do_pca = FALSE,
mask = NULL,
pca_buffer = 1000,
return_all = FALSE,
verbose = TRUE
)Arguments
- occ
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude.
- species
(character) the name of the column in
occthat contains the species scientific names. Default is"species".- long
(character) the name of the column in
occthat contains the longitude values. Default is"decimalLongitude".- lat
(character) the name of the column in
occthat contains the latitude values. Default is"decimalLatitude".- d
(numeric) vector of thinning distances in kilometers (e.g., c(5, 10, 15, 20)).
- distance
(character) distance metric used to compute the weight matrix for Moran's I. One of
"haversine"or"euclidean". Default:"haversine".- moran_summary
(character) summary statistic used to select the best thinning distance. One of
"mean","median","max", or"min". Default:"mean".- min_records
(numeric) minimum number of records required for a dataset to be considered. Default:
10.- min_imoran
(numeric) minimum Moran's I required to avoid selecting datasets with extremely low spatial autocorrelation. Default:
0.1.- prioritary_column
(character) name of a numeric columns in
occto define retention priority (e.g., quality score, year). See details.- decreasing
(logical) whether to sort records in decreasing order using the
prioritary_column(e.g., from most recent to oldest when the variable is"year"). Only applicable whenprioritary_columnis notNULL. Default isTRUE.- env_layers
(SpatRaster) object containing environmental variables for computing Moran's I.
- do_pca
(logical) whether environmental variables should be summarized using PCA before computing Moran's I. Default:
FALSE. See details.- mask
(SpatVector or SpatExtent) optional spatial object to mask the
env_layersbefore computing PCA. Only applicable ifdo_pca = TRUE. Default is NULL.- pca_buffer
(numeric) buffer width (km) used when PCA is computed from the convex hull of records. Ignored if
maskis provided. Default:1000.- return_all
(logical) whether to return the full list of all thinned datasets. Default:
FALSE.- verbose
(logical) whether to print messages about the progress. Default is
TRUE
Value
A list with:
occ: the selected thinned occurrence dataset with the column
thin_geo_flagindicating whether each record is retained (TRUE) or flagged.imoran: a table summarizing Moran's I for each thinning distance
distance: the thinning distance that produced the selected dataset
moran_summary: the summary statistic used to select the dataset
all_thined: (optional) list of thinned datasets for all distances. Only returned if
return_allwas set toTRUE
Details
This function is inspired by the approach used in Velazco et al. (2021), extending the procedure by allowing:
prioritization of records based on a user-defined variable (e.g., year)
optional PCA transformation of environmental layers
selection rules that prevent datasets with too few records or extremely low Moran's I from being chosen.
Procedure overview
For each distance in
d, generate a spatially thinned dataset usingthin_geo()function.Extract environmental values for the retained records.
Compute Moran's I for each environmental variable.
Summarize autocorrelation per dataset (mean, median, min, or max).
Apply the selection criteria:
Keep only datasets with at least
min_recordsrecords.Keep only datasets with Moran's I higher than
min_imoran.Round Moran's I to two decimal places and select the dataset with the 25th lowest autocorrelation.
If more than on dataset is selected, choose the dataset retaining more records.
If still tied, choose the dataset with the smallest thinning distance.
Distance matrix for Moran's I Moran's I requires a weight matrix derived from pairwise distances among records. Two distance types are available:
"haversine": geographic distance computed withfields::rdist.earth()(default; recommended for longitude/latitude coordinates)"euclidean": Euclidean distance computed withstats::dist()
Environmental PCA (optional)
If do_pca = TRUE, the environmental layers are summarized using PCA before
Moran's I is computed.
If
maskis provided, PCA is computed on masked layers.Otherwise, a convex hull around the records is buffered by
pca_bufferkilometers to define the PCA area.It will select the axis that together explain more than 90% of the variation.
References
Velazco, S. J. E., Svenning, J. C., Ribeiro, B. R., & Laureto, L. M. O. (2021). On opportunities and threats to conserve the phylogenetic diversity of Neotropical palms. Diversity and Distributions, 27(3), 512–523. https://doi.org/10.1111/ddi.13215
Examples
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Select thinned occurrences
occ_geo_moran <- flag_geo_moran(occ = occ, d = c(5, 10, 20, 30),
env_layers = r)
#> Filtering records...
#> Calculating spatial autocorrelation using Moran Index...
# Selected distance
occ_geo_moran$distance
#> [1] "30"
# Number of flagged and unflagged records
sum(occ_geo_moran$occ$thin_geo_flag) #Retained
#> [1] 297
sum(!occ_geo_moran$occ$thin_geo_flag) #Flagged for thinning out
#> [1] 627
# Results os the spatial autocorrelation analysis
occ_geo_moran$imoran
#> species Distance bio_1 bio_7 bio_12 median_moran
#> 5 Araucaria angustifolia 5 0.1678046 0.3300641 0.1719100 0.1719100
#> 10 Araucaria angustifolia 10 0.1511561 0.3003992 0.1665782 0.1665782
#> 20 Araucaria angustifolia 20 0.1318548 0.2699369 0.1473847 0.1473847
#> 30 Araucaria angustifolia 30 0.1072336 0.2590974 0.1499994 0.1499994
#> mean_moran min_moran max_moran n_filtered all_records prop_lost
#> 5 0.2232595 0.1678046 0.3300641 836 924 0.0952381
#> 10 0.2060445 0.1511561 0.3003992 621 924 0.3279221
#> 20 0.1830588 0.1318548 0.2699369 407 924 0.5595238
#> 30 0.1721101 0.1072336 0.2590974 297 924 0.6785714