
2. Ensuring spatial consistency: countries, states, and coordinates
Introduction
Even after initial formatting, species occurrence data often retain spatial inconsistencies that can compromise subsequent analyses. Common issues include varying spellings for the same country (e.g., Brasil, Brazil, or BR) or state name, missing administrative information, or coordinates that fall outside the political-administrative jurisdiction assigned to the record. This vignette demonstrates how to ensure the spatial consistency of your occurrence records through name standardization, data imputation, verification, and correction.
Overview of the functions:
- standardize_countries(): standardizes country names and codes.
- standardize_states(): standardizes state/province names and codes.
- country_from_coords(): extracts the country name from geographic coordinates.
- states_from_coords(): extracts the state/province name from geographic coordinates.
- check_countries(): verifies if coordinates fall within the boundaries of the assigned country.
- check_states(): verifies if coordinates fall within the boundaries of the assigned state/province.
- fix_countries(): identifies and corrects common coordinate errors based on country jurisdiction.
Standardizing country and state names
Standardizing administrative names is the first step to ensure that all spelling variations and codes are mapped to a single accepted format.
Occurrence data
At this stage, you should have an occurrence dataset that has been
standardized using the format_columns() function and merged
with bind_here(). For additional details on this workflow,
see the vignette “1. Obtaining and preparing species occurrence
data”.
To illustrate how these functions work, we use the example occurrence dataset included in the package, which contains records for three species: the Paraná pine (Araucaria angustifolia), the azure jay (Cyanocorax caeruleus), and the yellow trumpet tree (Handroanthus albus).
Standardizing countries (standardize_countries)
This function harmonizes country names using exact matching and fuzzy
matching to correct typos and variations. It compares the input against
a comprehensive dictionary of names and codes provided in
rnaturalearthdata::map_units110().
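To build intuition for what max_distance controls, the sketch below mimics a two-stage lookup in base R: an exact match against a small dictionary, followed by a fuzzy match on edit distance. The toy dictionary, the helper name match_country(), and the scaling of the distance (edits divided by the length of the longer string) are assumptions made for illustration; the package's internal matching may work differently.

```r
# Toy dictionary mapping known spellings/codes to one accepted name.
# Illustrative only: the package relies on a far larger reference table.
dictionary <- data.frame(
  variant  = c("brazil", "brasil", "br", "argentina", "ar", "uruguay", "uy"),
  accepted = c("brazil", "brazil", "brazil", "argentina", "argentina",
               "uruguay", "uruguay"),
  stringsAsFactors = FALSE
)

# Hypothetical helper for a single input string:
# exact match first, then fuzzy match on scaled edit distance.
match_country <- function(x, dict, max_distance = 0.1) {
  x_clean <- tolower(trimws(x))
  # 1) Exact match against known variants
  hit <- match(x_clean, dict$variant)
  if (!is.na(hit)) return(dict$accepted[hit])
  # 2) Fuzzy match: Levenshtein distance divided by the longer string length
  d <- as.vector(utils::adist(x_clean, dict$variant))
  d_rel <- d / pmax(nchar(x_clean), nchar(dict$variant))
  best <- which.min(d_rel)
  if (d_rel[best] <= max_distance) dict$accepted[best] else NA_character_
}

match_country("Brasil", dictionary)                          # exact: "brazil"
match_country("Argentinna", dictionary, max_distance = 0.2)  # fuzzy: "argentina"
match_country("Atlantis", dictionary, max_distance = 0.2)    # no match: NA
```

Under this scaling, a larger max_distance tolerates more typos but also raises the risk of mapping an unrelated name to a real country, which is why a small threshold is a cautious starting point. The actual package call follows.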
# Standardize country names
occ_country_std <- standardize_countries(
  occ = occurrences,
  country_column = "country",
  max_distance = 0.1,      # Maximum error distance for fuzzy matching
  lookup_na_country = TRUE # Try to extract country from coords if value is
                           # NA using the country_from_coords() function internally
)
This function returns a list with two elements:
- $occ: the original data frame with two new columns: country_suggested (the standardized or corrected country name) and country_source (whether the suggested country came from the original metadata or was imputed from coordinates).
- $report: a summary of the corrections made, showing the original name and the suggested/standardized name.
Below are the first few rows of the modified data frame and the standardization report:
# Printing first rows and columns
occ_country_std$occ[1:3, 1:5]
#>   country country_suggested country_source  record_id                species
#> 1      AR         argentina       metadata  gbif_5516 Araucaria angustifolia
#> 2      AR         argentina       metadata gbif_15849 Araucaria angustifolia
#> 3      AR         argentina       metadata  gbif_4935 Araucaria angustifolia
occ_country_std$report[1:5, ]
#>     country country_suggested
#> 1 argentina         argentina
#> 2   bolivia           bolivia
#> 3    brasil            brazil
#> 4        UY           uruguay
#> 5        PT          portugal
Standardizing states (standardize_states)
Similarly, this function standardizes state or province names. It
uses the previously standardized country column
(country_suggested) to disambiguate states that might share
names across different countries, using as reference the names and
postal codes provided in rnaturalearthdata::states50().
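The role of the country column can be illustrated with a small base R sketch. State names are not unique worldwide (Amazonas, for example, is a state in Brazil and a department in Colombia), and a code such as RS only makes sense within one country. The reference table and the helper match_state() below are hypothetical and only show the idea of matching on the country and state pair rather than on the state alone.

```r
# Toy reference table: the same state name can occur in several countries.
states_ref <- data.frame(
  country = c("brazil", "colombia", "brazil", "brazil"),
  variant = c("amazonas", "amazonas", "rio grande do sul", "rs"),
  state   = c("amazonas", "amazonas", "rio grande do sul", "rio grande do sul"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a state is only looked up among rows of its country.
match_state <- function(state, country, ref) {
  key_in  <- paste(tolower(trimws(country)), tolower(trimws(state)), sep = "|")
  key_ref <- paste(ref$country, ref$variant, sep = "|")
  ref$state[match(key_in, key_ref)]
}

# "RS" resolves under brazil but not under colombia
match_state(
  state   = c("RS", "Amazonas", "RS"),
  country = c("brazil", "colombia", "colombia"),
  ref     = states_ref
)
```

In this sketch the third record returns NA because the code RS is not defined for colombia in the toy table, which is the kind of mismatch a country-constrained lookup is meant to expose.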
# Standardize state names
occ_state_std <- standardize_states(
  occ = occ_country_std$occ,
  state_column = "stateProvince",
  country_column = "country_suggested",
  max_distance = 0.1,
  lookup_na_state = TRUE # Try to extract state from coords if value is NA
)
Like standardize_countries(), the standardize_states() function returns a list with two elements:
- $occ: the input data frame with two new columns: state_suggested (the standardized or corrected state/province name) and state_source (indicates whether the suggested state came from the original metadata or was imputed from coordinates).
- $report: a summary table of the corrections and standardizations made, showing the original name and the suggested name, constrained by the suggested country.
Below are the first few rows of the modified data frame and the standardization report:
occ_state_std$occ[1:3, 1:6]
#>   stateProvince state_suggested state_source country_suggested country country_source
#> 1          acre            acre     metadata            brazil  brazil       metadata
#> 2          acre            acre     metadata            brazil  brazil       metadata
#> 3          acre            acre     metadata            brazil  brazil       metadata
occ_state_std$report[1:3, ]
#>   stateProvince   state_suggested country_suggested
#> 1    sa£o paulo         sao paulo            brazil
#> 2     tocantins         tocantins            brazil
#> 3            RS rio grande do sul            brazil
Imputing geographic information from coordinates
Sometimes, records have valid coordinates but lack administrative labels entirely. We can use spatial intersection to retrieve this information.
Extracting country from coordinates (country_from_coords)
This function uses geographic coordinates (long,
lat) and a reference world map
(rnaturalearthdata::map_units110()) to determine the
country for each point.
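Conceptually, this is a point-in-polygon lookup. The base R sketch below shows the idea with a classic ray-casting test and two made-up rectangular "countries". Everything in it (the polygons, the names westland and eastland, and the helpers point_in_polygon() and locate_point()) is hypothetical; the package works with real map geometries and is not claimed to use this algorithm.

```r
# Ray-casting test: is point (x, y) inside the polygon with vertices (px, py)?
point_in_polygon <- function(x, y, px, py) {
  n <- length(px)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge between vertices i and j straddle the horizontal line at y?
    if ((py[i] > y) != (py[j] > y)) {
      # Longitude at which the edge crosses that horizontal line
      x_cross <- px[i] + (y - py[i]) * (px[j] - px[i]) / (py[j] - py[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Two made-up rectangular "countries" (vertices in long/lat)
toy_countries <- list(
  westland = list(long = c(-60, -50, -50, -60), lat = c(-30, -30, -20, -20)),
  eastland = list(long = c(-50, -40, -40, -50), lat = c(-30, -30, -20, -20))
)

# Return the name of the first polygon containing the point, or NA
locate_point <- function(long, lat, polygons) {
  hits <- vapply(polygons,
                 function(p) point_in_polygon(long, lat, p$long, p$lat),
                 logical(1))
  if (any(hits)) names(polygons)[which(hits)[1]] else NA_character_
}

pts <- data.frame(long = c(-55, -45, -10), lat = c(-25, -25, -25))
pts$country_xy <- mapply(locate_point, pts$long, pts$lat,
                         MoreArgs = list(polygons = toy_countries))
pts
```

The third point falls outside both toy polygons, so its country_xy is NA. With real data, the same situation can arise for points at sea or outside the coverage of the reference map.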
# Explicitly extract country from coordinates for all records
occ_with_country_xy <- country_from_coords(
  occ = occ_state_std$occ,
  from = "all", # 'all' extracts for every record; 'na_only' extracts for missing ones
  output_column = "country_xy"
)
# Compare the original country vs. the one derived from coordinates
head(occ_with_country_xy[, c("country", "country_xy")])
#>   country country_xy
#> 1  brazil     brazil
#> 2  brazil     brazil
#> 3  brazil     brazil
#> 4      BR     brazil
#> 5      BR     brazil
#> 6      BR     brazil
Extracting state from coordinates (states_from_coords)
Similarly, we can extract state or province names. Here, we
demonstrate filling all records (from = "all") and
appending a source column to track where the data came from.
# Extract state from coordinates for all records
occ_imputed <- states_from_coords(
  occ = occ_with_country_xy,
  from = "all",
  state_column = "stateProvince",
  output_column = "state_xy"
)
head(occ_imputed[, c("stateProvince", "state_xy", "state_source")])
#>   stateProvince state_xy state_source
#> 1          acre     acre     metadata
#> 2          acre     acre     metadata
#> 3          acre     acre     metadata
#> 4          acre amazonas     metadata
#> 5          acre     acre     metadata
#> 6          acre     acre     metadata
Checking and fixing spatial inconsistencies
A critical quality control step is verifying whether the coordinates actually fall within the administrative unit assigned to them. Discrepancies often indicate errors in either the label or the coordinates.
Checking country consistency (check_countries)
This function compares the coordinates against the boundaries of the
country assigned in the country_suggested column.
# Check if coordinates fall within the assigned country
occ_checked_country <- check_countries(
  occ = occ_imputed,
  country_column = "country_suggested",
  distance = 5,     # Allows a 5 km buffer for border points
  try_to_fix = TRUE # Automatically attempts to fix inverted/swapped coordinates
)
#> Testing countries...
#> 468 records fall in wrong countries
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 2 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 1 coordinates with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped with longitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped - with latitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped - with longitude latitude inverted
#> 0 coordinates with longitude and latitude swapped and inverted
# The 'correct_country' column indicates validity
head(occ_checked_country[, c("country_suggested", "correct_country", "country_issues")])
#>   country_suggested correct_country country_issues
#> 1            brazil            TRUE        correct
#> 2            brazil            TRUE        correct
#> 3            brazil            TRUE        correct
#> 4            brazil            TRUE        correct
#> 5            brazil            TRUE        correct
#> 6            brazil            TRUE        correct
The correct_country column is added, indicating TRUE if the point falls within the country. Because we set try_to_fix = TRUE, the function internally calls fix_countries() to identify and correct errors like swapped latitude/longitude, recording the action in country_issues.
Checking state consistency (check_states)
We perform a similar verification for states. Note that
check_states verifies points against the
state_suggested column.
# Check if coordinates fall within the assigned state
occ_checked_state <- check_states(
  occ = occ_checked_country,
  state_column = "state_suggested",
  distance = 5,
  try_to_fix = FALSE # We just want to flag issues here, not auto-fix
)
#> Testing states...
#> 87 records fall in wrong states
head(occ_checked_state[, c("state_suggested", "correct_state")])
#>   state_suggested correct_state
#> 1            acre          TRUE
#> 2            acre          TRUE
#> 3            acre          TRUE
#> 4            acre         FALSE
#> 5            acre          TRUE
#> 6            acre          TRUE
The correct_country and correct_state
columns represent the first set of flags: records marked as FALSE
indicate potentially erroneous entries. For additional details on how to
explore and remove flagged records, see the vignette “3. Flagging
Records Using Record Information”.
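As a quick preview of what these flags allow, the base R sketch below tabulates and filters them on a small made-up data frame. The column names follow the output shown above, but the data, and the choice to treat a missing flag (NA) as "not flagged", are assumptions for illustration only.

```r
# Toy stand-in for a checked dataset (values are made up)
occ_toy <- data.frame(
  record_id       = paste0("rec_", 1:6),
  correct_country = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  correct_state   = c(TRUE, TRUE, FALSE, FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two flags, keeping NA as its own category
table(country_ok = occ_toy$correct_country,
      state_ok   = occ_toy$correct_state,
      useNA      = "ifany")

# A record is flagged if either check is explicitly FALSE;
# `%in% FALSE` keeps the comparison safe when a flag is NA
is_flagged <- occ_toy$correct_country %in% FALSE |
  occ_toy$correct_state %in% FALSE

flagged <- occ_toy[is_flagged, ]
clean   <- occ_toy[!is_flagged, ]
flagged$record_id
```

Keeping flagged records in a separate object, rather than deleting them outright, makes it easier to inspect them before deciding what to discard.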
Fixing coordinate errors explicitly (fix_countries)
If you prefer to run the fixing process separately (instead of inside
check_countries), you can use fix_countries().
This function runs seven distinct tests to detect issues such as
inverted signs or swapped coordinates.
# This step is only necessary if you did NOT set try_to_fix = TRUE above
fixing_example <- fix_countries(
  occ = occ_checked_country,
  country_column = "country_suggested",
  correct_country = "correct_country" # Column created by check_countries
)
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 0 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 0 coordinates with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped with longitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped - with latitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped - with longitude latitude inverted
#> 0 coordinates with longitude and latitude swapped and inverted
Records identified as “inverted” or “swapped” are corrected in place,
and the country_issues column is updated to reflect the
specific error type found.
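The logic behind the seven tests can be sketched in base R: each test applies one sign flip and/or swap to the coordinates and asks whether the transformed point now falls inside the assigned country. The sketch below is a simplified illustration, not the package's implementation. It uses a rough bounding box for mainland Brazil instead of real borders, and the helper names (candidate_fixes(), in_bbox(), suggest_fix()) as well as the convention that sign flips are applied after swapping are assumptions.

```r
# The seven candidate transformations of a (long, lat) pair.
# Assumed convention: for swapped variants, the sign flip is applied
# to the coordinate after the swap.
candidate_fixes <- function(long, lat) {
  data.frame(
    test = c("longitude inverted",
             "latitude inverted",
             "longitude and latitude inverted",
             "longitude and latitude swapped",
             "swapped, longitude inverted",
             "swapped, latitude inverted",
             "swapped and inverted"),
    long = c(-long,  long, -long,  lat, -lat,   lat,  -lat),
    lat  = c(  lat,  -lat,  -lat, long, long, -long, -long),
    stringsAsFactors = FALSE
  )
}

# Rough bounding box for mainland Brazil (approximate, for illustration only)
brazil_bbox <- c(xmin = -74, xmax = -34.7, ymin = -33.8, ymax = 5.3)

in_bbox <- function(long, lat, bbox) {
  long >= bbox[["xmin"]] & long <= bbox[["xmax"]] &
    lat >= bbox[["ymin"]] & lat <= bbox[["ymax"]]
}

# Return "correct", the first transformation that lands inside the box,
# or "no fix found"
suggest_fix <- function(long, lat, bbox) {
  if (in_bbox(long, lat, bbox)) return("correct")
  cand <- candidate_fixes(long, lat)
  ok <- in_bbox(cand$long, cand$lat, bbox)
  if (any(ok)) cand$test[which(ok)[1]] else "no fix found"
}

# A point in southern Brazil entered with latitude and longitude swapped
suggest_fix(long = -25.4, lat = -49.3, bbox = brazil_bbox)
```

A bounding box is a much cruder test than a real border polygon, so more than one transformation could pass for some points. Reporting only the first passing test, as done here, is a simplification.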
Now that our dataset has its countries and states standardized and checked, we can move on to the next step: “3. Flagging Records Using Associated Information”.