
Introduction

Even after initial formatting, species occurrence data often retain spatial inconsistencies that can compromise downstream analyses. Common issues include varying spellings of the same country name (e.g., Brasil, Brazil, or BR) or state name, missing administrative information, and coordinates that fall outside the political-administrative unit assigned to the record. This vignette demonstrates how to ensure the spatial consistency of your occurrence records through name standardization, data imputation, verification, and correction.

# Load RuHere package
library(RuHere)

Overview of the functions: [workflow figure omitted]

Standardizing country and state names

Standardizing administrative names is the first step to ensure that all spelling variations and codes are mapped to a single accepted format.

Occurrence data

At this stage, you should have an occurrence dataset that has been standardized using the format_columns() function and merged with bind_here(). For additional details on this workflow, see the vignette “1. Obtaining and preparing species occurrence data”.

To illustrate how the functions work, we use the example occurrence dataset included in the package, which contains records for three species: the Paraná pine (Araucaria angustifolia), the azure jay (Cyanocorax caeruleus), and the yellow trumpet tree (Handroanthus serratifolius).

# Loading package occurrence data
data("occurrences", package = "RuHere")
# Number of records per species
table(occurrences$species)
#> 
#>     Araucaria angustifolia       Cyanocorax caeruleus 
#>                        924                       1035 
#> Handroanthus serratifolius 
#>                       2121

Standardizing countries (standardize_countries)

This function harmonizes country names using exact matching first, then fuzzy matching to correct typos and spelling variants. Inputs are compared against a comprehensive dictionary of names and codes built from the rnaturalearthdata::map_units110 dataset.
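
For intuition, the sketch below reproduces the core idea of the fuzzy-matching step with base R's agrep(); the three-entry dictionary is a made-up stand-in for the full reference list, not the package's internal code:

# Hypothetical mini-dictionary standing in for the full reference list
dictionary <- c("brazil", "bolivia", "argentina")

# agrep() returns the entries within the allowed edit distance of the input
agrep("brasil", dictionary, max.distance = 1, value = TRUE)
#> [1] "brazil"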

# Standardize country names
occ_country_std <- standardize_countries(
    occ = occurrences,
    country_column = "country",
    max_distance = 0.1,      # Maximum error distance for fuzzy matching
    lookup_na_country = TRUE # Try to derive the country from the coordinates
                             # when the value is NA (calls the
                             # country_from_coords() function internally)
)

This function returns a list with two elements:

  • $occ: the original data frame with two new columns: country_suggested (the standardized or corrected country name) and country_source (whether the suggested country came from the original metadata or was imputed from coordinates).

  • $report: a summary of the corrections made, showing the original name and the suggested/standardized name.

Below are the first few rows of the modified data frame and the standardization report:

# Printing first rows and columns
occ_country_std$occ[1:3, 1:5]
#>   country country_suggested country_source  record_id               species
#> 1      AR         argentina       metadata  gbif_5516  Araucaria angustifolia
#> 2      AR         argentina       metadata gbif_15849  Araucaria angustifolia
#> 3      AR         argentina       metadata  gbif_4935  Araucaria angustifolia

occ_country_std$report[1:5, ]
#>      country country_suggested
#> 1  argentina         argentina
#> 2    bolivia           bolivia
#> 3     brasil            brazil
#> 4         UY           uruguay
#> 5         PT          portugal
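
To review only the entries that actually changed, you can filter the report with a simple base R subset (not a package feature):

# Show only the names where a correction was suggested
subset(occ_country_std$report, country != country_suggested)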

Standardizing states (standardize_states)

Similarly, this function standardizes state or province names. It uses the previously standardized country column (country_suggested) to disambiguate states that share a name across countries, taking as reference the names and postal codes in the rnaturalearthdata::states50 dataset.
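
The disambiguation can be pictured as filtering the reference table by country before matching. Below is a minimal sketch with a made-up two-row reference table (illustrative only, not the package's internals):

# Hypothetical reference table: 'la rioja' exists in both Spain and Argentina
ref <- data.frame(
    country = c("spain", "argentina"),
    state   = c("la rioja", "la rioja")
)

# Restricting the candidates to the record's country removes the ambiguity
candidates <- ref$state[ref$country == "argentina"]
agrep("la rioja", candidates, max.distance = 1, value = TRUE)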

# Standardize state names
occ_state_std <- standardize_states(
    occ = occ_country_std$occ,
    state_column = "stateProvince",
    country_column = "country_suggested",
    max_distance = 0.1,
    lookup_na_state = TRUE # Try to extract state from coords if value is NA
)

Like standardize_countries(), the standardize_states() function returns a list with two elements:

  • $occ: the input data frame with two new columns: state_suggested (the standardized or corrected state/province name) and state_source (indicates whether the suggested state came from the original metadata or was imputed from coordinates).

  • $report: a summary table of the corrections and standardizations made, showing the original name and the suggested name, constrained by the suggested country.

Below are the first few rows of the modified data frame and the standardization report:

occ_state_std$occ[1:3, 1:6]
#>   stateProvince state_suggested state_source country_suggested country country_source
#> 1          acre            acre     metadata            brazil  brazil       metadata
#> 2          acre            acre     metadata            brazil  brazil       metadata
#> 3          acre            acre     metadata            brazil  brazil       metadata

occ_state_std$report[1:3, ]
#>       stateProvince           state_suggested  country_suggested
#> 1         são paulo                 sao paulo             brazil
#> 2         tocantins                 tocantins             brazil
#> 3               RS          rio grande do sul             brazil

Imputing geographic information from coordinates

Sometimes, records have valid coordinates but lack administrative labels entirely. We can use spatial intersection to retrieve this information.

Extracting country from coordinates (country_from_coords)

This function uses geographic coordinates (long, lat) and a reference world map (rnaturalearthdata::map_units110()) to determine the country for each point.
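
Conceptually, this is a point-in-polygon spatial join. The sketch below shows the general idea with sf and rnaturalearth; it is an illustration rather than the package's internal implementation, and the coordinate column names (decimalLongitude, decimalLatitude) are assumptions:

library(sf)
library(rnaturalearth)

# Reference polygons: world countries at 1:110m scale
world <- ne_countries(scale = 110, returnclass = "sf")

# Convert occurrences to spatial points (WGS84);
# the coordinate column names are assumed here
pts <- st_as_sf(occurrences,
                coords = c("decimalLongitude", "decimalLatitude"),
                crs = 4326)

# Each point inherits the attributes of the polygon it falls in
joined <- st_join(pts, world["admin"]) # 'admin' holds the country name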

# Explicitly extract country from coordinates for all records
occ_with_country_xy <- country_from_coords(
    occ = occ_state_std$occ,
    from = "all", # 'all' extracts for every record; 'na_only' extracts for missing ones
    output_column = "country_xy"
)

# Compare the original country vs. the one derived from coordinates
head(occ_with_country_xy[, c("country", "country_xy")])
#>   country country_xy
#> 1  brazil     brazil
#> 2  brazil     brazil
#> 3  brazil     brazil
#> 4      BR     brazil
#> 5      BR     brazil
#> 6      BR     brazil

Extracting state from coordinates (states_from_coords)

Similarly, we can extract state or province names. Here, we derive the state for every record (from = "all") and write the result to a separate column (state_xy) so it can be compared against the existing metadata.

# Extract state from coordinates for all records
occ_imputed <- states_from_coords(
    occ = occ_with_country_xy,
    from = "all",
    state_column = "stateProvince",
    output_column = "state_xy"
)

head(occ_imputed[, c("stateProvince", "state_xy", "state_source")])
#>   stateProvince state_xy state_source
#> 1          acre     acre     metadata
#> 2          acre     acre     metadata
#> 3          acre     acre     metadata
#> 4          acre amazonas     metadata
#> 5          acre     acre     metadata
#> 6          acre     acre     metadata
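
To list only the records where the metadata label and the coordinate-derived state disagree, a simple subset works (base R, not a package function):

# Records whose metadata state differs from the coordinate-derived state
mismatches <- subset(occ_imputed, stateProvince != state_xy)
head(mismatches[, c("stateProvince", "state_xy", "state_source")])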

Checking and fixing spatial inconsistencies

A critical quality control step is verifying whether the coordinates actually fall within the administrative unit assigned to them. Discrepancies often indicate errors in either the label or the coordinates.

Checking country consistency (check_countries)

This function compares the coordinates against the boundaries of the country assigned in the country_suggested column.
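
Ignoring the buffer, the check boils down to asking whether the assigned country matches the country the point actually falls in, which we already derived above:

# Crude version of the check, without the border buffer
# that check_countries() applies
agree <- occ_imputed$country_suggested == occ_imputed$country_xy
table(agree, useNA = "ifany")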

# Check if coordinates fall within the assigned country
occ_checked_country <- check_countries(
    occ = occ_imputed,
    country_column = "country_suggested",
    distance = 5,      # Allows a 5 km buffer for border points
    try_to_fix = TRUE  # Automatically attempts to fix inverted/swapped coordinates
)
#> Testing countries...
#> 468 records fall in wrong countries
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 2 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 1 coordinate with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped, with longitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped, with latitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped, with longitude and latitude inverted
#> 0 coordinates with longitude and latitude swapped and both inverted

# The 'correct_country' column indicates validity
head(occ_checked_country[, c("country_suggested", "correct_country", "country_issues")])
#>   country_suggested correct_country country_issues
#> 1            brazil            TRUE        correct
#> 2            brazil            TRUE        correct
#> 3            brazil            TRUE        correct
#> 4            brazil            TRUE        correct
#> 5            brazil            TRUE        correct
#> 6            brazil            TRUE        correct

The column correct_country is added, indicating TRUE if the point falls within the country. Because we set try_to_fix = TRUE, the function internally calls fix_countries() to identify and correct errors like swapped latitude/longitude, recording the action in country_issues.
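
A quick way to see which error types were detected and how many records were corrected is to tabulate that column:

# Summarize the outcome of the automatic fixes
table(occ_checked_country$country_issues)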

Checking state consistency (check_states)

We perform a similar verification for states. Note that check_states verifies points against the state_suggested column.

# Check if coordinates fall within the assigned state
occ_checked_state <- check_states(
    occ = occ_checked_country,
    state_column = "state_suggested",
    distance = 5,
    try_to_fix = FALSE # We just want to flag issues here, not auto-fix
)
#> Testing states...
#> 87 records fall in wrong states

head(occ_checked_state[, c("state_suggested", "correct_state")])
#>   state_suggested correct_state
#> 1            acre          TRUE
#> 2            acre          TRUE
#> 3            acre          TRUE
#> 4            acre         FALSE
#> 5            acre          TRUE
#> 6            acre          TRUE

The correct_country and correct_state columns represent the first set of flags: records marked FALSE indicate potentially erroneous entries. For additional details on how to explore and remove flagged records, see the vignette “3. Flagging records using associated information”.
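
As a quick preview of that workflow, the flagged records can be pulled out with a simple subset:

# Records failing either consistency check (candidates for review or removal)
flagged <- subset(occ_checked_state, !correct_country | !correct_state)
nrow(flagged)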

Fixing coordinate errors explicitly (fix_countries)

If you prefer to run the fixing process separately (instead of inside check_countries), you can use fix_countries(). This function runs seven distinct tests to detect issues such as inverted signs or swapped coordinates.
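
Conceptually, the seven tests correspond to seven candidate transformations of each failing coordinate pair. The sketch below is illustrative only; the real function also re-checks every candidate against the assigned country's boundaries:

# Candidate corrections tried for a point (lon, lat) that fails the check;
# each candidate must be re-validated against the country polygon
candidate_fixes <- function(lon, lat) {
    list(
        lon_inverted          = c(-lon,  lat), # Task 1
        lat_inverted          = c( lon, -lat), # Task 2
        both_inverted         = c(-lon, -lat), # Task 3
        swapped               = c( lat,  lon), # Task 4
        swapped_lon_inverted  = c(-lat,  lon), # Task 5
        swapped_lat_inverted  = c( lat, -lon), # Task 6
        swapped_both_inverted = c(-lat, -lon)  # Task 7
    )
}

candidate_fixes(-46.6, -23.5) # a point near São Paulo, Brazil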

# This step is only necessary if you did NOT set try_to_fix = TRUE above
fixing_example <- fix_countries(
   occ = occ_checked_country,
   country_column = "country_suggested",
   correct_country = "correct_country" # Column created by check_countries
)
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 0 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 0 coordinates with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped, with longitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped, with latitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped, with longitude and latitude inverted
#> 0 coordinates with longitude and latitude swapped and both inverted

Records identified as “inverted” or “swapped” are corrected in place, and the country_issues column is updated to reflect the specific error type found.

Now that our dataset has standardized and verified countries and states, we can move on to the next step: “3. Flagging records using associated information”.