3  Data Sources Overview

3.1 Canadian Outline

3.1.1 Source and Description

The Canada Outline dataset is a simplified polygon layer of Canada’s national boundaries. It was produced from the GADM Level 1 administrative boundary data for Canada, with the geometry simplified for efficient large-scale spatial analysis (Global Administrative Areas, 2023). The resulting GeoPackage file represents Canada and its provinces/territories with reduced geometric complexity and is suitable for mapping and spatial analysis tasks that require simplified boundaries.

3.1.2 Processing Steps

The processed data was generated using the prc_canada_outline script. The following steps were undertaken:

  1. Data Acquisition:
    • GADM Level 1 administrative boundary data for Canada was retrieved using the geodata::gadm function.
  2. Geometry Simplification:
    • The polygon geometry was simplified using the terra::simplifyGeom function with a tolerance of 0.1 (expressed in the units of the layer’s coordinate reference system) to reduce data complexity.
  3. Output Generation:
    • The simplified boundary dataset was saved as can_1_simplified.gpkg in GeoPackage format.
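
The steps above can be sketched in R as follows. This is a minimal illustration, not the contents of prc_canada_outline: the helper name simplify_outline and the toy polygon are assumptions introduced here, and the geodata::gadm call is left as a comment because it downloads data.

```r
library(terra)

# Hypothetical helper: simplify a boundary layer and save it as a GeoPackage.
simplify_outline <- function(v, out_path, tolerance = 0.1) {
  simplified <- terra::simplifyGeom(v, tolerance = tolerance)
  terra::writeVector(simplified, out_path, overwrite = TRUE)
  simplified
}

# With the real data (requires a network connection):
# can <- geodata::gadm(country = "CAN", level = 1, path = tempdir())

# Toy stand-in: a square with two nearly collinear vertices (invented coordinates).
toy <- terra::vect(
  "POLYGON ((0 0, 1 0.01, 2 0, 2 2, 1 1.99, 0 2, 0 0))",
  crs = "EPSG:4326"
)

out_file <- file.path(tempdir(), "can_1_simplified.gpkg")
simplified <- simplify_outline(toy, out_file)

# Compare the number of vertices before and after simplification.
nrow(terra::geom(toy))
nrow(terra::geom(simplified))
```

Vertices that lie closer to the simplified outline than the tolerance are dropped, which is why a larger tolerance yields a lighter but coarser boundary.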

3.1.3 Processed Data Structure

The structure of the simplified dataset is as follows:

Table 3.1: Structure of the simplified Canada outline dataset
Column_Name Description
gid_1 Unique identifier for the first administrative level.
gid_0 Country code for Canada.
country Name of the country (Canada).
name_1 Name of the first administrative level (Province/Territory).
varname_1 Alternative or variant names for the region.
nl_name_1 Native language names for the region.
type_1 Type of administrative unit (Province/Territory).
engtype_1 English type name for the administrative unit.
cc_1 Internal code for the administrative region.
hasc_1 Hierarchical administrative subdivision code.
iso_1 ISO code for the administrative region.
geom Simplified polygon geometry for the region.

3.2 Canadian Freshwater Fish Species

3.2.1 Source and Description

The dataset, titled FishBase Freshwater Species Checklist, was harvested from FishBase (Froese and Pauly, 2024). This dataset provides information on freshwater fish species found across Canada. It includes essential taxonomic and ecological details, such as species name, vernacular name, family, order, and occurrence status. Additional datasets retrieved from FishBase include species status (e.g., threat categories from the IUCN Red List of Threatened Species) and whether the species is classified as game or commercial. The source data consists of HTML tables retrieved through web scraping and transformed into a structured format for analysis.

3.2.2 Processing Steps

The harvested raw data was processed using the prc_freshwater_fish_canada script. The following steps were performed:

  1. Data Cleaning:
    • Column names in all input files were standardized using the janitor::clean_names function.
    • Unnecessary columns, such as name_in_country, were removed.
  2. Field Selection and Renaming:
    • The fish_base_name column was renamed to vernacular for clarity.
  3. Integration:
    • Species occurrence data was joined with additional tables for status, game species, and commercial species using the species column as a key.
  4. Data Transformation:
    • A unique species_id was generated for each species record.
  5. Data Export:
    • The fully integrated and processed dataset was saved as freshwater_fish_species_canada.csv.
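
The cleaning, renaming, and joining steps above can be sketched as follows. The input tables are invented placeholders (the species names are not real taxa), and generating species_id from the row number is an assumption, since the source does not state how the identifier is built.

```r
library(dplyr)

# Placeholder inputs standing in for the scraped FishBase tables.
occurrence <- data.frame(
  species = c("Genus alpha", "Genus beta", "Genus gamma"),
  fish_base_name = c("Alpha fish", "Beta fish", "Gamma fish"),
  name_in_country = c(NA, NA, NA),
  occurrence = c("native", "native", "introduced")
)
status <- data.frame(species = "Genus alpha", threat_category = "Vulnerable (VU)")
game <- data.frame(species = c("Genus alpha", "Genus beta"), game = 1)
commercial <- data.frame(species = "Genus gamma", commercial = 1)

fish <- occurrence %>%
  select(-name_in_country) %>%                 # drop the unneeded column
  rename(vernacular = fish_base_name) %>%      # clearer column name
  left_join(status, by = "species") %>%        # species without a match get NA
  left_join(game, by = "species") %>%
  left_join(commercial, by = "species") %>%
  mutate(species_id = row_number(), .before = 1)  # assumed ID scheme

write.csv(
  fish,
  file.path(tempdir(), "freshwater_fish_species_canada.csv"),
  row.names = FALSE
)
```

Because left joins keep every row of the occurrence table, species absent from the game or commercial lists end up with NA in those columns, which matches the 1 = yes, NA = no coding of the processed dataset.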

3.2.3 Processed Data Structure

The processed dataset retains the following structure:

Table 3.2: Structure of the processed freshwater fish species dataset for Canada
Column_Name Description
species_id Unique identifier for each fish species entry.
species Scientific name of the fish species.
vernacular Common name of the fish species.
order Taxonomic order to which the species belongs.
family Taxonomic family to which the species belongs.
occurrence Occurrence status of the species in Canada (e.g., native).
threat_category Threat category assigned to the species, if applicable (e.g., Vulnerable (VU), Endangered (EN)).
game Indicator of whether the species is classified as a game species (1 = yes, NA = no).
commercial Indicator of whether the species is classified as a commercial species (1 = yes, NA = no).

3.3 Canadian Hydrological Features

3.3.1 Source and Description

The dataset, titled Atlas of Canada National Scale Data 1:1,000,000 - Waterbodies & Rivers, was harvested from Natural Resources Canada (Canada, 2022b, 2022a). This comprehensive GIS dataset includes spatial and tabular data representing Canada’s hydrological features. It consists of waterbodies and river datasets compiled for large-scale mapping at a 1:1,000,000 scale. The processed dataset provides geospatial layers of lakes and rivers, including both their polygonal representations and sampled points for further spatial analyses.

3.3.2 Processing Steps

The harvested raw data was processed using the prc_atlas_of_canada_hydrology script. The following steps were performed:

  1. Unzipping: The compressed GDB files were extracted to a temporary directory for processing.
  2. Lake Processing:
    • Loaded the “AC_1M_Waterbodies” layer using the sf::st_read function.
    • Calculated the area and perimeter of each lake using sf::st_area and lwgeom::st_perimeter_lwgeom.
    • Filtered lakes to keep only those with an area > 5 km².
    • Generated a unique waterbody_id for each lake and classified it as type lake.
    • Retained relevant columns: waterbody_id, wb_type, name, name_fr, area, and perimeter.
    • Transformed the data to CRS 4326 and exported it as lakes_polygons.gpkg.
    • Generated centroids for each lake and exported the point layer as lakes_points.gpkg.
  3. River Processing:
    • Loaded the “AC_1M_Rivers_dense” layer using the sf::st_read function.
    • Calculated the length of each river segment using sf::st_length.
    • Filtered rivers to keep only segments with a length > 10 km.
    • Generated a unique waterbody_id for each river and classified it as type river.
    • Retained relevant columns: waterbody_id, wb_type, name, name_fr, and length.
    • Transformed the data to CRS 4326 and exported it as rivers_lines.gpkg.
    • Sampled a point at the mid-point of each river segment, cast the result to points, and exported it as rivers_points.gpkg.
  4. Cleanup: Temporary files created during processing were deleted.
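
The filtering and point-generation logic for lakes and rivers can be sketched on toy geometries as shown below. The coordinates, the projected CRS (EPSG:3347), and the object names are assumptions made for this illustration; the perimeter calculation and the export steps are omitted, and centroids are computed in the projected CRS before reprojection as a simplification for this sketch.

```r
library(sf)

# Toy square lake with a given side length in metres (invented coordinates).
square <- function(x0, y0, side) {
  sf::st_polygon(list(rbind(
    c(x0, y0), c(x0 + side, y0), c(x0 + side, y0 + side),
    c(x0, y0 + side), c(x0, y0)
  )))
}

lakes <- sf::st_sf(
  name = c("Small", "Large"),
  geometry = sf::st_sfc(square(5e6, 2e6, 1000), square(5.1e6, 2e6, 3000), crs = 3347)
)
lakes$area <- as.numeric(sf::st_area(lakes)) / 1e6   # m^2 to km^2
lakes <- lakes[lakes$area > 5, ]                     # keep lakes larger than 5 km^2
lakes$waterbody_id <- seq_len(nrow(lakes))
lakes$wb_type <- "lake"
lakes_polygons <- sf::st_transform(lakes, 4326)
lakes_points <- sf::st_sf(
  waterbody_id = lakes$waterbody_id,
  geometry = sf::st_transform(sf::st_centroid(sf::st_geometry(lakes)), 4326)
)

rivers <- sf::st_sf(
  name = c("Short", "Long"),
  geometry = sf::st_sfc(
    sf::st_linestring(rbind(c(5e6, 2e6), c(5.004e6, 2e6))),     # 4 km
    sf::st_linestring(rbind(c(5e6, 2.1e6), c(5.02e6, 2.1e6))),  # 20 km
    crs = 3347
  )
)
rivers$length <- as.numeric(sf::st_length(rivers)) / 1000  # m to km
rivers <- rivers[rivers$length > 10, ]                     # keep segments longer than 10 km
rivers$waterbody_id <- seq_len(nrow(rivers))
rivers$wb_type <- "river"

# Sample the point at 50% of each line, then cast MULTIPOINT to POINT.
mid <- sf::st_line_sample(sf::st_geometry(rivers), sample = 0.5)
rivers_points <- sf::st_cast(mid, "POINT")
```

Lengths and areas are computed here in a projected CRS with metre units, so dividing by 1,000 or 1,000,000 gives kilometres and square kilometres directly.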

3.3.3 Processed Data Structure

The structure of the processed lake polygon dataset is as follows:

Table 3.3: Structure of the processed lake polygon dataset for Canada
Column_Name Description
waterbody_id Unique identifier for each lake feature.
wb_type Type of waterbody, always ‘lake’ for this dataset.
name Name of the lake in English.
name_fr Name of the lake in French.
area Surface area of the lake in square kilometers.
perimeter Perimeter of the lake in kilometers.

The structure of the processed lake point dataset is as follows:

Table 3.4: Structure of the processed lake point dataset for Canada
Column_Name Description
waterbody_id Unique identifier for each lake feature.
geom Centroid geometry of the lake in geographic coordinates (CRS 4326).

The structure of the processed river line dataset is as follows:

Table 3.5: Structure of the processed river line dataset for Canada
Column_Name Description
waterbody_id Unique identifier for each river segment.
wb_type Type of waterbody, always ‘river’ for this dataset.
name Name of the river in English.
name_fr Name of the river in French.
length Length of the river segment in kilometers.

The structure of the processed river point dataset is as follows:

Table 3.6: Structure of the processed river point dataset for Canada
Column_Name Description
waterbody_id Unique identifier for each river segment.
geom Sampled point geometry at the mid-point of the river.

3.4 Freshwater Fish Species Occurrences

3.4.1 Source and Description

The dataset, titled Global Freshwater Fish Species Occurrences Database, was retrieved from Figshare (Tedesco et al., 2017a). This comprehensive dataset compiles freshwater fish species occurrences aggregated at the drainage basin level. The data is associated with the data paper by Tedesco et al. (2017b) and is designed for analyzing global freshwater biodiversity patterns.

The dataset includes both spatial data (drainage basins) and tabular data (species occurrences). The drainage basin data covers 3,119 polygons worldwide, while the occurrence table provides over 110,000 records of species distributions across basins. The data has been validated using established taxonomic references such as FishBase.

3.4.2 Processing Steps

The harvested data was processed using the prc_freshwater_fish_occurrences script. The following steps were performed:

  1. Data Extraction:
    • The compressed dataset (freshwater_fish_occurrences.zip) was unzipped into a temporary directory.
  2. Drainage Basin Data Processing:
    • The shapefile Basin042017_3119.shp was loaded using the sf::st_read function.
    • Column names were cleaned and renamed to remove unnecessary prefixes using janitor::clean_names and dplyr::rename_with.
    • The cleaned data was exported as bassins.gpkg in GeoPackage format.
  3. Species Occurrence Data Processing:
    • The Occurrence_Table.csv file was loaded using the vroom::vroom function.
    • Column names were cleaned and renamed similarly to the drainage basin data.
    • Species names in the species_name_in_source and fishbase_valid_species_name columns were updated to replace periods (.) with spaces.
    • The processed data was exported as occurrences.csv.
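
The species-name clean-up can be sketched in base R as follows. The two rows are invented placeholders, and the actual script may use a different function for the replacement.

```r
# Placeholder records using the period-separated naming of the source table.
occurrences <- data.frame(
  basin_name = c("Basin A", "Basin B"),
  species_name_in_source = c("Genus.alpha", "Genus.beta"),
  fishbase_valid_species_name = c("Genus.alpha", "Genus.gamma")
)

name_cols <- c("species_name_in_source", "fishbase_valid_species_name")

# fixed = TRUE treats "." as a literal period rather than a regex wildcard.
occurrences[name_cols] <- lapply(
  occurrences[name_cols],
  function(x) gsub(".", " ", x, fixed = TRUE)
)
```

Without fixed = TRUE (or an escaped pattern such as "\\."), the period would match every character and blank out the whole name.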

3.4.3 Processed Data Structure

The structure of the processed drainage basin dataset (bassins.gpkg) is as follows:

Table 3.7: Structure of the processed drainage basin dataset
Column_Name Description
basin_name Name of the drainage basin.
country Country where the basin is located.
ecoregion Ecoregion associated with the basin.
endorheic Indicator of whether the basin is endorheic (closed basin).
out_longit Longitude of the basin outlet.
out_latit Latitude of the basin outlet.
med_longit Median longitude of the basin.
med_latit Median latitude of the basin.
surf_area Surface area of the basin in square kilometers.
geometry Polygon geometry of the basin.

The structure of the processed species occurrence dataset (occurrences.csv) is as follows:

Table 3.8: Structure of the processed species occurrence dataset
Column_Name Description
basin_name Name of the drainage basin.
species_name_in_source Species name as recorded in the source.
native_exotic_status Status of the species in the basin (native or exotic).
tsn_itis_code Taxonomic Serial Number (TSN) from ITIS.
fishbase_species_code Species code from FishBase.
fishbase_valid_species_name Valid species name from FishBase.
occurrence_status Occurrence status of the species in the basin (valid or invalid).

3.5 GBIF Species Occurrences

3.5.1 Source and Description

The dataset, titled GBIF Occurrence Data, was retrieved from the Global Biodiversity Information Facility (GBIF) (Global Biodiversity Information Facility, 2025). This dataset compiles species occurrence records for a specified taxonomic group and geographic region. The retrieved dataset includes spatial and tabular data for over 850,000 records of species occurrences across Canada, starting from 2000.

Note that the processing steps described below show how the data can be downloaded and processed programmatically. However, because of how GBIF manages data requests, we elected to store the downloaded data on our secure Google Cloud Storage and to retrieve it programmatically from there rather than directly from GBIF.

3.5.2 Processing Steps

The harvested data was processed using the dwn_gbif and prc_gbif scripts. The following steps were performed:

  1. Data Download:
    • A query for specified taxonomic groups and geographic boundaries was submitted to the GBIF API using the rgbif package.
  2. Data Extraction:
    • The compressed ZIP file was extracted into a temporary directory.
    • The occurrence data was read from occurrence.txt using the vroom::vroom function.
  3. Data Cleaning and Transformation:
    • Selected relevant columns: species, year, month, day, eventDate, decimalLatitude, decimalLongitude, and lifeStage.
    • Converted the cleaned data to a spatial data frame using the sf::st_as_sf function with geographic coordinates (CRS 4326).
  4. Export:
    • The cleaned and spatially enabled dataset was exported as species_occurrences_gbif.gpkg in GeoPackage format.
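
The column selection and spatial conversion can be sketched as follows. The records are invented placeholders, and dropping rows with missing coordinates is an assumption added here because sf::st_as_sf rejects missing coordinate values by default.

```r
library(sf)

# Placeholder records mimicking a few columns of GBIF's occurrence.txt.
gbif_raw <- data.frame(
  species = c("Genus alpha", "Genus beta", "Genus gamma"),
  year = c(2005, 2012, 2020),
  month = c(6, 7, 8),
  day = c(1, 15, 30),
  eventDate = c("2005-06-01", "2012-07-15", "2020-08-30"),
  decimalLatitude = c(45.5, 52.1, NA),
  decimalLongitude = c(-73.6, -106.7, -80.0),
  lifeStage = c(NA, "Adult", NA),
  recordedBy = c("a", "b", "c")   # example of a column that is not retained
)

keep <- c("species", "year", "month", "day", "eventDate",
          "decimalLatitude", "decimalLongitude", "lifeStage")
gbif <- gbif_raw[, keep]

# Assumption: records without coordinates are dropped before conversion.
gbif <- gbif[!is.na(gbif$decimalLatitude) & !is.na(gbif$decimalLongitude), ]

# Longitude is x and latitude is y, so the order of coords matters.
gbif_sf <- sf::st_as_sf(
  gbif,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

out_file <- tempfile(fileext = ".gpkg")
sf::st_write(gbif_sf, out_file, quiet = TRUE)
```

After conversion the two coordinate columns are replaced by a single point geometry column, which is why they do not appear in the processed data structure.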

3.5.3 Processed Data Structure

The structure of the processed dataset (species_occurrences_gbif.gpkg) is as follows:

Table 3.9: Structure of the processed GBIF occurrence dataset
Column_Name Description
species Scientific name of the species.
year Year of the recorded occurrence.
month Month of the recorded occurrence.
day Day of the recorded occurrence.
eventDate Full date and time of the recorded occurrence.
lifeStage Life stage of the species at the time of the occurrence (if available).
geometry Point geometry of the occurrence in geographic coordinates (CRS 4326).

3.6 National Hydro Network

3.6.1 Source and Description

The dataset, titled National Hydro Network (NHN) GeoBase Series, was retrieved from Natural Resources Canada (Canada, 2022c). This dataset provides a comprehensive geometric description and a set of basic attributes describing Canada’s inland surface waters. The data includes lakes, reservoirs, rivers, canals, drainage networks, and associated features.

3.6.2 Processing Steps

The NHN data was processed using the prc_national_hydro_network script. The following steps were performed:

  1. Data Extraction:
    • The compressed GeoPackage dataset (rhn_nhn_decoupage.gpkg.zip) was unzipped into a temporary directory.
  2. Watershed Data Processing:
    • The GeoPackage file (rhn_nhn_decoupage.gpkg) was read using the sf::st_read function.
    • All geometries were cast to polygons using sf::st_cast("GEOMETRYCOLLECTION") and sf::st_collection_extract("POLYGON").
    • The geometries were simplified using sf::st_simplify(dTolerance = 100) to reduce complexity while preserving the overall shape of each watershed.
    • A unique watershed_id was generated for each feature using the dplyr::mutate function.
    • Only the watershed_id and geometry columns were retained.
  3. Export:
    • The processed watershed data was exported as watersheds.gpkg in GeoPackage format.
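
The geometry handling above can be sketched on toy features as shown below. The coordinates, the projected CRS (EPSG:3347), and the helper functions ring and feature are assumptions made for this illustration; the real input is the NHN index layer.

```r
library(sf)

# Toy ring with one nearly collinear vertex (invented coordinates, metres).
ring <- function(x0, y0) {
  rbind(c(x0, y0), c(x0 + 500, y0 + 10), c(x0 + 1000, y0),
        c(x0 + 1000, y0 + 1000), c(x0, y0 + 1000), c(x0, y0))
}

# Each toy feature is a geometry collection holding a polygon and a line.
feature <- function(x0, y0) {
  sf::st_geometrycollection(list(
    sf::st_polygon(list(ring(x0, y0))),
    sf::st_linestring(rbind(c(x0, y0), c(x0 + 1000, y0 + 1000)))
  ))
}

raw <- sf::st_sf(
  dataset = c("a", "b"),
  geometry = sf::st_sfc(feature(5e6, 2e6), feature(5.1e6, 2e6), crs = 3347)
)

# Harmonise to geometry collections, then keep only the polygon parts.
polys <- sf::st_cast(raw, "GEOMETRYCOLLECTION")
polys <- sf::st_collection_extract(polys, "POLYGON")

# Drop vertices that deviate less than the tolerance from the outline.
simplified <- sf::st_simplify(polys, dTolerance = 100)

watersheds <- dplyr::mutate(simplified, watershed_id = dplyr::row_number())
watersheds <- dplyr::select(watersheds, watershed_id)  # geometry is kept automatically
```

With sf objects, dplyr::select keeps the geometry column even when it is not named, so selecting watershed_id alone yields the two-column structure described below.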

3.6.3 Processed Data Structure

The structure of the processed dataset (watersheds.gpkg) is as follows:

Table 3.10: Structure of the processed watershed dataset
Column_Name Description
watershed_id Unique identifier for each watershed feature.
geom Polygon geometry of the watershed in geographic coordinates (CRS 4326).

3.7 Ontario Freshwater Fishes Life History Database

The Ontario Freshwater Fishes Life History Database (Eakins, 2024) provides comprehensive life history information for 161 freshwater fish species in Ontario, Canada. The dataset includes 43 characteristics per species, covering taxonomic, ecological, reproductive, and habitat-related information. It also provides a bibliography of references used to compile this data.

3.7.1 Source and Accessibility

  • Source: Ontario Freshwater Fishes Life History Database
  • Accessibility: Open Government Licence - Canada
  • Data Type: HTML Scraped Dataset
  • Coverage: Freshwater systems in Ontario, Canada
  • Geographic Coverage: Ontario (Bounding Box: -95.1539, 41.6770, -74.3435, 56.8595)
  • Temporal Coverage: 2024
  • Processing Script: dwn_ontario_freshwater_fishes_life_history.R
  • Output Files: ontario_fishes_characteristics.csv, ontario_fishes_references.csv

3.7.2 Processing Steps

Data was harvested from the website by scraping the individual fish detail pages for 161 species. The harvesting script used the rvest package in R to extract structured data for each species through the following steps:

  1. Characteristics Extraction: Scraped 43 life history attributes such as family, common name, habitat preference, spawning season, thermal regime, and abundance.
  2. References Compilation: Extracted all bibliographic references associated with each species.
  3. Data Cleaning: Cleaned and pivoted the raw data into a tidy format using tidyr::pivot_wider() and janitor::clean_names().
  4. Export: Saved the processed data into two CSV files:
    • ontario_fishes_characteristics.csv: A structured table of life history characteristics.
    • ontario_fishes_references.csv: A table of bibliographic references by species.
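
The pivoting step can be sketched as follows, assuming the scraper first yields one row per species and attribute. That long layout, the column names attribute and value, and the placeholder values are assumptions made for this illustration.

```r
library(tidyr)

# Assumed shape of the scraped data: one row per species and attribute.
scraped <- data.frame(
  species = rep(c("Genus alpha", "Genus beta"), each = 3),
  attribute = rep(c("Family", "Thermal Regime", "Spawning Season"), times = 2),
  value = c("Family A", "coldwater", "spring",
            "Family B", "warmwater", "summer")
)

# One row per species, one column per attribute.
characteristics <- tidyr::pivot_wider(
  scraped,
  names_from = attribute,
  values_from = value
)

# "Thermal Regime" becomes thermal_regime, and so on.
characteristics <- janitor::clean_names(characteristics)
```

Pivoting before cleaning the names keeps the attribute labels exactly as scraped until the final table is assembled.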

3.7.3 Processed Data Structure

The structure of the processed data includes the following key variables:

Table 3.11: Structure of the Ontario Fishes Characteristics Table
Field Description
species_name Scientific name of the species.
family Taxonomic family, including common name.
species Species name, including genus.
taxonomic_authority Author of the taxonomic classification.
common_name_s Common name in English.
french_name Common name in French.
ontario_origin Origin status in Ontario (e.g., native, introduced).
general_abundance General abundance in freshwater systems.
thermal_regime Thermal regime preference (coldwater, coolwater, warmwater).
habitat_preference Preferred habitats for the species.
spawning_season Season of spawning activity.
fecundity Number of eggs produced by the species.
adult_length_cm Length of adult individuals in centimeters.
maximum_length_cm Maximum recorded length of the species in centimeters.
lifespan_yrs Expected lifespan in years.

Table 3.12: Structure of the Ontario Fishes References Table
Field Description
species Scientific name of the species associated with the reference.
reference Reference citation for the data.

3.8 FishPass Database

The FishPass Database (Benoit et al., 2023) provides a comprehensive collection of biological attributes that influence fish movement and passage, with a particular focus on species from the Laurentian Great Lakes. This dataset is essential for designing selective fish passage systems that aim to balance connectivity while managing the spread of invasive species. It includes 21 biological attributes, covering phenology, morphology, physiology, and behavioral characteristics for 220 fish species. Data coverage varies across species and attributes, highlighting gaps in knowledge about behavioral traits and potential invasive species.

3.8.1 Source and Accessibility

  • Source: Dryad Repository
  • DOI: 10.5061/dryad.fqz612jwj
  • Accessibility: CC0 Public Domain Dedication
  • Data Type: Tabular CSV files
  • Coverage: Laurentian Great Lakes, North America
  • Processing Script: prc_fishpass.R
  • Output Files:
    • fishpass_behaviour.csv
    • fishpass_morphology.csv
    • fishpass_phenology.csv
    • fishpass_physiology.csv

3.8.2 Processing Steps

  1. Data Cleaning:
    • Column names were standardized using janitor::clean_names().
    • Missing or malformed data were handled appropriately.
  2. Data Structuring:
    • Each CSV file was parsed into tidy data tables, retaining only relevant fields.
    • Attributes such as migratory status, body morphology, spawning season, and vertical stationing were included.
  3. Export:
    • Cleaned data tables were saved into separate CSV files for each attribute dimension, enabling modular analysis.

3.8.3 Processed Data Structure

The cleaned dataset is divided into four tables, each containing data for 220 species:

Table 3.13: Structure of the FishPass Behaviour Data Table
Field Description
order Taxonomic order of the species.
family Taxonomic family of the species.
genus Genus of the species.
scientific_name Scientific name of the species.
common_name Common name of the species.
vertical_station Vertical stationing behavior (e.g., demersal, pelagic).
schooling_behaviour Tendency for schooling behavior (e.g., schooling, non-schooling).
reference_vs Reference for vertical station data.
reference_sb Reference for schooling behavior data.

Table 3.14: Structure of the FishPass Morphology Data Table
Field Description
order Taxonomic order of the species.
family Taxonomic family of the species.
genus Genus of the species.
scientific_name Scientific name of the species.
common_name Common name of the species.
maximum_total_length_cm Maximum recorded total length (cm).
body_shape Overall body shape (e.g., fusiform, elongated).
aspect_ratio Aspect ratio of the fins.
eye_size_percent_hl Eye size as a percentage of head length.
reference_mtl Reference for maximum total length data.

Table 3.15: Structure of the FishPass Phenology Data Table
Field Description
order Taxonomic order of the species.
family Taxonomic family of the species.
genus Genus of the species.
scientific_name Scientific name of the species.
common_name Common name of the species.
migratory_status Whether the species is migratory or non-migratory.
spatial_scale_of_movement Scale of spatial movement (e.g., diadromous, potamodromous).
spawning_frequency Frequency of spawning (e.g., iteroparous).
spring_spawner Indicates if the species spawns in spring (binary).
reference_sf Reference for spawning frequency data.

Table 3.16: Structure of the FishPass Physiology Data Table
Field Description
order Taxonomic order of the species.
family Taxonomic family of the species.
genus Genus of the species.
scientific_name Scientific name of the species.
common_name Common name of the species.
climbing_ability Ability to climb barriers (binary).
hearing_specialization Hearing specialization (binary).
trophic_level Trophic level of the species.
presence_of_ampullary_electroreceptors Presence of ampullary electroreceptors (binary).
reference_ca Reference for climbing ability data.

3.9 North American Freshwater Migratory Fish Database (NAFMFD)

The North American Freshwater Migratory Fish Database (NAFMFD) (Dean et al., 2021, 2022) synthesizes comprehensive data on the migratory behavior of freshwater fishes across Canada, the United States, and Mexico. It includes information for 1,241 species spanning 79 families and 322 genera, characterizing migratory status, patterns, and behaviors.

3.9.1 Source and Accessibility

  • Source: U.S. Geological Survey ScienceBase-Catalog
  • DOI: 10.5066/P9WDLLP0
  • Accessibility: Public Domain
  • Data Type: Excel Spreadsheet and Metadata (XML)
  • Coverage: North America
  • Processing Script: prc_north_american_freshwater_migratory_fish_database.R
  • Output Files:
    • north_american_freshwater_migratory_fish_database.csv

3.9.2 Processing Steps

The data was harvested from the ScienceBase-Catalog using two files: an Excel spreadsheet containing species information and a metadata XML file. The processing involved the following steps:

  1. Data Extraction:
    • The Excel file was read using the readxl package.
    • Column names were standardized using janitor::clean_names() for consistency.
  2. Data Cleaning:
    • Removed unnecessary whitespace and formatted data fields for analysis.
  3. Export:
    • The processed data was saved as a CSV file for ease of use in further analyses.
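
The whitespace clean-up can be sketched in base R as follows. The rows are invented placeholders, the commented readxl call uses a hypothetical path, and collapsing repeated internal spaces is an assumption about what "unnecessary whitespace" covers.

```r
# With the real data (hypothetical path):
# raw <- janitor::clean_names(readxl::read_excel("path/to/spreadsheet.xlsx"))

# Placeholder rows with stray whitespace.
raw <- data.frame(
  scientific_name = c("  Genus alpha", "Genus  beta "),
  common_name = c("Alpha fish ", " Beta fish"),
  migratory = c(1, 0)
)

# Trim both ends and collapse runs of whitespace into a single space.
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Apply the clean-up to character columns only.
is_text <- vapply(raw, is.character, logical(1))
raw[is_text] <- lapply(raw[is_text], squish)

write.csv(
  raw,
  file.path(tempdir(), "north_american_freshwater_migratory_fish_database.csv"),
  row.names = FALSE
)
```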

3.9.3 Processed Data Structure

The final processed dataset contains 28 fields and 2,198 rows. Key variables include species taxonomy, migratory behavior, and references for data sources.

Table 3.17: Structure of the NAFMFD Processed Data
Field Description
itis_family Integrated Taxonomic Information System (ITIS) identifier for the family.
family_name Taxonomic family name of the species.
scientific_name Scientific name of the species.
common_name Common name of the species.
migratory Indicator of whether the species is migratory (1 = yes, 0 = no).
anadromous Indicator of anadromous migratory behavior (1 = yes, 0 = no).
catadromous Indicator of catadromous migratory behavior (1 = yes, 0 = no).
potamodromous Indicator of potamodromous migratory behavior (1 = yes, 0 = no).
diadromous Indicator of diadromous migratory behavior (1 = yes, 0 = no).
suspected_migrant Indicator of suspected migratory status (1 = yes, 0 = no).
non_migratory Indicator of non-migratory status (1 = yes, 0 = no).
reference Reference for data source and assignment.

3.10 Roberge et al. (2002)

The Roberge et al. (2002) dataset (Roberge et al., 2002) summarizes associations between stream habitat characteristics and life history stages of 86 species and 13 additional subspecies/forms of freshwater fishes in British Columbia and Yukon. The dataset focuses on stream habitat requirements across four life stages: spawning, young-of-the-year, juvenile, and adult. It highlights significant gaps in knowledge for species groups such as green sturgeon, minnows, smelts, ciscos, suckers, sculpins, lampreys, and sticklebacks, emphasizing the need for further research on stream habitat requirements for these taxa.

3.10.1 Source and Accessibility

  • Source: Fisheries and Oceans Canada, ScienceBase GCS
  • Accessibility: Institutional License
  • Data Type: Tabular Text File (.txt), PDF Report (.pdf)
  • Coverage: British Columbia and Yukon, Canada
  • Processing Script: prc_roberge_2002.R
  • Output File: roberge.csv

3.10.2 Processing Steps

The raw data (Roberge2002.txt) was extracted manually from the report (Fs97-4-2611E.pdf). The programmatic processing steps were:

  1. Data Extraction:
    • Imported the tabular text file with read.delim().
    • Filled missing values for habitat and life stage fields using tidyr::fill().
  2. Data Restructuring:
    • Split the data by species groups.
    • Renamed columns to reflect meaningful variable names (e.g., LifeStage, Value).
    • Pivoted columns for different life stages (e.g., spawning, juvenile) into a long format using tidyr::pivot_longer().
  3. Data Cleaning:
    • Replaced malformed characters and standardized names using stringr functions.
    • Filtered rows with missing or invalid values.
  4. Export:
    • Saved the processed dataset as a CSV file (roberge.csv) for further analysis.
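
The fill and pivot steps can be sketched as follows. The table layout, the habitat codes, and the species names are invented placeholders; only the column names characteristics, LifeStage, and Value follow the processed data structure.

```r
library(tidyr)

# Placeholder table: the habitat characteristic is only written on the first
# row of each block, as often happens in tables transcribed from a report.
raw <- data.frame(
  name = "Alpha fish",
  scientificname = "Genus alpha",
  characteristics = c("depth", NA, "flow type", NA),
  Spawning = c("H", NA, "M", "L"),
  Juvenile = c(NA, "M", "H", NA)
)

# Carry the last non-missing characteristic downwards.
filled <- tidyr::fill(raw, characteristics)

# One row per species, characteristic, and life stage; rows without a value
# are dropped.
long <- tidyr::pivot_longer(
  filled,
  cols = c(Spawning, Juvenile),
  names_to = "LifeStage",
  values_to = "Value",
  values_drop_na = TRUE
)
```

Filling must happen before pivoting; otherwise the rows with a blank characteristic would lose their link to the habitat variable they describe.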

3.10.3 Processed Data Structure

The processed dataset contains 5,319 rows and six fields, summarizing habitat characteristics across different life stages for freshwater fish species.

Table 3.18: Structure of the Roberge Processed Data
Field Description
name Common name of the species.
scientificname Scientific name of the species.
MigrationStrategy Migration strategy (e.g., anadromous, potamodromous).
characteristics Stream habitat characteristics (e.g., depth, flow type).
LifeStage Life stage (e.g., Spawning, Juvenile).
Value Habitat value associated with the life stage.

3.11 Dahlke et al. (2020)

The Dahlke et al. (2020) dataset (Dahlke et al., 2020b) compiles experimental and imputed thermal tolerance data, thermal safety margins, and responsiveness for various fish species and life stages. It supports research on fish survival under warming conditions and size-dependent oxygen supply constraints (Dahlke et al., 2020a).

3.11.1 Source and Accessibility

  • Source: PANGAEA - Data Publisher for Earth & Environmental Science
  • DOI: 10.1594/PANGAEA.917796
  • Accessibility: CC BY 4.0
  • Data Type: Excel Spreadsheets
  • Coverage: Global
  • Processing Script: prc_dahlke_2020.R
  • Output Files:
    • experimental_imputed_tolerance.csv
    • thermal_safety_margins.csv
    • thermal_responsiveness.csv
    • thermal_tolerance.csv

3.11.2 Processing Steps

The dataset was harvested and processed from four Excel files hosted on the PANGAEA repository. Each file corresponds to a specific thermal-related aspect, and the processing involved the following steps:

  1. Loading Data:
    • Read data from Excel files (experimental_and_imputed_tolerance_data.xlsx, thermal_safety_margins.xlsx, thermal_responsiveness.xlsx, and thermal_tolerance.xlsx) using readxl.
  2. Cleaning and Restructuring:
    • Standardized column names using janitor::clean_names().
    • Pivoted wide-format sheets into long-format tables for easier integration.
    • Filled missing values and cleaned malformed characters.
  3. Data Integration:
    • Combined multiple sheets within a single Excel file to create unified datasets.
    • Merged species-specific attributes across life stages and realms.
  4. Exporting Processed Data:
    • Saved cleaned datasets as CSV files using vroom::vroom_write for further analysis.

3.11.3 Processed Data Structure

Table 3.19: Structure of Experimental and Imputed Tolerance Data
Field Description
life_stage Life stage of the species (e.g., Spawners, Juvenile).
species Species name following FishBase taxonomy.
tmin_c Minimum temperature (°C) tolerance.
tmax_c Maximum temperature (°C) tolerance.
tmid_c Midpoint temperature (°C) between Tmin and Tmax.
trange_c Range of temperature tolerance (°C).
reference Citation for the data source.

Table 3.20: Structure of Thermal Responsiveness Data
Field Description
species Species name following FishBase taxonomy.
lifestage Life stage associated with the thermal response (e.g., Embryos, Juveniles).
response Type of response (e.g., Development Rate, Survival).
trange_c Range of temperature (°C) associated with the response.
tmid_c Midpoint temperature (°C) for the response.
reference Citation for the data source.

Table 3.21: Structure of Thermal Safety Margins Data
Field Description
species Species name following FishBase taxonomy.
realm Habitat realm of the species (e.g., Freshwater, Marine).
depth_spawners Spawning depth range in meters.
depths_embryos Embryo depth range in meters.
spawning_season Seasonality of spawning (e.g., MJJ for May–July).

Table 3.22: Structure of Thermal Tolerance Data
Field Description
thermal_tolerance Thermal tolerance parameter (e.g., Tmax, Tmin).
species_fish_base Species name following FishBase taxonomy.
latitude Geographic latitude of observation.
lifestage Life stage of the species (e.g., Spawner, Embryo).
tmax_c Maximum temperature (°C) tolerance.
tmin_c Minimum temperature (°C) tolerance.
trange_c Range of temperature tolerance (°C).
realm Habitat realm of the species (e.g., Marine, Freshwater).

3.12 FishBase

The FishBase dataset (Froese and Pauly, 2024) provides a comprehensive database of global fish species, covering taxonomic, ecological, and biological attributes. Using the rfishbase R package (Boettiger et al., 2012), this dataset was retrieved via API queries and includes species information, habitat details, growth patterns, diet, reproduction, and distribution data.

3.12.1 Source and Accessibility

  • Source: FishBase API via rfishbase
  • URL: https://www.fishbase.org
  • Accessibility: FishBase Terms of Use
  • Data Type: API Queries
  • Coverage: Global
  • Processing Script: download_fishbase.R
  • Output Files:
    • species.csv
    • fecundity.csv
    • reproduc.csv
    • eggdev.csv
    • larvdyn.csv
    • fooditems.csv
    • ecology.csv
    • swimming.csv
    • spawning_traits.csv
    • spawning_phenology.csv
    • larvae_traits.csv
    • larvae_phenology.csv
    • larvaepresence_phenology.csv

3.12.2 Data retrieval

FishBase provides over 200 tables, all of which were briefly screened for relevance in the context of the current project. The final list of tables considered is given in the following table:

Table Name Description
ecology Includes ecological traits such as habitat preference, migratory behavior, and feeding mode.
eggdev Details on egg development stages and environmental influences.
fecundity Contains information on fish fecundity, including egg production and reproductive output.
fooditems Provides detailed information on specific food items consumed by fish.
larvae Contains larval traits and early life stage characteristics.
larvaepresence Information on the presence of larvae in different locations.
larvdyn Provides data on larval dynamics, including movement and growth.
reproduc Provides reproductive characteristics, including spawning type and parental care.
spawning Details spawning behavior, including seasonality, habitat, and locations.
species Main table containing taxonomic and biological information on fish species.
swimming Provides information on fish swimming capabilities, including speed and behavior.

3.12.3 Processing Steps

The dataset was harvested using API queries to FishBase via rfishbase. The processing steps included, but were not limited to:

  1. Fetching Data:
    • Extracted tables covering species, fecundity, reproduction, egg development, larval dynamics, food items, ecology, and swimming behavior.
    • Queried FishBase for spawning, larval phenology, and larval presence phenology.
  2. Data Cleaning & Standardization:
    • Standardized column names with janitor::clean_names().
    • Filtered species based on the reference list of freshwater fish species in Canada.
    • Removed redundant columns and transformed categorical variables for consistency.
  3. Data Structuring:
    • Merged multiple sources of reproductive data to create a unified reproduc.csv file.
    • Created long-format data for spawning and larval phenology to facilitate time-series analyses.
    • Consolidated habitat data for ecological profiling.
  4. Exporting Processed Data:
    • Saved all cleaned tables as CSV files for further analyses.
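The steps above can be sketched as a single helper. This is a minimal illustration under stated assumptions, not the contents of download_fishbase.R: the function name process_fb_table, the reference vector ref_codes, and the toy input are hypothetical. In practice the input would come from an rfishbase call such as rfishbase::fb_tbl("swimming").

```r
library(dplyr)

# Hypothetical helper: standardize names, keep reference species, tag, export.
process_fb_table <- function(raw, table_name, ref_codes, out_dir = tempdir()) {
  out <- raw |>
    janitor::clean_names() |>             # e.g. SpecCode -> spec_code
    filter(spec_code %in% ref_codes) |>   # keep species on the reference list
    distinct() |>                         # drop exact duplicate rows
    mutate(fb_table = table_name)         # record the FishBase source table
  path <- file.path(out_dir, paste0(table_name, ".csv"))
  write.csv(out, path, row.names = FALSE)
  invisible(out)
}

# Toy stand-in for a FishBase table (values invented for illustration).
raw <- data.frame(
  SpecCode  = c(2, 2, 5, 9),
  AdultType = c("body/caudal fin", "body/caudal fin", "median/paired fins", NA)
)
swimming <- process_fb_table(raw, "swimming", ref_codes = c(2, 5))
```

Passing the reference list as an argument keeps the same helper usable for every table, so the Canadian freshwater filter is applied in one place.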

3.12.4 Individual Tables

3.12.4.1 Ecology (ecology)

3.12.4.1.1 Explanation of Data Content

The Ecology Table in FishBase provides information on species habitat preferences, depth range, trophic levels, and associations with different aquatic environments.

  • Trophic Levels (diet_troph, food_troph):
    • diet_troph: The trophic level of the species based on its diet.
    • food_troph: The trophic level of the species based on observed food intake.
  • Habitat Zones:
    • Marine and Coastal Zones: Includes neritic, supra_littoral_zone, saltmarshes, littoral_zone, tide_pools, intertidal, sub_littoral, oceanic, epipelagic, mesopelagic, bathypelagic, abyssopelagic, and hadopelagic.
    • Freshwater and Transitional Habitats: Includes estuaries, mangroves, marshes_swamps, stream, lakes, and caves.
  • Substrate Preferences:
    • Different substrate types are recorded, such as soft_bottom, sand, coarse, fine, level, sloping, silt, mud, ooze, detritus, organic, hard_bottom, rocky, rubble, gravel, vegetation, and driftwood.

These fields allow researchers to analyze species distributions across environments, their interactions within ecosystems, and their adaptability to changing conditions.

3.12.4.1.2 Processing

The ecology.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Chose species-related ecological fields, including habitat zones, trophic levels, and substrate preferences.
  2. Data Transformation:
    • Converted missing values (NA) to 0 where appropriate for habitat presence/absence.
    • Replaced -1 values with 1 to indicate presence in binary habitat fields.
    • Applied dplyr::mutate() to ensure correct encoding of categorical variables.
  3. Filtering & Summarization:
    • Removed rows where all habitat presence indicators were 0, since these entries carry no habitat information.
    • Used dplyr::group_by(spec_code) and dplyr::summarise() to compute mean trophic levels and aggregate habitat presence across multiple observations.
  4. Exporting Cleaned Data:
    • The final dataset was written to ecology.csv for downstream analyses.
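A minimal sketch of the recoding and aggregation described above, using two habitat columns and invented values. Two points are assumptions rather than documented behaviour: all-zero rows are treated as rows to drop, and habitat presence is aggregated with max(), so a species is marked present if any of its records reports presence.

```r
library(dplyr)

# Toy records; column names follow the processed table, values are invented.
eco <- data.frame(
  spec_code  = c(1, 1, 2, 3),
  diet_troph = c(3.0, 3.4, NA, 2.1),
  stream     = c(-1, NA, NA, 0),
  lakes      = c(NA, -1, NA, -1)
)
habitat_cols <- c("stream", "lakes")

recode_flag <- function(x) {
  x[is.na(x)] <- 0   # missing -> absent
  x[x == -1]  <- 1   # -1 flag -> present
  x
}

ecology <- eco |>
  mutate(across(all_of(habitat_cols), recode_flag)) |>
  # keep only records reporting at least one habitat
  filter(if_any(all_of(habitat_cols), ~ .x == 1)) |>
  group_by(spec_code) |>
  summarise(
    diet_troph = mean(diet_troph, na.rm = TRUE),   # mean trophic level
    across(all_of(habitat_cols), max),             # present in any record
    .groups = "drop"
  )
```

Here species 2 is dropped because none of its habitat fields is set, and species 1 is collapsed from two records into one.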

3.12.4.1.3 Processed Data Structure

Field Description
species_id Unique species identifier
diet_troph Trophic level based on diet
food_troph Trophic level based on food intake
neritic Presence in neritic zones (binary)
supra_littoral_zone Presence in supra-littoral zones (binary)
saltmarshes Presence in saltmarshes (binary)
littoral_zone Presence in littoral zones (binary)
tide_pools Presence in tide pools (binary)
intertidal Presence in intertidal zones (binary)
sub_littoral Presence in sub-littoral zones (binary)
caves Presence in caves (binary)
oceanic Presence in oceanic zones (binary)
epipelagic Presence in epipelagic zones (binary)
mesopelagic Presence in mesopelagic zones (binary)
bathypelagic Presence in bathypelagic zones (binary)
abyssopelagic Presence in abyssopelagic zones (binary)
hadopelagic Presence in hadopelagic zones (binary)
estuaries Presence in estuaries (binary)
mangroves Presence in mangroves (binary)
marshes_swamps Presence in marshes and swamps (binary)
cave_anchialine Presence in anchialine caves (binary)
stream Presence in streams (binary)
lakes Presence in lakes (binary)
cave Presence in caves (binary)
cave2 Additional cave habitat information (binary)
soft_bottom Preference for soft bottom substrates (binary)
sand Preference for sand substrates (binary)
coarse Preference for coarse substrates (binary)
fine Preference for fine substrates (binary)
level Preference for level substrates (binary)
sloping Preference for sloping substrates (binary)
silt Preference for silt substrates (binary)
mud Preference for mud substrates (binary)
ooze Preference for ooze substrates (binary)
detritus Presence of detritus (binary)
organic Presence of organic matter (binary)
hard_bottom Preference for hard bottom substrates (binary)
rocky Preference for rocky habitats (binary)
rubble Preference for rubble habitats (binary)
gravel Preference for gravel habitats (binary)
vegetation Presence in vegetated habitats (binary)
driftwood Presence near driftwood (binary)
fb_table FishBase source table indicator

3.12.4.2 Egg Development (eggdev)

3.12.4.2.1 Explanation of Data Content

The Egg Development Table in FishBase provides details on the environmental conditions required for fish eggs to develop.

  • Temperature Requirements (temperature):
    • Represents the mean recorded temperature at which eggs develop.
    • Understanding thermal tolerance at the egg stage is essential for studying species adaptations to different climatic conditions.
  • Salinity Preferences (freshwater, brackish):
    • Freshwater (freshwater): Indicates whether a species’ eggs develop in freshwater conditions.
    • Brackish (brackish): Indicates whether eggs can develop in brackish environments.
    • These values help identify species that can tolerate a range of salinities, which is useful for habitat conservation and management.

3.12.4.2.2 Processing

The eggdev.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted species-related egg development fields, including temperature and salinity preferences.
  2. Data Transformation:
    • Computed the mean temperature per species using dplyr::group_by() and dplyr::summarise().
    • Converted missing values (NA) to 0 where appropriate for salinity preferences.
    • Restructured the salinity data using tidyr::pivot_wider() to create binary presence/absence indicators for freshwater and brackish environments.
  3. Filtering & Summarization:
    • Removed species with no recorded temperature or salinity values.
    • Aggregated multiple records per species to avoid redundancy.
  4. Exporting Cleaned Data:
    • The final dataset was written to eggdev.csv for further analyses.
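The reshaping step can be illustrated as follows. This is a sketch on invented data: the column name water_type, holding one salinity category per record, is an assumption about how the source table stores salinity, not a confirmed FishBase field name.

```r
library(dplyr)
library(tidyr)

# Toy records; `water_type` is a hypothetical salinity category column.
egg <- data.frame(
  spec_code   = c(1, 1, 2, 3, 4),
  temperature = c(8, 12, NA, 15, NA),
  water_type  = c("freshwater", "brackish", "freshwater", NA, NA)
)

# Mean development temperature per species (NA if never reported).
temp <- egg |>
  group_by(spec_code) |>
  summarise(
    temperature = if (all(is.na(temperature))) NA_real_
                  else mean(temperature, na.rm = TRUE),
    .groups = "drop"
  )

# One binary column per salinity category.
salinity <- egg |>
  filter(!is.na(water_type)) |>
  distinct(spec_code, water_type) |>
  mutate(present = 1) |>
  pivot_wider(names_from = water_type, values_from = present, values_fill = 0)

eggdev <- full_join(temp, salinity, by = "spec_code") |>
  mutate(across(c(freshwater, brackish), ~ coalesce(.x, 0))) |>
  # drop species with neither a temperature nor a salinity record
  filter(!is.na(temperature) | freshwater == 1 | brackish == 1)
```

The explicit all(is.na()) check avoids the NaN that mean(..., na.rm = TRUE) returns when every value in a group is missing.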

3.12.4.2.3 Processed Data Structure

Field Description
species_id Unique species identifier
temperature Mean recorded temperature during egg development
freshwater Presence in freshwater environments (binary)
brackish Presence in brackish environments (binary)
fb_table FishBase source table indicator

3.12.4.3 Fecundity (fecundity)

3.12.4.3.1 Explanation of Data Content

The Fecundity Table in FishBase provides information on the reproductive capacity of fish species, specifically focusing on the number of eggs produced per spawning event.

  • Fecundity Estimates (fecundity_min, fecundity_max):
    • Represents the estimated range of eggs produced per spawning event.
    • fecundity_min: The lowest recorded number of eggs produced.
    • fecundity_max: The highest recorded number of eggs produced.
    • These values help assess species reproductive potential and variability in spawning output.
  • Fecundity Factors:
    • While not included in this dataset, fecundity in FishBase is often linked to factors like body size, reproductive strategy, and environmental conditions.

3.12.4.3.2 Processing

The fecundity.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted species-related fecundity fields, including minimum and maximum fecundity estimates.
  2. Data Transformation:
    • Computed the mean fecundity values for species with multiple records using dplyr::group_by() and dplyr::summarise().
    • Ensured missing values (NA) were excluded from the mean calculations.
  3. Filtering & Summarization:
    • Removed species with no available fecundity data to retain only informative records.
  4. Exporting Cleaned Data:
    • The final dataset was written to fecundity.csv for further analyses.

3.12.4.3.3 Processed Data Structure

Field Description
species_id Unique species identifier
fecundity_min Mean minimum recorded fecundity
fecundity_max Mean maximum recorded fecundity
fb_table FishBase source table indicator

3.12.4.4 Food Items (fooditems)

3.12.4.4.1 Explanation of Data Content

The Food Items Table in FishBase documents the dietary composition of fish species, recording the type of prey consumed and the life stages at which it is eaten. This dataset helps in understanding feeding ecology, trophic interactions, and the role of fish species in food webs.

  • Primary, Secondary, and Tertiary Food Items (food_i, food_ii, food_iii):
    • food_i: Broad food category such as detritus, nekton, or plankton.
    • food_ii: More specific classification within food_i (e.g., finfish under nekton, cephalopods under mollusks).
    • food_iii: The most specific food category, detailing exact prey type (e.g., squids/cuttlefish, bony fish, carcasses).
  • Prey and Predator Life Stages (prey_stage, predator_stage):
    • Specifies the developmental stages of both prey and predator (e.g., larval, juvenile, adult).
    • Indicates at what stage of life a fish species consumes specific types of prey.

3.12.4.4.2 Processing

The fooditems.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted dietary data related to fish feeding habits, including primary (food_i), secondary (food_ii), and tertiary (food_iii) food items.
    • Included details on prey and predator life stages (prey_stage, predator_stage).
    • Retained geospatial fields (country, longitude, latitude) where available.
  2. Data Transformation:
    • Standardized food category names for consistency.
    • Ensured missing values (NA) were preserved where information was unavailable.
  3. Geospatial Information Handling:
    • Included location-based data where available to associate feeding habits with geographic distribution.
    • Maintained missing values for species with unreported feeding locations.
  4. Exporting Cleaned Data:
    • The final dataset was written to fooditems.csv for further analyses.

3.12.4.4.3 Processed Data Structure

Field Description
species_id Unique species identifier
country Country where the food record was observed
longitude Longitude of the food observation
latitude Latitude of the food observation
food_i Primary food category (e.g., detritus, nekton, plankton)
food_ii Secondary food category (e.g., cephalopods, finfish)
food_iii Tertiary food category (e.g., squids/cuttlefish, bony fish)
prey_stage Life stage of the prey (e.g., larvae, juvenile, adult)
predator_stage Life stage of the predator (e.g., larvae, juvenile, adult)
fb_table FishBase source table indicator

3.12.4.5 Larvae (larvae)

3.12.4.5.1 Explanation of Data Content

The Larvae Table in FishBase contains detailed information about the early life stages of fish species, including their developmental traits and seasonal presence.

  • Larval Traits (larvae_traits):
    • Place of Development (placeof_development): Describes where larvae develop (e.g., planktonic, in a closed nest, on the substrate).
    • Larval Duration (larval_duration_min, larval_duration_max, larval_duration_mod): Duration of the larval stage in days, with minimum, maximum, and modal estimates.
    • Shape of Yolk Sac (shapeofyolksac): Describes the yolk sac morphology, which can influence larval survival.
    • Body Form (body_form): Indicates whether larvae are fusiform, elongated, or have another body shape.
  • Larval Phenology (larvae_phenology):
    • Locality (locality): Provides the geographic region where larval presence was recorded.
    • Monthly Presence (jan-dec): Indicates the months in which larvae were observed (1 = present, 0 = absent).
    • This data is essential for tracking seasonal spawning patterns and larval dispersal dynamics.

3.12.4.5.2 Processing

The larvae_traits.csv and larvae_phenology.csv files were generated using the following processing steps:

  1. Selection of Relevant Fields:
    • For larvae_traits.csv: Selected species-related developmental traits, including duration, yolk sac shape, and body form.
    • For larvae_phenology.csv: Selected locality and monthly larval presence indicators.
  2. Data Transformation:
    • Standardized categorical variables such as body_form to ensure consistency (e.g., converting “eel-like” to “elongated”).
    • Converted missing values (NA) to 0 for monthly larval presence data.
    • Used dplyr::mutate() to standardize habitat descriptors.
  3. Filtering & Summarization:
    • Removed entries where no larval traits were recorded.
    • Ensured only species with confirmed larval presence were retained in larvae_phenology.csv.
  4. Exporting Cleaned Data:
    • The final datasets were written to larvae_traits.csv and larvae_phenology.csv for further analyses.
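As an illustration of the categorical standardization and filtering above, the sketch below recodes body_form and drops rows with no recorded trait. The input values, and the choice to lowercase and trim before recoding, are assumptions made for the example; only the "eel-like" to "elongated" mapping comes from the text.

```r
library(dplyr)

# Toy records with inconsistent spelling and a fully empty row (invented).
lt <- data.frame(
  spec_code           = c(1, 2, 3),
  body_form           = c("Eel-like ", "fusiform", NA),
  larval_duration_min = c(20, NA, NA)
)
trait_cols <- c("body_form", "larval_duration_min")

larvae_traits <- lt |>
  mutate(
    body_form = tolower(trimws(body_form)),   # normalize case and whitespace
    body_form = if_else(body_form == "eel-like", "elongated", body_form)
  ) |>
  # keep rows where at least one trait is recorded
  filter(if_any(all_of(trait_cols), ~ !is.na(.x))) |>
  mutate(fb_table = "larvae")
```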

3.12.4.5.3 Processed Data Structure

3.12.4.5.3.1 Larval Traits (larvae_traits.csv)

Field Description
species_id Unique species identifier
placeof_development Location of larval development (e.g., planktonic)
larval_duration_min Minimum larval duration (days)
larval_duration_max Maximum larval duration (days)
larval_duration_mod Modal larval duration (days)
shapeofyolksac Shape of yolk sac
body_form Larval body form (e.g., fusiform, elongated)
fb_table FishBase source table indicator

3.12.4.5.3.2 Larval Phenology (larvae_phenology.csv)

Field Description
species_id Unique species identifier
locality Geographic location of larval observations
jan-dec Monthly indicators of larval presence (1 = present, 0 = absent)
fb_table FishBase source table indicator

3.12.4.6 Larval Presence (larvaepresence)

3.12.4.6.1 Explanation of Data Content

The Larval Presence Table in FishBase provides information on the geographic and seasonal occurrence of fish larvae.

  • Monthly Presence (jan-dec):
    • Indicates the months in which larvae were observed at a given location.
    • Values are binary (1 = present, 0 = absent), providing insight into seasonal larval distributions.

3.12.4.6.2 Processing

The larvaepresence_phenology.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Retained species identification, geographic location, and monthly larval presence indicators.
  2. Data Transformation:
    • Converted missing values (NA) to 0, indicating no recorded presence.
    • Replaced occurrences of 111 with 1, ensuring correct binary encoding.
    • Used dplyr::mutate() to standardize monthly presence values.
  3. Filtering & Summarization:
    • Removed species without any recorded larval presence.
    • Ensured that at least one month contained larval presence data before inclusion in the final dataset.
  4. Exporting Cleaned Data:
    • The final dataset was written to larvaepresence_phenology.csv for further analyses.
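The monthly recoding can be sketched as below, on invented records. The twelve month columns are assumed to be named jan through dec after name cleaning; that naming is an assumption for the example.

```r
library(dplyr)

month_cols <- tolower(month.abb)   # "jan", "feb", ..., "dec"

# Toy records: two with observations, one without (values invented).
lp <- data.frame(spec_code = c(1, 2, 3), country = c("Canada", "Canada", "USA"))
lp[month_cols] <- NA_real_
lp$may[1] <- 1
lp$jun[1] <- 111
lp$jul[2] <- 111

larvaepresence <- lp |>
  mutate(across(
    all_of(month_cols),
    ~ case_when(is.na(.x) ~ 0, .x == 111 ~ 1, TRUE ~ .x)   # NA -> 0, 111 -> 1
  )) |>
  # keep records with larvae reported in at least one month
  filter(if_any(all_of(month_cols), ~ .x == 1)) |>
  mutate(fb_table = "larvaepresence")
```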

3.12.4.6.3 Processed Data Structure

Field Description
species_id Unique species identifier
country Country where the larval presence was observed
longitude Longitude of the observation site
latitude Latitude of the observation site
jan-dec Monthly indicators of larval presence (1 = present, 0 = absent)
fb_table FishBase source table indicator

3.12.4.7 Larval Dynamics (larvdyn)

3.12.4.7.1 Explanation of Data Content

The Larval Dynamics Table in FishBase provides data on the environmental conditions affecting fish larvae, including ecosystem type, water temperature, and larval duration.

  • Ecosystem (ecosystem):
    • Describes the habitat type where larval development occurs (e.g., Shelf, Freshwater, Marine).
    • Helps in assessing species adaptation to different environmental conditions.
  • Temperature (temperature):
    • Represents the recorded water temperature (°C) where larvae develop.
    • Critical for studying species-specific thermal tolerances and climate change impacts.
  • Larval Duration (duration):
    • Indicates the estimated number of days larvae remain in the larval stage.
    • Longer durations suggest extended planktonic phases, which can affect dispersal and survival.

3.12.4.7.2 Processing

The larvdyn.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted ecosystem type, water temperature, and larval duration for each species.
  2. Data Transformation:
    • Ensured categorical values for ecosystem were standardized.
    • Retained missing values (NA) where temperature or duration was unreported.
    • Applied dplyr::mutate() to tag records with the FishBase table name.
  3. Filtering & Summarization:
    • Removed species without any recorded larval dynamics data.
    • Retained unique records per species to prevent duplication.
  4. Exporting Cleaned Data:
    • The final dataset was written to larvdyn.csv for further analyses.

3.12.4.7.3 Processed Data Structure

Field Description
species_id Unique species identifier
ecosystem Ecosystem where larval development occurs
temperature Water temperature (°C) at larval development
duration Duration of the larval stage (days)
fb_table FishBase source table indicator

3.12.4.8 Reproduction (reproduc)

3.12.4.8.1 Explanation of Data Content

The Reproduction Table in FishBase provides information on the reproductive strategies of fish species, including their fertilization method, spawning patterns, and parental care.

  • Reproductive Mode (repro_mode):
    • Describes how sexes are structured in a species.
    • Common values include dioecism (separate sexes), hermaphroditism, and unisex populations.
  • Fertilization Type (fertilization):
    • Indicates whether fertilization occurs externally (in water) or internally (within the body).
  • Mating System (mating_system):
    • Provides details on the mating strategy, though this field is often unreported.
  • Spawning Pattern (spawning):
    • Describes the seasonal and geographic variation in spawning behavior.
    • Some species spawn throughout the year, while others have distinct reproductive seasons.
  • Batch Spawning (batch_spawner):
    • 0 = Does not spawn in batches.
    • 1 = Produces eggs in multiple spawning events.
  • Reproductive Guild (rep_guild1, rep_guild2):
    • Classifies species based on their spawning and parental care behaviors.
    • Primary guild (rep_guild1): Nonguarders, guarders, bearers.
    • Secondary guild (rep_guild2): More specific categorization, such as substratum egg scatterers or brood hiders.
  • Parental Care (parental_care):
    • Indicates if a species provides care for its eggs or offspring.
    • Values include none, guarding, mouthbrooding, or live-bearing behaviors.

3.12.4.8.2 Processing

The reproduc.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted key reproductive attributes such as reproductive mode, fertilization type, and parental care.
  2. Data Transformation:
    • Standardized categorical variables (repro_mode, fertilization, rep_guild1, rep_guild2, parental_care) to ensure consistency.
    • Ensured missing values (NA) remained where information was unreported.
  3. Filtering & Summarization:
    • Removed species without any recorded reproductive data.
  4. Exporting Cleaned Data:
    • The final dataset was written to reproduc.csv for further analyses.

3.12.4.8.3 Processed Data Structure

Field Description
species_id Unique species identifier
repro_mode Reproductive mode (e.g., dioecism, hermaphroditism)
fertilization Fertilization type (external or internal)
mating_system Mating strategy
spawning Spawning pattern (e.g., seasonal, year-round)
batch_spawner Indicator of batch spawning (1 = Yes, 0 = No)
rep_guild1 Primary reproductive guild classification
rep_guild2 Secondary reproductive guild classification
parental_care Type of parental care (e.g., none, mouthbrooding)
fb_table FishBase source table indicator

3.12.4.9 Spawning (spawning)

3.12.4.9.1 Explanation of Data Content

The Spawning Table in FishBase provides detailed information on the reproductive seasonality and environmental requirements of fish species.

  • Spawning Traits (spawning_traits):
    • Temperature Requirements (temp_low, temp_high): Indicates the range of temperatures (°C) at which spawning occurs.
    • Fecundity Estimates (fecundity_min, fecundity_max): Represents the estimated range of eggs produced per spawning event.
    • Spawning Cycles (spawning_cycles): Number of spawning events per year.
    • Spawning Habitats (coastal, lacustrine, riverine, estuarine): Indicates the type of environment where spawning takes place.
  • Spawning Phenology (spawning_phenology):
    • Geographic Information (country, longitude, latitude): Location details of where spawning was observed.
    • Monthly Spawning Presence (jan-dec): Binary indicators (1 = active spawning, 0 = no activity) showing seasonal trends in spawning behavior.

3.12.4.9.2 Processing

The spawning_traits.csv and spawning_phenology.csv files were generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted environmental and reproductive attributes for spawning_traits.csv.
    • Retained geographic and temporal indicators for spawning_phenology.csv.
  2. Data Transformation:
    • Computed mean temperature and fecundity values per species using dplyr::group_by() and dplyr::summarise().
    • Converted missing values (NA) to 0 for monthly spawning indicators.
    • Standardized habitat types using tidyr::pivot_wider() to create binary presence/absence indicators.
  3. Filtering & Summarization:
    • Retained only species with at least one recorded spawning event.
    • Removed records where all monthly spawning indicators were 0.
  4. Exporting Cleaned Data:
    • The final datasets were written to spawning_traits.csv and spawning_phenology.csv for further analyses.
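For time-series work, the wide monthly indicators can be reshaped into one row per species, location, and month. The sketch below shows one way to do this with tidyr::pivot_longer() on invented data. The month column names (jan through dec) and the output column names month and spawning are assumptions for the example, not the layout of the exported file.

```r
library(dplyr)
library(tidyr)

month_cols <- tolower(month.abb)   # "jan", "feb", ..., "dec"

# Toy wide-format phenology records (values invented).
sp <- data.frame(spec_code = c(1, 2), country = c("Canada", "Canada"))
sp[month_cols] <- 0
sp$apr[1] <- 1
sp$may[1] <- 1
sp$oct[2] <- 1

spawning_long <- sp |>
  pivot_longer(all_of(month_cols), names_to = "month", values_to = "spawning") |>
  mutate(month = match(month, month_cols)) |>   # month name -> 1..12
  filter(spawning == 1)                         # keep active months only
```

Storing the month as an integer keeps the rows sortable and makes it straightforward to compute season length or join against monthly environmental data.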

3.12.4.9.3 Processed Data Structure

3.12.4.9.3.1 Spawning Traits (spawning_traits.csv)

Field Description
species_id Unique species identifier
temp_low Minimum temperature (°C) at which spawning occurs
temp_high Maximum temperature (°C) at which spawning occurs
fecundity_min Minimum estimated fecundity
fecundity_max Maximum estimated fecundity
spawning_cycles Number of spawning events per year
coastal Presence in coastal spawning habitats (binary)
lacustrine Presence in lake spawning habitats (binary)
riverine Presence in river spawning habitats (binary)
estuarine Presence in estuarine spawning habitats (binary)
fb_table FishBase source table indicator

3.12.4.9.3.2 Spawning Phenology (spawning_phenology.csv)

Field Description
species_id Unique species identifier
country Country where spawning presence was observed
longitude Longitude of the observation site
latitude Latitude of the observation site
jan-dec Monthly indicators of spawning presence (1 = present, 0 = absent)
fb_table FishBase source table indicator

3.12.4.10 Species (species)

3.12.4.10.1 Explanation of Data Content

The Species Table in FishBase provides fundamental taxonomic and biological information for fish species.

  • Taxonomic Information:
    • Scientific Name (scientific): The genus and species combination used for taxonomic classification.
    • Preferred Image (pic_preferred_name): The filename of the representative species image.
  • Habitat and Distribution:
    • Freshwater (fresh), Brackish (brack), Saltwater (saltwater): Binary indicators (1 = present, 0 = absent) representing habitat preferences.
    • Pelagic or Demersal (demers_pelag): Classification indicating if the species is bottom-dwelling (demersal) or lives in the water column (pelagic).
    • Air Breathing (air_breathing): Indicates if the species is capable of air breathing.
    • Anadromy and Catadromy (ana_cat): Classifies migratory species that move between freshwater and marine environments.
  • Life History Traits:
    • Longevity in the Wild (longevity_wild): Maximum recorded lifespan (years).
    • Maximum Length (length): Maximum total length (cm) recorded for the species.
    • Common Length (common_length): The most frequently observed length (cm) for the species.
    • Maximum Weight (weight): Maximum recorded weight (grams) for the species.
  • Economic Importance:
    • Importance (importance): Classifies species as commercial, minor commercial, subsistence, or game fish.
    • Game Fish (game_fish): Binary indicator (1 = game fish, 0 = not targeted for sport fishing).

3.12.4.10.2 Processing

The species.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Retained taxonomic, habitat, and biological attributes essential for ecological and fisheries research.
  2. Data Transformation:
    • Standardized categorical variables (body_shape_i, demers_pelag, ana_cat, importance) for consistency.
    • Converted missing values (NA) to 0 for binary habitat fields (fresh, brack, saltwater).
  3. Filtering & Summarization:
    • Removed species lacking taxonomic or ecological data to maintain dataset integrity.
    • Mapped species to FishBase images where available.
  4. Exporting Cleaned Data:
    • The final dataset was written to species.csv for further analyses.

3.12.4.10.3 Processed Data Structure

Field Description
species_id Unique species identifier
scientific Genus and species combination
pic_preferred_name Filename of the representative species image
body_shape_i Body shape classification (e.g., fusiform)
fresh Presence in freshwater habitats (binary)
brack Presence in brackish water habitats (binary)
saltwater Presence in saltwater habitats (binary)
demers_pelag Classification as demersal or pelagic
air_breathing Indicator of air-breathing ability (binary)
ana_cat Migration category (e.g., anadromous, catadromous)
longevity_wild Maximum recorded lifespan (years)
length Maximum recorded length (cm)
common_length Most frequently observed length (cm)
weight Maximum recorded weight (grams)
importance Economic importance (e.g., commercial, game fish)
game_fish Indicator of game fish status (binary)
fb_table FishBase source table indicator

3.12.4.11 Swimming (swimming)

3.12.4.11.1 Explanation of Data Content

The Swimming Table in FishBase provides information on the swimming strategies of fish species.

  • Swimming Type (adult_type):
    • Describes the primary movement mechanism used by adult fish.
    • Common values include:
      • Movements of body and/or caudal fin (e.g., undulatory propulsion).
      • Movements of median and/or paired fins (e.g., fin propulsion without significant body movement).
  • Swimming Mode (adult_mode):
    • Classifies species based on their swimming kinematics.
    • Common categories:
      • Anguilliform (e.g., eels—body waves propagate the entire length).
      • Carangiform (e.g., jacks—flexion in the posterior half of the body).
      • Subcarangiform (e.g., salmon—less body flexion than carangiform).
      • Thunniform (e.g., tuna—efficient, high-speed swimming with minimal body movement).
      • Labriform (e.g., wrasses—use of pectoral fins for propulsion).

3.12.4.11.2 Processing

The swimming.csv file was generated using the following processing steps:

  1. Selection of Relevant Fields:
    • Extracted adult swimming type and mode to retain core locomotive attributes.
  2. Data Transformation:
    • Standardized categorical variables (adult_type, adult_mode) for consistency.
    • Ensured missing values (NA) remained where data was unreported.
  3. Filtering & Summarization:
    • Removed species without recorded swimming data.
  4. Exporting Cleaned Data:
    • The final dataset was written to swimming.csv for further analyses.

3.12.4.11.3 Processed Data Structure

Field Description
species_id Unique species identifier
adult_type Primary movement mechanism
adult_mode Swimming kinematics classification
fb_table FishBase source table indicator

3.13 References

Benoit, D., Zielinski, D., Swanson, R., et al. (2023). FishPass sortable attribute database: Phenological, morphological, physiological, and behavioural characteristics related to passage and movement of Great Lakes fishes [dataset]. Available at: https://doi.org/10.5061/dryad.fqz612jwj.
Boettiger, C., Temple Lang, D., and Wainwright, P. (2012). rfishbase: Exploring, manipulating and visualizing FishBase data from R. Journal of Fish Biology. Available at: https://doi.org/10.1111/j.1095-8649.2012.03464.x.
Canada, N. R. (2022a). Atlas of Canada national scale data 1:1,000,000 - rivers. Natural Resources Canada. Available at: https://open.canada.ca/data/en/dataset/87bb794c-eabf-5eea-bdb3-61d0338d9594.
Canada, N. R. (2022b). Atlas of Canada national scale data 1:1,000,000 - waterbodies. Natural Resources Canada. Available at: https://open.canada.ca/data/en/dataset/e9931fc7-034c-52ad-91c5-6c64d4ba0065.
Canada, N. R. (2022c). National hydro network (NHN) - GeoBase series. Natural Resources Canada. Available at: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977.
Dahlke, F. T., Wohlrab, S., Butzin, M., and Pörtner, H.-O. (2020a). Fish survival at the upper thermal limits: Effects of size-dependent oxygen supply and climate change. Science 369, 65–70. doi:10.1126/science.aaz3658.
Dahlke, F., Wohlrab, S., Butzin, M., and Pörtner, H.-O. (2020b). Experimental data compilation, thermal tolerance and thermal responsiveness of fish species and life stages [dataset]. doi:10.1594/PANGAEA.917796.
Dean, E. M., Cooper, A. R., Wang, L., Daniel, W. M., David, S., Ernzen, C., et al. (2021). The North American freshwater migratory fish database (NAFMFD): Characterizing the migratory life histories of freshwater fishes of Canada, the United States, and Mexico. U.S. Geological Survey. Available at: https://doi.org/10.5066/P9WDLLP0.
Dean, E. M., Cooper, A. R., Wang, L., Daniel, W., David, S., Ernzen, C., et al. (2022). The North American freshwater migratory fish database (NAFMFD): Characterizing the migratory life histories of freshwater fishes of Canada, the United States and Mexico. Journal of Biogeography, Early View. Available at: https://doi.org/10.1111/jbi.14367.
Eakins, R. J. (2024). Ontario freshwater fishes life history database. Ontario Freshwater Fishes. Available at: https://www.ontariofishes.ca/home.htm.
Froese, R., and Pauly, D. (2024). FishBase freshwater species checklist. FishBase. Available at: https://www.fishbase.se/country/CountryChecklist.php?what=list&trpp=50&c_code=124&csub_code=&cpresence=present&sortby=alpha2&vhabitat=fresh.
Froese, R., and Pauly, D. (Editors). (2024). FishBase. Available at: https://www.fishbase.org.
Global Administrative Areas (2023). GADM database of global administrative areas, version 4.1. GADM. Available at: https://gadm.org.
Global Biodiversity Information Facility (2025). GBIF occurrence download https://doi.org/10.15468/dl.fxqrjh accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2025-01-23. GBIF Secretariat.
Roberge, M., Hume, J. M. B., Minns, C. K., and Slaney, T. (2002). Life history characteristics of freshwater fishes occurring in British Columbia and the Yukon, with major emphasis on stream habitat characteristics. Fisheries and Oceans Canada.
Tedesco, P. A., et al. (2017a). A global database on freshwater fish species occurrences in drainage basins. Figshare. Available at: https://figshare.com/collections/A_global_database_on_freshwater_fish_species_occurrences_in_drainage_basins/3739145.
Tedesco, P. A., Beauchard, O., Bigorne, R., et al. (2017b). A global database on freshwater fish species occurrences in drainage basins. Scientific Data 4, 170141. doi:10.1038/sdata.2017.141.