Hydropedological digital mapping: machine learning applied to spectral VIS-IR and radiometric data dimensionality reduction

ABSTRACT Pedosphere-hydrosphere interface accounts for the association between soil hydrology and landscape, represented by topographic and Remote Sensing data support and integration. This study aimed to analyze different statistical radiometric and spectral data selection methods and dimensionality reduce environment-related data to support the classification of soil physical-hydric properties, such as soil basic infiltration rate (bir) and saturated hydraulic conductivity (Ksat); as well as to act in data mining processes applied to hydropedological properties digital mapping. Accordingly, research integrated information from Visible to Infrared (VIS-IR) spectral indices and Sentinel’s 2A mission Multispectral Instrument (MSI) sensor bands, terrain numerical modeling and aerogeophysics set to model soil-water content in two soil layers (0.00-0.20 m and 0.20-0.40 m). Pre-processed data were subjected to statistical analysis (multivariate and hypothesis tests); subsequently, the methods were applied (variation inflation factor - VIF, Stepwise Akaike information criterion – Stepwise AIC, and recursive feature elimination - RFE) to mine covariates used for Random Forest modeling. Based on the results, there were distinctions and singularities in spectral and radiometric data selection for each adopted method; the importance degree, and contribution of each one to soil physical-hydric properties have varied. According to the applied statistical metrics and decision-making criteria (highest R2 and lowest RMSE / MAE), the chosen methods were RFE (0.00-0.20 m layers) and Stepwise AIC (0.20-0.40 m layers) - both concerned with the assessed variables (bir and Ksat). This approach captured the importance of environmental variables and highlighted their potential use in hydropedological digital mapping at Guapi-Macacu watershed.


INTRODUCTION
Remote Sensing techniques significantly contributed to the understanding and supporting environmental phenomena on Earth's surface, especially in correct environmental resources management and sustainability studies (Brenner and Guasselli, 2015).The ability to collect information from different electromagnetic spectrum wavelengths (multiband) acquired by remote sensors is the main reason for this contribution, as it allows differentiating terrestrial targets and covering large areas (Brenner and Guasselli, 2015).Therefore, soil basic infiltration ratio (vib) and saturated hydraulic conductivity (Ksat) could be map and modeled according to spectral and topographic relationships extracted from products obtained from remote sensing.
Also, a significant advancement in the environmental geophysics field is notable, which focuses on hydrogeological research accounting for combining airborne data gathered by geophysical sensors with field data related to soil and rocks lithology, mineralogy, physics, and chemistry.These studies can assess and model water resources distribution in groundwater reservoirs in sedimentary watersheds (Novakowski et al., 2006;Madrucci et al., 2008;Kirsch, 2009;Lee et al., 2012;Lee and Lee, 2015;Pires and Miranda, 2017).
However, there's still a lack of studies concerning geophysical maps potential integration in water favorability (determining areas presenting the greatest groundwater occurrence potential) and in hydropedological dynamics, as well as research treating such data as input variables in Digital Soil Mapping (DSM).This scenario has inspired the search for an in-depth evaluation of the potential of this tool in soil-water modeling studies.
The literature brings several references about topographic attributes selection and classification to map soil properties.These studies follow geomorphometric Digital Elevation Model (DEM) covariates to depict soil/landscape ratio in a given study site (McKenzie and Austin, 1993;Wilson and Gallant, 2000;Böhner and Selige, 2006;Oliveira et al., 2017;Santos et al., 2019).Nowadays, digital soil mapping applied to soil properties remains mainly based on soil morphometric parameters; however, given the advancements in remote sensing applied to spatial-spectral resolutions and data availability, adopting indices based on the spectral bands association in predictive spatial modeling techniques, such as machine learning, has become a common reality in the target variables quantification (Chagas, 2006;Pinheiro, 2012;Cunha, 2013;Zhang et al., 2017).
Accordingly, although remote sensing using in digital modeling has grown, it is also possible that spectral indices' individual influence remains poorly explored when it comes to soil physical-hydric attributes modeling.Studies have focused on soil physical-chemical properties mapping, climatic regionalization and forest species differentiation based on spectral indices (Carvalho Junior et al., 2011;Pinheiro et al., 2019;Rajah et al., 2019).
Consistent data collection and integration are fundamental requirements for modeling aiming to reflect a representative assessment of the water resources state (Baalousha, 2010).However, the crucial point remains in treating these data to transform them into reliable information.The dynamic attributes of spatial variability related to physicalhydraulic aspects are generally controlled by various properties and variables in soils, mainly conditioned by the natural landscape (excluding atypical or extreme landscape alterations), with different effects depending on the analyzed depth (Manzione and Castrignanò, 2019).
Linked to this, factors such as difficulty in measuring dynamic physical-hydric variables in the field (recognition of the area; high costs in carrying out field campaigns; team displacement in large areas for data coverage; sluggishness in sample collections in one or more depth levels), the collected data absence for these variables in Brazil traditional technical soils surveys (hydropedological tests non-performance in the field and laboratory analysis) and restrictions naturally imposed by the study region (accessing Rev Bras Cienc Solo 2023;47:e0220149 regions infeasibility with large slopes and areas with dense forest), limit these properties mapping (Ottoni, 2005;Oliveira et al., 2017).
Considering that soil physical-hydric parameters are influenced by genetic soil properties associated with morphology, texture, and soil physics such as: particle size and aggregation degree, total porosity, soil density, soil surface cover type, profile moisture, organic matter amount, among others (Reichert et al., 1992;Everts and Kanwar, 1993;Bertoni et al., 2017), and also due to study area spatial variability (Klar, 1984); evaluate the potential spectral and radiometric data inputs that contribute to these properties quantification in digital pre-modeling can make the mapping process more robust and reliable in the study area.
Few studies in Brazil focus on modeling the surface and subsurface behavior of water in the soil, particularly considering variable pre-selection protocols and data dimensionality reduction.Most current studies explore statistical methods and data analysis techniques for modeling physical-hydric properties, but they do not significantly explore the input variables of these models (Granata et al., 2022;Yamaç et al., 2022).Furthermore, they rely on established knowledge of soil morphology properties (such as texture, porosity, soil density, and particle size) as the main input variables in the modeling process (McKeague et al., 1982;Chapuis, 2012).
Studies like Manzione and Castrignanò (2019), Granata et al. (2022) and Yamaç et al. (2022), on the other hand, focus on exploring models based on regressive analysis, geostatistics and machine learning using Ksat measurement best practices.However, they do not discuss the importance of variable pre-selection in the pre-modeling phase and fail to identify potential interrelated attributes that could be used to predict physicalhydraulic attributes beyond the previously mentioned soil properties.McBratney et al. (2003) suggest, among other topics, studies on potential environmental covariates for applications in DSM as an open topic for further discussions within the academic community.
On the contrary, certain studies, such as Fathololoumi et al. (2021), discuss a variable reduction in environmental covariates utilized in a soil moisture prediction Digital Soil Mapping (DSM) approach, incorporating satellite images and morphological covariates.However, this study solely relies on autocorrelation and collinearity analysis for variable reduction.It fails to explore model protocols within DSM to pre-select these variables (for instance, statistical hypothesis test applied in data set modeled by regression models like the cubist model used) or provide justification for their inclusion (previous environmental covariates analysis to understand all potential data that can be used to represent the area dynamics, not only the common explored ones in DSM).It is worth noting that the only topic addressed in the study is the feature importance obtained throughout the modeling process and predictive model performance step.
Adequate variability assessment of these properties can effectively aid in understanding the processes that cause attributes spatial variation, enabling proper water resources management in river basins and correct soil management.However, Atkinson and Tate (2000) and Gotway and Young (2002) infer that many statistical problems arise when integrating different data obtained at different scales and characterized by different support and uncertainty.
Conducting studies to define the most important input variables to represent soil physicalhydric properties variability and to rule out modeling issues associated with multicollinearity inflation and autocorrelation in data sets is essential, especially those models based on regressive analyses-based models to integrate physical-hydric attributes, in a robust and refined way (O'Hagan and McCabe, 1975;Andrews, 1991;Carvalho et al., 1999;Mihola and Bílková, 2014).Thus, some methods can optimize input variables selection and reduce dimensionality in the assessed data set, ensuring robustness and refinement in the modeling process.
Considering the Guapi-Macacu watershed as the study area target, the knowledge developed in these Digital Soil Mapping (DSM) stages enables sustainable management of water resources in the Rio de Janeiro State.This approach prevents issues such as aquifer depletion, reduction of groundwater levels and riverbeds, biodiversity loss, and negative impacts on agriculture due to water scarcity or reduced water availability.
Inspired by such a scenario, the present study aims to: (1) analyze different statistical methods based on multivariate analysis applied to select and reduce dimensionality, using topographic and remote sensing data related to vegetation, soil, and geology; and (2) map soil water properties spatial variability in Guapi-Macacu watershed in Rio de Janeiro, Brazil, based on Random Forest (RF) classifier and on using the most relevant variables selected in the previous data mining stage.

Study area
Guapimirim-Macacu river basin was selected for the current study, which is located within Guanabara Bay Hydrographic Region V (HR-V) domain in Rio de Janeiro State Metropolitan Region (Figure 1).It encompasses Itaboraí, Guapimirim and Cachoeiras de Macacu counties political-managerial limits.The basin has 1,250.78km² of capturearea and 199.2 km of perimeter.Its relief is featured by elevation ranging 0-2,254 m, which was assessed through the region's Digital Elevation Model (DEM), at 20 m spatial resolution (Figure 1).
The prevailing climate in the region is rainy tropical type, with dry winter.Mean annual temperature is close to 23 °C and mean annual rainfall ranges from 1,200 to 2,600 mm due to mountain buttresses (HWA et al., 2010).The region stands out for housing the Atlantic Forest biome, with tropical forest fractions and Mar de Morros environment vegetation characteristic with transition to Coastal Lowland (Pinheiro, 2015).
Regions geology is featured by Gráben da Guanabara: an elongated valley with plane bottom, with geological faults forming a sequence of sedimentary decomposition caused by the tectonic activity that had started in the Tertiary, at Macau Sedimentary Basin formation (Ferrari, 2001).Depositional events that have taken place in the region determine its geomorphological features, presenting alluvial, river and lacustrine features (Pinheiro, 2012).
Region soils present sedimentary deposition environment features given the local geomorphology's lacustrine and river influence.The basin is surrounded by valleys and mountains due to Mar de Morros environment's transition to high-altitude fields, reaching altitude up to 2,254 m, and leading to soils presenting great pedogenetic variability and taxonomic diversity such as: Ultisols, Cambisols, Gelisols, Oxisols and Neosols (Pinheiro, 2012).

Macacu database and hydropedological tests
The pedological database provides the collected samples morphological descriptions and physical-chemical analyses.They are an important part of the study since it focuses on featuring the basins soil.Data were collected in the basin area based on the Conditioned Latin Hypercube Sample (cLHS) technique.This technique allows using computer resources to infer the selected points in the area.These resources are highly representative of Guapi-Macacu River basin's environmental features (Mckay et al., 2000;Minasny and McBratney, 2006;Carvalho Júnior et al., 2014).
Rev Bras Cienc Solo 2023;47:e0220149 Soil basic infiltration rate (bir) data were collected through hydropedological survey campaigns (two in total, which occurred in September 2019 and February 2020), which were carried out in situ in the watershed; these campaigns used the Guelph permeameter, model 2800K1, by Soil Moisture, whose associated-measurement method was elaborated by Reynolds and Elrick (1985), and enhanced at University of Guelph, in Canada, back in 1985 (Elrick et al., 1989).The application of Darcy's equation's, and their evolutions (Darcy, 1856;Richards, 1931;Reynolds and Elrick, 1985) allowed finding the saturated water conductivity (Ksat) values, which were estimated based on bir by using formulas based on known parameters estimated in the laboratory.
The Ksat values can be transformed based on the measured bir data, according to the equipment's parameters (water column, water load and water column diameter), soil features (porosity, genesis and storage coefficient), on Darcy's equation and of their evolutions (Darcy, 1856;Richards, 1931;Reynolds and Elrick, 1985).In total, thirty-six (36) data values for both physical-hydric attributes were measured in the study site.
The measured data were separated into four levels, one of them regarding the soil water condition (hydromorphic and non-hydromorphic) and the other the layer analyzed (surface, ranging 0.00-0.20 m in depth, and subsurface, ranging 0.20-0.40m in depth).
Finally, data were subjected to pedotransfer functions calibration based on intrinsic soil features (particle size composition, soil density, water pH, water dispersed clay, porosity, particle density, organic carbon, and T-value).The remaining points (83 total) were estimated by pedotransfer functions applied in soils vertical modeling in the basin.This estimate resulted in 122 points with the surface (0.00-0.20 m) and subsurface (0.20-0.40 m) information about the analyzed properties (Figure 1).These points encompassed the analyzed data set in the current research.The previously described stages were conducted by the authors during analysis performed prior to the current study.

Terrain numerical modeling: covariables extraction and association with basin's landscape
SAGA GIS (Conrad et al., 2015), an Geographic Information System (GIS) focused on geo-scientific analysis, was used to plot topographic maps aimed at depicting surface morphometric parameters (primary and secondary covariables) deriving from hydrologically consisted digital elevation model of orthometric altitudes (altimetry) applicable to topographic surface aggregated to vector elements in the basin, such as vegetation cover.Vector data used to generate DEM resulted from Instituto Brasileiro de Geografia e Estatística (IBGE) Geosciences repository in partnership with Secretaria Estadual do Ambiente (SEA), constituting a continuous cartographic database of Rio de Janeiro State's planialtimetric survey, at 1:25,000 scale.
Basin's political-managerial region limit mask was acquire from geo-spatial data available at Instituto Estadual do Ambiente (GeoINEA) page for data clipping in the target areas, which were delimited based on compiled data from a series of topographic maps at 1:50,000 scale, provided by National Cartographic System (NCS).The basin's limit was also standardized based on datum and coordinate systems used to treat IBGE's vector data.The next step involved proceeding with the DEM extraction by using TopoToRaster interpolator in ESRI ArcGIS 10.6 environment (Redlands, 2011).This tool was designed to create digital elevation models with hydrological consistency.
Vector data were interpolated at 20 m spatial resolution to create regions DEM.Subsequently, a hydrologic consistency analysis based on Flow Accumulation, Flow Direction and Fill tools was applied over DEM, turning the model into a Hydrologically Consisted Digital Elevation Model (HCDEM).Accordingly, the assessed region's elevation map was obtained (Figure 1).
Terrain covariables were carefully selected to design the basin's landscape.This process considered pedological and landscape environmental elements (soil, vegetation, hydrology, and geology) observed in it.Radiometric matrix data and their features are shown in table 1.
The primary and secondary selected attributes (36 attributes) presented significant expressiveness in representing the landscape's environmental properties and active processes, and it has featured the soil formation factors, mainly relief.Therefore, the basin-representative covariables selection was a priority because it concerned the continuous surfaces extraction based on Hydrology, Lighting, Visibility, Morphometry, Slope Stability and Channels tools application, as well as on SAGA GIS Terrain Analysis modulus, which can quantify mountains eco-systemic variables (altitude fields) and region's native biome (Atlantic Forest biome).

Aerogeophysical data acquisition and processing of Sentinel-2A optical images
The Brazil's Geological Survey Geosciences system (GeoSGB), an online repository maintained by Companhia de Pesquisa de Recursos Minerais (CPRM), provided the magnetometry and gamma spectrometry aerogeophysical data used in the current study (CPRM, 2012).This system provides Aerogeophysical Survey Projects aerial bases.These projects are conducted by CPRM and other institutions, and the database holds airborne sensor data used to detect magnetometric, gamma spectrometric, gravimetric, and radiometric parameter values, among other data about Brazil national territory.
Raw aeromagnetometric and gamma spectrometric digital data were collected in GeoSGB aerogeophysical projects database, at 500 m distancing between flight lines.These data were processed in Geosoft Oasis Montaj software.Aeromagnetometric data were treated to correct common aerial survey and equipment issues (parallax error correction, diurnal variation removal, profile leveling and micro-levelling, and geomagnetic field trend surface definition called International Geomagnetic Reference Field IGRF).These procedures were performed by CPRM in its final processing report (CPRM, 2012).
Rev Bras Cienc Solo 2023;47:e0220149 Corrected tabular data were spatialized in software for georeferencing and for the Anomalous Magnetic Field's (AMF) radiometric bi-directional grid creation using tridimensional coordinates (x, y, z) and magnetometric value measured by IGRF.Values were interpolated at maximum of 1/5 of flight lines to avoid post-processing magnetometric signature losses (MagMap module).The lacking grid values were also filled through dummy interpolation based on square method (Grid and Image modulus).The anomalous magnetic field (AMF) map was generated by the radiometric grid, at spatial resolution of

S Geomorphons (Gm) VI
Landscape classification based on landforms; region's geomorphology description.
Reduction to pole (RTP) was simulated based on the derivative of axis x, by plotting the ASA map, because Rio de Janeiro State is in a low-altitude region.This simulation allowed to highlight the geologic profile limits (most superficial mining).Finally, spectral signature variation rate was expressed in shorter wavelength to scour smaller bodies (mineralogy and smaller rocks) represented in ASA map.Slope (-35.64),declination (-21.69) and the corrected amplitude (-54.36) were IGRF's parameters necessary to reduce aerial survey data (January 8, 2012).The two first parameters were automatically calculated by Oasis Montaj, based on aerial survey's data input.
As CPRM's recommendations (CPRM, 2012), the minimum curvature interpolator was used for referred gamma data to create Uranium (U), Thorium (Th) and Potassium (K) grids.Also, the interpolation was based on one-fifth of flight lines to equate with aeromagnetometric data's spatial resolution (100 m).Grid lacked information found was corrected based on the square method (Grid and Image modulus) through dummy interpolation.Finally, a radiometric ternary map was plotted in red, green, and blue composition (RGB) depending on radioelements bands (R-potassium, G-thorium, B-uranium) represented by the ternary triangle, using Oasis Montaj Grid and Image modulus.
Multispectral images from MSI/Sentinel-2 mission were selected based on data proximity and availability in Copernicus Sentinel Hub repository to meet the hydropedological data collection period (September 2019 and February 2020).The images acquisition dates were August 2, 2019 (for the first field survey) and February 10, 2020 (for the second field survey).The conducted treatment and processing were based on transforming radiance information into surface reflectance and allowing spatial resolution matching between sensor's bands by transforming the 10 m bands resolution into 20 m ones to make pixels compatible for band algebra performance (index calculations).The spectral bands used in analysis were Sentinel-2A mission B2 to B8A, B11 and B12 bands to identify possible relational patterns with the physical-hydric attributes (ESA, 2020).Spectral indices were calculated using visible-infrared (VIS-IR) bands.The spectral indices selection related to vegetation, soil and geology was carefully made, considering possible associations with the study variables, as shown in table 2.
All spectral data treatment and processing procedures were carried out through statistical routine implemented in RStudio environment (Rstudio Team, 2020), based on sen2R package used to process Sentinel-2A images.The indices were also calculated in this environment by using images from nine spectral bands (VIS-IR) from Sentinel 2A mission based on bands mathematics.

Modeling: selection methods and data dimensioning reduction, random forest, and validation
Initially, the input data underwent analysis using four statistical adjustment techniques: autocorrelation analysis, multi-collinearity analysis, principal components analysis, and hypothesis tests.These techniques were employed to apply methods, minimize errors, and avoid information loss.
Subsequently, statistical methods based on regression analysis, multivariate analysis, and machine learning were used to select and reduce the data dimensionality.Specifically, the Variation Inflation Factor (VIF), Stepwise Akaike Information Criterion (Stepwise AIC), and Recursive Feature Elimination (RFE) were applied.These analyses were implemented using routines in the RStudio environment.
Rev Bras Cienc Solo 2023;47:e0220149 The VIF method was used to address multicollinearity issues by quantifying the increase in variance of an estimated regression coefficient due to collinearity.This method consists of the quotient between a model's variance with several terms and the model's variance with a single term.It quantifies this variance severity based on ordinary least squares and provides an index to measure to which extent the variance (the square of the estimate's standard deviation) of an estimated regression coefficient increases due to collinearity (Daniel and Wood, 1999).
Stepwise AIC is an automatic and iterative method used to select variables based on regression analysis.It evaluates each variable's contribution to the model and determines whether it should be added or subtracted based on a pre-specified criterion (AIC).Based on a pre-specified criterion, AIC can interactively add or remove a variable from a set of explanatory variables.It is done by estimating the relative amount of information lost by a given model, in case an important variable is taken out of the analysis (Taddy, 2019;McElreath, 2020).The RFE is a selection method used in machine learning models and Data Mining universe.It aims to select predictor variables that best fit the desired model, whether regression, multivariate, or machine learning.The RFE optimizes the model by selecting the most relevant predictor variables (Blum and Langley, 1997;Bradley et al., 1998).Accordingly, RFE results in predictor variables selection to better optimize the model one wishes to set (Svetnik et al., 2004).; ; ; Segal (1982); Drury (1987); Rowan and Mars (2003) Grain Size Index (GSI) Brenner and Guasselli (2015) Non-Linear Index (NLI) Goel and Qin (1994) Soil Adjusted Vegetation Index (SAVI) 1.5 0.5
Rev Bras Cienc Solo 2023;47:e0220149 The predictors selection based on importance classifications follows a Backward Selection approach, often used in Random Forest models, aiming to integrate physical-water attributes in a robust and refined way.This approach leverages multicollinearity principles to limit the number of predictors and select variables based on their importance in decision trees, and, therefore, it also helps select the variables to be used in the final model by reducing predictors in target scores.
To reduce data dimensionality, Principal Component Analysis (PCA) was applied.Abdi and Williams (2010) defined PCA as a multivariate technique to describe data observed through several inter-correlated quantitative dependent variables that aim to extract important information.The technique represents the data as orthogonal variables called principal components, enabling the patterns and similarities identification among observations and variables as multi-dimensional vectors by grouping input variables responses without losses.
The Random Forest algorithm was chosen to classify water resources given its easy implementation and robustness (Blum and Langley, 1997;Bradley et al., 1998).Predictive property values obtained from the Random Forest model (quality parameters) were used to determine the method with the best performance in estimating physical-hydric data.
The input data were previously separated into training (70 %) for the Random Forest models implementation, and testing (30 %) for quality validation purpose.Quality criteria set for models encompassing the selected and reduced variables (from the database and based on the implemented methods) resulted from careful statistical analysis, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), determination coefficient (R 2 ), and Random Forest model importance evaluation.This statistical approach was used to assess the model's accuracy and precision under bir/Ksat values estimated (Schaap and Leij, 1998;Schaap, 2004).In addition, the model with the best prediction performance and highest correctness degree was selected based on the validation results, because validation implementation used 30 % of the non-trained sampling data.
The selection methods and data dimensionality reduction allowed predicting bir and Ksat values for the Macacu basin entire area grid.The surfaces mapped based on both attributes resulted from data separation criterion into two soil layers: 0.00-0.20 and 0.20-0.40m.The final resulting cartographic products included four maps estimating bir and Ksat attributes at two soil layers (surface 0.00-0.20 m and subsurface 0.20-0.40m) in the basin, based on the Random Forest classification.
The analyses were conducted using applied statistical routines in R language and RStudio software environment (R Core Team, 2020; Rstudio Team, 2020).Packages such as Raster, sp, sf, shapefiles, openxlsx, Sen2R (Sentinel-2A images treatment and process), caret, corrplot, usdm, FactoMineR, randomForest, and ggplot2 (graphic plotting) were utilized for data processing (manipulate geo-spatial data in tabular, matrix and vector formats), selection methods development, image processing, classification, and visualization.The cartographic products were produced using open-source software Quantum GIS (QGIS) (QGIS Development Team, 2020).The described methodological steps summary is presented in the figure 2 flowchart.

Autocorrelation and principal component analysis
Autocorrelation method performs an initial examination of data behavior through meticulous analysis conducted by the researcher, considering prior knowledge about the database and utilizing the Pearson's correlation matrix.Inman (1994)   Spectral data within the matrix (Figure 3) exhibit interdependencies that may indicate the multicollinearity or high dimensionality presence, wherein multiple data points represent identical or highly similar features.Principal Component Analysis (PCA) was employed to assess the spectral data contribution, addressing this issue.The PCA analysis accounted variance of over 90 % in both principal components, enabling a comprehensive understanding of the model through a two-dimensional spectral variables analysis (Figure 4a).Upon principal components spatial observation, it becomes apparent that the variables (Figure 4b) are separated into three distinct groups based on their spectral characteristics.Notably, the IO and GSI indices contributed proportionally less to group 1 in physicalhydric variables explanation.Figure 4b illustrates the variables separation into group 1 (10 variables), group 2 (3 variables), and group 3 (10 variables).It is worth noting that two variables from group 1, namely the IO and GSI indices, warrant closer examination during the modeling stage.These results underscore the spectral data significance derived from remote sensing in modeling, as they account for most of its variability.It is advisable to reduce the dimensionality of the data associated with these variables (Figure 4) to mitigate issues arising from multicollinearity.
Further discussions about the variables analyzed in this step can be found in dos Santos et al. (2019) and Santos et al. (2020) studies.The Guapi-Macacu basin soil attributes characterization is wider discussed by Santos et al. (2022) and can be integrated with the analysis carried out by the cited authors.Correlation and dimensionality analyses play a crucial role in multivariable studies.However, they serve as preliminary steps before a specific method application, to explore and identify patterns in the study site's data.But they do not account for factors such as homoscedasticity (trends absence in error variances), multicollinearity, and normality.Neglecting these considerations can have a detrimental impact on drawing direct conclusions about the study outcomes.Addressing this, statistical requirements demanded by the chosen methods can be met by applying criteria such as the Breusch-Pagan test (Breusch, 1978), Shapiro-Wilk test (Shapiro and Wilk, 1965), and Durbin-Watson test (Durbin and Watson, 1950;Watson and Durbin, 1951;King, 1992); to the databases.These tests help adjust accordingly the data, ensuring compliance with the necessary statistical assumptions.
Multicollinearity, as indicated by Mansfield and Helms (1982), can introduce various undesirable effects on coefficients estimated through multiple regression analysis.Hence, it is essential for researchers to possess the skills to identify its presence.In this study, three distinct methods (VIF, RFE and Stepwise AIC) were examined.These methods aimed to select the appropriate variables for each assessed attribute, considering their respective depths (surface or subsurface).Furthermore, they aimed to evaluate the quality of the estimates obtained for these attributes using the Random Forest machine learning classifier.By employing these methods, researchers can make informed decisions regarding variable selection and assess the accuracy of attribute estimation achieved through the modeling process.

Hydropedological data spatial modeling
Based on database separation categorized according to soil layer (0.00-0.20 and 0.20-0.40m) and the variables under investigation (bir and Ksat), response variables were selected by calibrating three different spectral data dimensionality reduction methods and analysis.Each applied method (Table 3) yielded different output variables definitions (explanatory) that displayed significant positive or negative correlations (at a significance level of α = 5 %) with the physical-hydric variables.Based on the resulting findings (Table 3), most methods employed resulted in more than ten explanatory variables selection.Furthermore, these methods demonstrated a multicollinearity lack and strong interference in data correlation.Focusing on bir and Ksat estimates within the depth range of 0.00 to 0.20 m, it can be observed that the RFE selection method yielded the lowest number of response variables composing the model, with 11 and 14 variables estimated total, respectively.However, it is important to note that the methods employed different approaches in selecting covariates based on the four adopted criteria, thereby incorporating either the topographic or radiometric aspects of the study site.Figures 5a to 5d provide an analysis based on the RFE method to determine the optimal variables number that would result in the lowest residue in the Random Forest model (as indicated by the lowest RMSE).
Dashed lines in graphics depicted in figures 5a, 5b, 5c and 5d represent the optimal variables number required to achieve the lowest Root Mean Squared Error (RMSE) values when utilizing the Random Forest model.This criterion is specifically applied in analysis based on RFE method.Among the four attributes modeled, only the model based on vib 0.00-0.20 m attribute demonstrated lowest RMSE value with the smallest variables number (11) after 99 iterations (Figure 5a).The remaining three attributes -Ksat 0.00-0.20 m , Ksat 0.20-0.40m , and vib 0.20-0.40m (Figures 5b to 5d   All models reached an accuracy higher than 70 % (R 2 >0.70) in predicting the target attributes (Ksat and bir) at 0.00-0.20 and 0.20-0.40m soil layers (Table 4).These predictions were assessed using cross-validation criteria, indicating well-calibrated models within the RF method.The Random Forest model adjustment, across all three proposed methods, resulted in a RMSE error stability of approximately 300-350 random trees generated for the assessed physical-hydric variables (Figures 6 and 7).Among these RF models, the ones that exhibited the best accuracy in variables prediction, based on RFE method, are highlighted in colors in figures 6 and 7, with RMSE stability starting at 300 trees.
The Random Forest model's performance evaluation (response) during validation, where the trained models were compared to test values (30 % of the separated database), revealed significant discrepancies between predicted values (by model estimation) and the actual values (measured in situ).This discrepancy is further supported by low R-squared (R 2 test ) values obtained during the testing phase, which were below 20 % for the implemented methods based on criteria analysis (depths and independent variables), except for the bir variable assessed for the 0.00-0.20 m layer which recorded higher R 2 values.The R 2 values for bir variable assessed for the 0.00-0.20 m layer ranged from 28 to 48 % (Table 5).These results highlight challenges in accurately predicting the target variables using the implemented methods.
The Random Forest model's quality assessed the internal validation (Table 5) and revealed overall lower values for the applied evaluation metrics (RMSE, MAE and R 2 ).These results indicate that the adjusted models have less than 20 % explanatory variability for Ksat and bir attributes in the assessed regions.This finding can be attributed to the lack of data values or their low significance, specifically in the elevation values within the basin's lowland.This low significance is likely due to this region's planar curvature (Figure 8) and its proximity to Guanabara Bay, which experiences constant sedimentation processes as water from the basin flows out through the river mouth.The observed result may also be associated with training data overfitting or random selection of validation values.To solve this problem, increasing the estimated variables sample size while considering criteria such as soil classification and measurement points homogeneity in the study site can be a potential solution.Despite careful survey planning, it was not feasible to collect data in regions characterized by high human interference levels and dense forest cover.These landscape features are present in Cachoeiras de Macacu, Guapi-Açu, Guapimirim, and Itaboraí counties, which are influenced by the Guapi-Macacu basin.These factors can significantly impact the Ksat and bir values estimation for the 122 sampled points using pedotransfer models, as indicated in previous research (Santos et al., 2022).As a result, there may be inherent randomness in errors and discrepancies arising from external mapping sources.However, it is worth mentioning that the 122 sample points estimated achieved high accuracy and precision in the mapping process.
Considering the region's sedimentary composition, it is expected that the physicalhydric attributes exhibit varying values in areas with higher clay content, particularly dispersed clay.These areas are characterized by high activity, resulting from a balance between micro and macropores that favor capillarity and water retention.Additionally, the organic matter presence in the soils promotes the colloid formation, increasing the water particles specific contact surface and further enhancing water retention.
Conversely, regions with high sand content and thick, poorly structured materials exhibit lower values for these physical-hydric attributes.This is explained by the macropores prevalence, which reduces the soil's capacity to transfer water through capillarity and retain it long enough for aquifer recharge.
Moreover, the region with mineral soils in the basin (feldspar and associated quartz fractions), as indicated by the 1:50,000 scale geological map (DRM, 2019), exhibits higher soil density, leading to reduced infiltration processes and water capillarity.Values highlighted in the map follow bir and Ksat variables interval scale (Figure 8).Both variables exhibited lower values in the transition zone between profiles (surface to subsurface).This observation suggests a reduced capacity for infiltration and capillary flow, indicating lower water retention in the deeper basin soil profiles.
Conversely, the mountainous region surrounding the basin, encompassing Petrópolis, Teresópolis, and Nova Friburgo counties, displays the highest bir and Ksat values in subsurface profiles.This can be attributed to the dense vegetation, specifically the Atlantic Forest biome.These finding evidences these areas as priority sites with significant potential for water resource conservation within the basin.It is important to recognize that soil degradation and deforestation in these regions can contribute to water scarcity issues.Therefore, the preservation and restoration of these sites are crucial for maintaining water availability in the basin.
Considering the geological characteristics of the herein-assessed region, the Center-Northwestern fraction of the basin exhibits notable features in the aeromagnetic map, indicating a higher concentration of water due to the presence of deeper rocks.This specific area encompasses the Rio de Janeiro Petrochemical Complex (COMPERJ), which represents the region with the highest urban density and includes the flooded lowland areas of the basin.Consequently, these regions display distinct spectral responses.
The airborne gamma spectrometry map highlights elevated levels of Potassium in the Center-Northwestern and Center-Southeastern sections and a notable Potassium and thorium combination (high concentration values) on the surface, particularly in other regions of the basin.The flight conducted by CPRM revealed a dummy correlation effect, explicitly observed in the COMPERJ restricted area.This effect impacted on data interpolation and showed noise in the middle portion of the basin, characterized by a faint linear feature (tenuous lineament) extending towards the East-Western basin part.Nevertheless, this procedure was crucial to feasible the modeling process, as it required a complete matrix structure for value prediction using the Random Forest algorithm.

CONCLUSIONS
The three evaluated methods demonstrated distinct selection responses, resulting in a reduction of 53 initial input variables during the pre-modeling stage.Specifically, the number of variables was reduced to 30, 11 and 36 for Ksat at Z = 0.00-0.20 m; 17, 27 and 31 for Ksat at Z = 0.20-0.40m; 30,14 and 36 for bir at Z = 0.00-0.20 m; 17, 37 and 26 for bir at Z = 0.20-0.40m, respectively.The Iron Oxide index was the most frequently selected variable by the applied methods among all analyzed data set variables, except for RFE applied to bir and Ksat attributes in soil surface layers (0.00-0.20 m), where this variable was not returned.
The VIF and RFE methods yielded the smallest number of explanatory variables for the 0.00-0.20 m (RFE Ksat0.00-0.20 m = 11 variables; RFE bir0.00-0.20 m = 14 variables) and 0.20-0.40m (VIF Ksat0.20-0.40m = VIF bir0.20-0.40m = 17 variables) assessed variables layers, respectively.This reduction in data dimensionality by removing redundant variables contributed to more informative components for the overall predictive models.Consequently, the final models adopted for data modeling and spatialization (mapping), based on separation per soil profile and study variable, were VIF and RFE for Ksat and Stepwise AIC and RFE for bir, respectively, considering both surface and subsurface variables profiles.These models played a crucial role in selecting and reducing input data dimensions at the pre-modeling stage, especially for water resources digital mapping in the studied basin's soils.
In terms of modeling quality, RFE and Stepwise AIC methods consistently yielded the best results for Random Forest models, regardless of the study variable (RFE: Ksat 0.00-0.20 m = 14.33 % and bir 0.00-0.20 m = 47.35 %; Stepwise AIC: Ksat 0.20-0.40m = 10.88 % and bir 0.20-0.40m = 20.03%) and layer depth (applied to layers 0.00-0.20 m and 0.20-0.40m).The explained variability by these models ranged 10.88-47.35%, with the highest values achieved for bir at a depth of 0.00-0.20 m.However, it is worth noting that the RF models showed better adjustment for Ksat in both surface and subsurface profiles compared to bir, due to the higher spatial variability of the latter, which affects the algorithm modeling process.
The results provided a deeper comprehension of saturated water conductivity (Ksat) and basic soil infiltration velocity rate variability in the study area.The spectroradiometric and topographic data derived from Digital Elevation Models (DEM) integration led to more robust models, as evidenced by the high-quality digital soil physical-hydric attributes representation.Radiometry played a significant qualitative role in analyzing the soil particle sizes composition at the surface level.However, there were limitations that hindered a comprehensive analysis and prevented the determination of the potential of these data for numerical modeling using machine learning algorithms.
The approach herein adopted proved valuable in enhancing inherent relationship understanding among surface spectroradiometry, topography, soil composition, and water response at the soil surface, involving all the assessed databases.It raised important considerations regarding the selection of variables, showing how tenuous variables selection based on the Ksat and bir classification process is in the overall digital soil mapping context.

Figure 1 .
Figure 1.Study area location map and spatial distribution of the 122 points associated with hydropedological information about the assessed data set.Scale: 1: 300.000DEM spatial resolution: 20m Analytical Hillshade applied on DEM Gitelson et al. (2001);Hunt et al. (2011) Based on the correlation matrix, spectral indices demonstrated a moderate correlation with Ksat and bir attributes.This correlation can be either positive or negative.The following spectral indices show a positive correlation: Clay Minerals, EVI, Ferruginous Regolith, NDRE, NDVI, NLI, SAVI, VARI, and TDVI.On the other hand, Ferrous Minerals, GSI, and Iron Oxide exhibit a negative correlation.Regarding the terrain covariables, DEM morphometry and primary variables (Slope, Aspect, and Elevation) exhibited a weak to moderate positive correlation (0.4< R <0.6).Similarly, hydrology channel covariables (such as MRRTF and MRVBF) presented a weak to moderate, albeit inverse, correlation.In other words, as physical-water attribute values increase, these variables values decrease.This analysis aligns with the local relief features and suggests effective water drainage, primarily driven by surface and/or vertical runoff throughout the basin.This finding should be thoroughly investigated in the prediction map.Visibility, light, and hydrology covariables showed a weak to moderate correlation (0.2< R <0.4) when compared to the overall dataset.

Figure 4 .
Figure 4. Spectral variables are represented based on the PCA model's explanatory principal components, wherein (a) is the principal components' importance in PCA; (b) is the principal components' separation in PCA, with each contribution expressed in the right legend.
) were required 27, 14, and 37 variables, respectively, Rev Bras Cienc Solo 2023;47:e0220149 to achieve the ideal adjustment in RF model.Notably, the RFE model applied to vib 0.20- 0.40 m demanded the largest variables number in comparison to its explanatory set when compared to the other two implemented methods.Specifically, it required one more explanatory variable than the Stepwise AIC method evaluated for Ksat 0.00-0.20 m and vib 0.00- 0.20 m .Breusch-Pagan, Durbin-Watson and Shapiro-Wilk tests were applied to the database.These tests yielded significant values necessary for the null hypothesis (H 0 ) acceptance in favor of the alternative hypothesis (H 1 ) at a 5 % significance level (α = 5 %, thus, β = 95 % confidence level

Figure 5 .
Figure 5.The RFE method graphics (number of explanatory variables versus RMSE adjustment) of the assessed physical-hydric attributes and their respective layers: (a) Ksat 0.00-0.20 m ; (b) Ksat 0.20-0.40m ; (c) bir 0.00-0.20 m ; (d) bir 0.20-0.40m .The dashed line highlights the reach of the optimum number of variables selected through RFE.

Figure 7 .
Figure 7. Random Forest model adjustment graphics evidencing the amount of trees versus error minimization recorded for Ksat variable, based on the assessed layer: (a-c) bir 0.00-0.20 m ; (d-f) bir 0.20-0.40m ; and the three analyzed methods: (a,d) VIF; (b,e) RFE; (c,f) Stepwise AIC.

Table 1 .
Terrain covariables concerning the topographic physical-hydric and thermodynamic DEM's primary and secondary attributes

Table 2 .
Vegetation indices deriving from multispectral sensor's spectral bands of the Sentinel-2 mission
). Shapiro-Wilk test was utilized to assess data normality residuals (H 0 : normality of residuals versus H 1 : non-normality of residuals).Breusch-Pagan test was employed to assess variances in residues homoscedasticity (H 0 : equal varianceshomoscedasticity versus H 1 : different variances -heteroscedasticity). Lastly, Durbin-Watson test was applied to examine the correlation presence among residuals, which serves as a multicollinearity indicator (H 0 : autocorrelation among residuals equals zero -residues independence versus H 1 : autocorrelation among residuals different from zero -residues dependence).The test stage is crucial in ensuring that statistical assumptions will be fulfilled, especially considering that VIF and Stepwise models are based on regression analysis principles.The p-values obtained from tests were within the required interval in all methods (p-value < α = 5 %), indicating that no data corrections or treatments were necessary.The trained methods quality performance using 70 % of the database, which was previously separated through RF model, is shown in table 4.

Table 4 .
Random Forest model quality in physical-hydric data training stage based on the applied selection methods (70 % of the database)

Table 5 .
Random Forest model quality in physical-hydric data validation test stage (30 % of the database)

Assessed variable Applied method Quality metrics evaluated for Random Forest RMSE test MAE test R 2 test
Values based on Random Forest model validation of data separated at the test stage.