Optimized data-driven pipeline for digital mapping of quantitative and categorical properties of soils in Colombia

Soil maps provide a method for graphically communicating what is known about the spatial distribution of soil properties in nature. We proposed an optimized pipeline, named dino-soil toolbox, programmed in the R software for mapping quantitative and categorical properties of legacy soil data. The pipeline, composed of four main modules (data preprocessing, covariates selection, exploratory data analysis and modeling), was tested across a study area of 14,537 km located between the departments of Cesar and Magdalena, Colombia. We assessed the feasibility of the toolbox to model three soil properties: pH at two depth intervals (0.00-0.30 and 0.30-1.00 m), soil taxonomy (great group) and taxonomic family by particle-size, according to a set of 25 environmental factors derived from auxiliary layers of climate, land cover and terrain. As a result, we successfully deployed the proposed semi-automatic and sequential pipeline, yielding rapid digital soil mapping (DSM) outputs across the study area. By providing multiple outputs such as tables, charts, maps, and geospatial data in four main modules, the pipeline offers considerable robustness to support outcomes and analysis of a DSM project. Future studies might be interesting to expand on further machine learning frameworks for predictive modeling of soil properties such as ensembles and deep learning models, which have shown a high performance for DSM.


INTRODUCTION
Soil classification is a method for organizing and communicating knowledge and perceptions about soil properties. Soil maps provide a method for graphically communicating what is known about the soil properties spatial distribution in nature. Among the different approaches to generate those maps, there is a wide adoption of the digital soil mapping (DSM) framework. Incepted by McBratney et al. (2003), the framework has been widely applied to produce and analyze spatio-temporal patterns of soil properties according to environmental covariates such as climate, terrain, vegetation and land use.
The existing methods in DSM can be grouped into two main modeling types, conventional (statistical and geostatistical) and machine learning (ML). In the former type, a soil property is modeled as a linear relationship between the property and state factors, accounting for the deterministic portion of the total variation, and a spatially dependent stochastic portion by using kriging methods (Keskin and Grunwald, 2018). While scholars have proposed multiple geostatistical for soil mapping, they are computationally demanding if the sample size and/or the number of prediction locations are large (Cressie and Johannesson, 2008). Moreover, modeling non-linear relationships between soil properties and numerous cross-correlated environmental covariates are not straightforward and introduces additional challenges, e.g., estimating many parameters (Wadoux et al., 2020a). Unlike geostatistical methods in which the transformation of the original observations is often required to satisfy assumptions, ML algorithms do not assume the observations' distribution. In addition, they are more suitable for large area predictions and designed to handle non-linear relations and complexity found in soil data (Padarian et al., 2020).
Besides the contribution to soil research, ML-based methods have accelerated the production of first versions of digital soil maps at the national (Odgers et al., 2012;Akpa et al., 2014;Mulder et al., 2016;Padarian et al., 2017), continental (Hengl et al., 2015;Ballabio et al., 2016) and global (Hengl et al., 2017) extent. These maps are aligned to global efforts in storing, coding and harmonizing legacy soil data, especially soil profiles with soil analytical data (Arrouays et al., 2017). In Colombia, the Geographic Institute Agustin Codazzi (IGAC), through the Sub-Directorate of Agrology leads the inventory, study, analysis and monitoring of the country's soils for their management to support national land planning. As part of the Global Soil Partnership, IGAC has recently generated the first version of the national soil organic carbon map (Bolívar Gamboa et al., 2021) according to FAO's cookbook (FAO, 2018). Agrosavia recently launched the IRAKA platform, the first Colombian soil information system providing spatial data of multiple soil properties at the surface layer, from 0.00 to 0.20 m depth (so-called topsoil) across the Cundiboyacense high plateau (Araujo-Carrillo et al., 2021). This development roots from the concept of hybrid soil information system (SIS) by using both the techniques and data of traditional soil studies and DSM models and information technologies such as database management, ML-assisted modeling, and geographic web services.
To contribute to the optimization of SIS's for generating soil-related information in digital formats, this study aimed to test the hypothesis that a semi-automatic and sequential pipeline, compiling a variety of statistical and ML algorithms commonly used in soil mapping research, can offer robustness and rapid experimentation required for a DSM project. This involves the deployment of a pipeline in the software R, named dinoSoil-toolbox, composed of four sequential modules (data preprocessing, covariates selection, exploratory data analysis and modeling) across a large heterogeneous geographic area. To achieve this objective, we 1) tested the proposed toolbox for a rapid generation of spatial data of three soil properties: pH at two soil layers (0.00-0.30 m and 0.30-1.00 m), soil taxonomy (great group) and taxonomic family by particle-size from soil databases gathered and curated by IGAC in a mountainous valley terrain of the Momposine depression located between the departments of Cesar and Magdalena, Rev Bras Cienc Solo 2021;45:e0210084 Colombia; 2) informed the multiple formats (charts, tabular and georeferenced layers) of soil-related information outputs provided by the toolbox for their inspection and analysis; and 3) made a shareable and reproducible toolbox following open science principles with readable documentation, sample data, and code available in a public GitHub repository.

Study area
The area of interest (AOI) encompasses a surface of 14,537 km 2 located between the departments of Cesar and Magdalena, Colombia, between 8° 56' 11" and 10° 52' 4" north latitude and 73° 4' 56" and -73° 57' 39" west longitude (Figure 1). This area is part of the research program of land policy of IGAC's Sub-Directorate of Agrology. It has been subject to multiple soil surveys, most of them compiled by this program. With elevation values between -31 and 5364 m and an average air temperature of 18 °C, the dominant landscapes are mountains and floodplains.

Soil surveys
IGAC's Sub-Directorate of Agrology provided field and laboratory observational soil databases (so-called legacy soil data). Both datasets describe soil properties at a different level of detail but at the same unit of observation (profile). The former database contains 1858 profiles and refers to soil morphological and taxonomical data from soil-survey campaigns conducted in 2020. These campaigns were conducted for defining soil cartographic units across the study area. After defining these units, more detailed soil analytical data were collected and analyzed through laboratory methods. As for laboratory database, this one contains information on pH(H 2 O) (1:2.5), soil texture (sand, silt and clay percentage, Bouyoucos method), organic carbon (%, Walkley-Black chromic acid wet oxidation method), and bulk density (Mg m -3 , core method). Along with these databases, an additional database containing the location of soil profiles was also provided.

Auxiliary data
A total of 25 environmental covariates were constructed and harmonized from auxiliary layers provided by IGAC's Sub-Directorate of Agrology and external sources representing three soil forming factors: land cover, geomorphology, and climate (Table 1). Most of these covariates, 21 out of 25, were derived from a digital elevation model (DEM), which also defined a mapping resolution of 250 m. These multiple terrain parameters were used to represent the role of geomorphology in the spatial variability of soil properties and classes. A multiband raster stack containing all environmental information was generated, resampled, and harmonized on a 250 × 250 m pixel size grid over the AOI. Note a more detailed study using the proposed pipeline was conducted over the same study area but due to restricted access of the DEM used of 30 m, only the results with the open-access DEM are reported. While the community in general has access to 30 m DEM data, they often require post-processing according to the amount of missing data, i.e., holes in the study area. For this investigation, the restricted 30 m version was void-filled by DEM post-processing experts.

Proposed pipeline
The proposed semi-automatic pipeline comprises four main modules: data preprocessing, covariates selection, exploratory data analysis and modeling. These modules were built upon previous developments (code) provided by IGAC's Sub-Directorate of Agrology and adjusted according to the input data described above. Making decision processes are needed at some points along the workflow, mainly in the data preprocessing and covariates selection; therefore, we brought statistical criteria that, in conjunction with expert knowledge, support these decisions.

Data preprocessing
This step involves constructing data matrices containing soil profiles with each target property and related environmental layers. This procedure was conducted by extracting, i.e., intersecting the covariate values of each sample point. Then, each matrix is randomly divided using 70 % for model training and 30 % for test purposes (Guevara et al., 2018).
Before the above data extraction and partition, some data preparation procedures were conducted according to the type of the target property. An additional data preparation step is required for quantitative properties, such as pH, before the extraction procedure. This preparation refers to interpolate values at user-defined depths. It mainly requires reshaping field and/or laboratory databases from horizontal to long format. For the study area, pH was interpolated at two layers, 0.00-0.30 and 0.30-1.00 m. Interpolation is a common procedure in DSM to make the depth of the input soil dataset uniform. In this regard, a quadratic function of depth with equal areas (splines) method (Bishop et al., 1999) was implemented as part of the tool and used to interpolate pH at the target depths. For great group and family by particle-size, only those classes with at least five observations were retained following FAO (2018)´s good practices in DSM. After the above data preparation, two final procedures are considered for generating the input matrices per target property for modeling. The first procedure refers to automatically removing covariates with zero or close to zero variance. The second consists of detecting extreme values, i.e., outliers (values below the 5th percentile, or above the 95th percentile) per covariate. Users can handle observed outliers by different methods such as removal or interpolation with mean or median. For the input data of this study, we use the removal option.

Covariates selection
Recursive feature elimination with Random Forest (RFE-RF) was used to identify the optimal subset of covariates useful for prediction. After removing covariates with zero or near-zero variance, this algorithm iteratively eliminates the least remaining promising predictors based on an initial measurement of variable importance in a reference model, in this case, calibrated using the Random Forest algorithm (Guyon et al., 2002). The RFE-RF was set with a 5-fold repeated cross-validation, and variable importance was assessed by the mean decrease of Gini impurity (Guevara et al., 2018).

Exploratory data analysis
Conducted over the training input data matrices, these analyses are complementary since they determine possible relationships, interactions, or dependencies between the covariates. For the descriptive analysis, measures of centralization and dispersion are reported. Regarding the statistical analysis, normality tests and non-parametric tests are carried out. According to a consensus of the amount of legacy soil data for DSM (n >50), the tests of Lilliefors and Jarque Bera are suitable to assess the assumptions of normality in quantitative variables. The normality test should be ignored when using ML algorithms to handle non-linear relationships, e.g., random forest, support vector machine. Kruskal-Wallis (for multiple comparisons) or Wilcoxon-Mann-Whitney U (for single comparison) with the Bonferroni correction tests were used to statistically determine differences in the distribution of each covariate among the target soil properties.

Modeling
Model training: five algorithms were selected from different classes to model the target soil properties. Table 2 shows the list of ML algorithms considered with their family, the name stated in the R library, and the variables modeled. Only Random Forest, Extreme Gradient Boosting and Generalized Linear Models predict both quantitative and categorical variables. The support vector machine with radial kernel and cubist algorithms are specific to model pH as multilayer perception and multinomial logistic regression are categorical variables. Model evaluation: For selecting the best quantitative and categorical variables model, 10-fold repeated cross-validation was used from the training partition (Hengl et al., 2017). It is worth mentioning, each of these models is calibrated using a random grid search with a size of 20 for tuning its main parameters. The value of grid search might change according to the user's knowledge in the study area or modeled soil property. A larger number implies longer model training times, in particular for ML algorithms such as SVM.
It is then suggested to start with a reasonable number as we set in this research. Once the best model with its calibrated parameters is identified per target variable, the external performance is measured using the test dataset. The main metrics for assessing model performance in both stages were the root mean square error (RMSE) and coefficient of determination (R 2 ) for pH, and overall accuracy (OA) and kappa coefficient for the categorical variables. In addition to these metrics, other complementary ones were considered for the external validation according to the variable type. For quantitative variables, the metrics included the R-square (R2), coefficient of efficiency (COE), index of agreement (IOA), amount of variance explained (AVE) as reported by Araujo-Carrillo et al. (2021). For the categorical variables, the F1-score was incorporated in the assessment.
Model uncertainty: An essential aspect of DSM involves the measurement of the uncertainties of the ML-based predictions. The methods are different according to the variable type. We propose to derive estimates of uncertainty for quantitative variables by fitting a quantile regression forest model (QRF). As Vaysse and Lagacherie (2017) suggested, this method allows interpolating the response of the best model's residuals for each unobserved location. We use parallel computing at the pixel level to compute the standard deviation of predictions made with QRF. As this is a pixel-wise process, we can obtain a continuous surface of model uncertainty. Regarding the categorical variables, the uncertainty is measured by the scaled Shannon Entropy Index (H s ), which is calculated by using the probability maps of the target classes (Equation 1) in both great groups and taxonomic family by particle-size.
in which K is the number of possible classes; log K is the logarithm to base K and p k is the probability of class k. Ranging from 0-1, 0 indicates no ambiguity, and 1 indicates maximum confusion. This metric should not be confused with classification accuracy assessment as H s is an internal accuracy measure derived from the model and not based on a comparison of predictions with validation data, such as the OA and kappa metrics (Hengl et al., 2017).
Model use and predictions: it mainly consists in predicting over unobserved locations by using the best model fitted in the previous steps. The multi-band raster stack of covariates was used for predicting each target variable across the whole surface of the AOI.

Software and implementation
All the steps of the pipeline presented here were implemented in R software version 4.0.2. SAGA version 7.8.2 was used to generate the derivatives of DEM. The main R-libraries used according to the key steps of the pipeline are presented in table 3. We deployed these steps using the sample data on a machine with a 4-core 2.5 GHz Intel Core i5 CPU, 16 GB RAM and macOS.

Data preprocessing
Input data matrices were generated separately by the target variable. number of profiles decreased from 17 % (pH) to 59 % (taxonomic family by particle-size). The modeled classes for the taxonomic family by particle-size and USDA great group were 15 and 20, respectively.

Covariates selection
The number of optimal covariates variated by target soil property. Figure 2 shows an example of a diagram obtained using the RFE-RF algorithm for pH 0.00-0.30 m. Eight covariates are suggested as the optimal number of covariates (see the dot marker). For the remaining variables, the optimal covariates were 5, 3 and 7 for pH 0.30-1.00 m, great group, and taxonomical family by particle size, respectively. The following section provides a list of the covariates selected.

Exploratory data analysis
According to the statistical analysis for pH, the Lilliefors and Jarque-Bera tests indicate that this variable is not normally distributed at two layers. The Kruskal-Wallis reveals significant differences (p<0.0001) in the distribution between pH 0.00-0.30 m and two covariates (climate and terrain) ( Table 5).
The post-hoc test using Wilcoxon-Mann-Whitney U with the Bonferroni correction indicated the climate levels in which pH 0.00-0.30 m resulted significantly different. They were between humid warm and dry warm (p<0.001) as well as very dry warm and dry warm (p<0.05).

Table 3. Lists of key R-libraries used by the pipeline per step
Step

Modeling
Overall, Random Forest had the best performance among the five algorithms proposed to predict pH at two layers, great group and taxonomic family by particle-size classes. Figure 3 illustrates boxplots with the performance of the predictions models of pH 0.00-0.30 m and great groups using five trained algorithms (with different configurations, i.e., parameters).
From the different configurations of trained Random Forest algorithms, the one that performed best in cross-validation with the training partition was used for  independent evaluation with the test set. This assessment can be inspected through scatterplots and confusion matrices for quantitative and categorical variables, respectively (Figure 4).
From the above outputs, a set of evaluation metrics were computed for the target variables. According to the main metrics, RMSE and Overall Accuracy for quantitative    and categorical variables, respectively, the best model has error values of 0.63 and 0.68 for pH 0.00-0.30 m and 0.30-1.00 m, respectively; and accuracy values of 60 and 67 % for great groups and taxonomic family by particle-size, correspondingly.
To report the predictions' error, the uncertainty estimation was a key to highlight areas in which the best models tend to have a lower certainty. For instance, figure 5 shows maps of uncertainty and prediction of pH 0.00-0.30 m and great groups. For pH, the dark areas indicate a higher standard deviation of the residuals, which can be associated with a high error. For great groups, the interpretation remains different as the dark areas indicate a higher mix of the mapped classes. These mixed pixels can be related to the mapping unit, in this case, 250 m, which might be too coarse to discriminate against a dominant great group.

DISCUSSION
The proposed pipeline, tested over a considerably large area of 14,537 km 2 , shows its feasibility to handle multiple data-driven methodologies (from preprocessing to modeling) to generate information related to the spatial distribution of soil properties. According to a review on 150 studies about mapping soil properties using ML algorithms, the use of legacy samples as they were tested in the pipeline, is predominant for local and regional scale areas (about 10 4 km 2 ) (Wadoux et al., 2020a). For the deployment of the pipeline, 1858 soil profiles or 7.8 units per km 2 were considered. However, the initial density was reduced between 17 and 59 % after the preprocessing step per target variable. This number is still relatively high if compared with that found by Wadoux et al. (2020a), who reported an average sampling of 0.24 units per km 2 .
The recursive feature elimination algorithm, in this case, RFE-RF widely used in previous DSM studies (Shi et al., 2018;Gomes et al., 2019), recursively removes the least important covariates from the initial pool, with little or no decrease in model prediction accuracy. While the optimal subsets of environmental covariates yielded by the RFE-RF provided reasonable results for each target variable according to the mapping resolution of 250 m, further covariates representing multi-scale or temporal variation should be incorporated. The covariates have to represent only soil-forming factors and, where possible, a physical explanation to ensure an unbiased ML-DSM soil knowledge generation (Wadoux et al., 2020b).
After the selection of covariates, a complementary component of the pipeline is the exploratory data analysis. The statistical analysis is an added value to this component as it complements the interpretation of predictions obtained by the ML algorithms. For instance, the statistical analysis supported a thematic validation conducted over maps generated for the same soil properties but with the restricted access DEM. For either modeling exercises using open or restricted DEM, it was evident that pH 0.00-0.30 m has a significant response according to the climate levels. This evidence corroborates previous studies, which indicate soils from different climates have distinct soil. Specifically, climate can affect the process of soil chemical reaction and consequently influence soil pH (Zhang et al., 2019).
Regarding the algorithms assessed, Random Forest yielded the best performance consistently for the target soil properties. This particular algorithm is the most popular for modeling quantitative and categorical variables in ML-DSM (Wadoux et al., 2020b). It is worth mentioning, the random grid search for parameter tuning allows maximizing the performance of all trained models. For instance, the main parameters tuned for the RF algorithm were the min node size, number of predictors per division and partition rule. Additionally, to facilitate the interpretability of the best model, the pipeline returns charts of the covariate importance. For the RF algorithm, the most essential covariates are identified using the mean decrease in the variance of the response (Wright and Ziegler, 2017).
Rev Bras Cienc Solo 2021;45:e0210084 After selecting the best model, the assessment with the test partition indicates an acceptable accuracy according to studies in similar environmental conditions, i.e., tropics and input data. For instance, using legacy soil data collected across the Cundiboyacense high plateau, Colombia, Araujo-Carrillo et al. (2021) reported a RMSE of 0.57 for pH at the topsoil (0.00-0.20 m) and overall accuracy of 48 % for the taxonomic family by particle-size classes (9 groups) modeled at a spatial resolution of 125 m using the exact Random Forest implementation in R (ranger). While there were no studies conducted at similar conditions to compare results of the second depth (0.30-1.00 m), the RMSE is close to the first depth. Wadoux et al. (2020a) indicate most of the existing ML-DSM studies (around 70 %) predicted a soil property or class for a single depth (topsoil). The existing studies at multiple depths mostly focus on modeling soil properties in temperate climates (Lacoste et al., 2014;Viscarra Rossel et al., 2015).
It is worth mentioning the observed performances of the best predictive models are considered poor. One of the potential explanations for poor performance is the quality and robustness of datasets which are of greater importance than the classifier itself (Meir et al., 2018). For reproducibility purposes, we intentionally deployed the tool using coarse spatial datasets, harmonized on a 250 × 250 m pixel size according to the spatial resolution of the input DEM. Cavazzi et al. (2012) claim the landscape complexity of the study area can play a pivotal role in choosing the optimal resolution for modeling soil properties. The authors found varied morphological areas with abrupt changes in topography, like the Momposine depression in Colombia, can yield better prediction results with fine resolutions (30 m). We, therefore, expect the predictive models can be improved with a higher resolution DEM. In addition to the spatial datasets, the quality of the legacy soil data might impact models' performance. In figure 4, the scatter plot between observations and predictions of pH at 0.00-0.30 m shows a poor performance in predicting extreme pH values and minor classes for great groups. Taking the imbalance distribution into account would improve the performance of the predictive models. In this regard, some scholars have proposed approaches to tackling imbalanced datasets from low-quality legacy soil data, few of them tested across tropical environments (Hounkpatin et al., 2018).
Additional to the feature of modeling at multiple depths, the pipeline includes the estimation of uncertainty according to the variable type. As suggested by Hengl et al. (2017), maps of uncertainty could be potentially very useful for planning new soil surveys. Finally, in terms of computing performance, table 7 reports the processing times of the pipeline for modeling pH 0.00-0.30 m from the covariate selection step with and without parallel processing. This parallel processing feature provides substantial gains for both covariates selection and modeling steps.

CONCLUSIONS
This investigation successfully deployed a semi-automatic and sequential pipeline, programmed in the R software, named dinoSoil-toolbox, to generate soil-related information in digital formats. As it was shown using legacy soil data from a mountainous valley terrain in Colombia, the proposed pipeline facilitates a rapid mapping of quantitative and categorical soil properties. By providing multiple outputs such as tables, charts, maps, and geospatial data in four main steps, the pipeline offers considerable robustness to support outcomes and analysis of a DSM project. These components are aligned to the recommendations by Wadoux et al. (2020a) of plausibility, interpretability, and explainability in ML-DSM developments that enable soil scientists to couple model prediction with pedological explanation and understanding of the underlying soil processes. Furthermore, we released the pipeline on a public GitHub repository (https://github.com/ acocac/dinoSOIL-toolbox) with readable documentation and facilitating reproducibility by making available the input data presented in this research.
Future studies might be interesting to implement further ML frameworks such as ensembles and deep learning models, which have shown a high performance for DSM. Moreover, while the pipeline was designed under an open science framework for users with relative knowledge in R, it would be helpful to design a GUI that facilitates its use for non-programming experts.