Fine-scale soil mapping with Earth Observation data: a multiple geographic level comparison

Multitemporal collections of satellite images and their products have recently been explored in digital soil mapping. This study aimed to produce a bare soil image (BSI) for the São Paulo State (Brazil) to perform a pedometric analysis for different geographical levels. First, we assessed the potential of the BSI for predicting the surface (0.00-0.20 m) and subsurface (0.80-1.00 m) clay, iron oxides (Fe2O3), aluminum (m%) and bases saturation (V%) contents at the state level, which are important properties for soil classification. In this task, legacy soil samples, the BSI and terrain attributes were employed in machine learning. In a second moment, we evaluated the capacity of the BSI for clustering the landscape at the regional level, comparing the predicted patterns with a legacy semi-detailed soil map from a smaller reference site. In the final stage, the predicted soil maps from the state level were investigated at the farm level considering several sites distributed across the São Paulo state. Our results demonstrated that clay and Fe2O3 reached the best prediction performance for both depths at the state level, reaching a RMSE of less than 10 %, RPIQ higher than 1.6 and R of at least 0.41. Additionally, the predicted landscape clusters had a significant association with the main pedological classes, subsurface color, soil mineralogy and texture from the legacy semi-detailed soil map. Illustrative examples at the farm level indicated great capacity of BSI in detecting the variations of soils, which were linked to several soil properties, such as texture, iron content, drainage network, among others. Therefore, this study demonstrates that BSI is valuable information derived from optical Earth Observation data that can contribute to the future of soil survey and mapping in Brazil (PronaSolos).

In this regard, Earth Observation (EO) data are explored as model predictors due to their widespread availability and ability to cover large geographical areas. However, when employing DSM over huge areas with high or medium resolution EO data, the project may be limited by the high computing processing requirements and storage capacity. Especially in large countries, DSM has important obstacles to overcome. We can also mention the practical and economical limitations of sampling strategies that seek to improve the representation of soil diversity, as well as the lack or poor relationship of some environmental information with soils when calibrating prediction models.
Generally, EO products have been used as covariates for inferring soil spatial distribution, despite the continuous vegetation cover in both arable lands and natural forests (Chabrillat et al., 2019). For example, the Manual Técnico de Pedologia (Technical Manual of Pedology), which describes some guidelines for soil survey and mapping in Brazil, indicates that remote sensing products are useful for soil mapping if coupled with the intensification of field surveys (IBGE, 2015). However, different from this view, multitemporal collections of satellite images increased its value in recent years, as these products provide useful insights related to soil properties (Chabrillat et al., 2019). This can support both the quantitative and qualitative interpretation of soils at the landscape level. However, multitemporal EO data in the optical domain can only assist the soil surveying and mapping if coupled with appropriate transformation or preprocessing techniques.
Many automated processing algorithms have recently been developed to produce soil reflectance maps from optical EO data (Diek et al., 2017;Demattê et al., 2018;Rogge et al., 2018). In Brazil, the Geospatial Soil Sensing System (GEOS3) has been used to map soil properties Gallo et al., 2018;Poppiel et al., 2019a) and infer the spatial distribution of soil and lithological classes Poppiel et al., 2019b;Rizzo et al., 2020). The algorithm is based on the retrieval of historical soil exposures to the satellite measurements, which are aggregated into products namely Synthetic Soil Image (SYSI) and Soil Frequency (SF). These products reveal the direct multispectral response of soils and the cumulative alterations of soil surface for a given location, which can be correlated with soil properties and land intensification , respectively. However, despite the importance and developments in the area, there is a lack of knowledge regarding the impact of such analysis in multiple geographic levels. Our hypothesis is that a bare soil image derived from multitemporal collections of satellite images can be employed in pedometric analysis at different geographical levels, despite its standard spatial resolution of 30 m.
This paper aimed to produce a direct representation of soil reflectance for the São Paulo State in Brazil using the historical Landsat image collections. With the spectral features derived from EO data, we aimed at assessing the relationship of the bare soil reflectance with soil properties commonly used for soil taxonomic classification at the state level, i.e., clay, iron oxides (Fe 2 O 3 ), aluminum saturation (m%) and bases saturation (V%), Rev Bras Cienc Solo 2021;45:e0210080 by calibrating machine learning models for spatial prediction. Further, we also aimed to study soil reflectance images' potential for clustering landscape units over a reference site, comparing its results to a legacy semi-detailed soil map. Finally, the medium-resolution maps were assessed on different farms from São Paulo state. Thus, this study explores the potential of medium-resolution products derived from EO data for assisting the soil survey and mapping in Brazil (PronaSolos).

MATERIALS AND METHODS
We performed the study across the state of São Paulo, Brazil, mainly in agricultural areas due to the availability of bare soils from cropped lands. We produced a bare soil composite for the state level using the historical collections of Landsat images. We used this product to assess the major surface spectral patterns across the state and evaluate the quantitative relationship with some soil properties used for soil classification. First, we explored and discussed the method and its impact on the state level. In a second moment, we explored the bare soil composite over a smaller reference area near the Piracicaba municipality. In this step, we compared the soil units of a semi-detailed legacy map with landscape units generated from a clustering method. Then, we explored the results at the farm level to complete our assessment. In the next sections, the geographical areas and statistical analysis are described.

Soil properties mapping at the state level
The major region explored in this study is the Brazilian state of São Paulo, which represents an area of 248,222 km 2 (Figure 1). São Paulo has a relatively high relief, with 85 % of its landscape between 300 and 900 m above sea level. In the region, seven distinct climate types appear, being predominant the tropical climate of altitude (CWA) and savanna climate (AW). The CWA is characterized by rainy summer and dry winter, while the Aw is warm and has a minimal rainfall (Alvares et al., 2013).
About 30 % of São Paulo's territory is composed of a crystalline basement, dominated by igneous (granite) and metamorphic rocks. The remaining 70 % consist of deposits of the Paraná and Bauru basins, which resulted in sedimentary rocks from the glacial-interglacial cycle that occurred in the southern portion of the Gondwana paleocontinent. The Paraná basin was the scenario of intense tectonic-magmatic processes during the Mesozoic to Cenozoic period, resulting in basaltic flows, dike swarms and several alkaline complexes. Conversely, the Bauru basin is represented by continental sandstones and constitutes about 45 % of the territory, corresponding to a major part of the substrate of the Western region of the state. Intracontinental and coastal Cenozoic basins, as well as Quaternary sedimentary deposits, also occur in minor proportions (Garcia et al., 2018).
The pedodiversity is also noticeable. Based on the World Reference Base (WRB) soil classification system (IUSS Working Group WRB, 2015), Lixisols and/or Acrisols predominate in more undulating topography, mostly in the Western part. Ferralsols, Arenosols and Eutric Nitisols are mostly found in gentle slopes in relatively geomorphological stable positions, similar to what is found in some portions of the Western and Central regions of the state. Conversely, Cambisols, Leptsols and Regosols are typical of unstable positions of the landscape, like hilly slopes and floodplains. Gleysols, Podzols and Histosols are inherent in coastal plains and wetlands, where the water table is persistent (Oliveira et al., 1999;Rossi, 2017).

Bare soil reflectance and terrain attributes
The algorithm Geospatial Soil Sensing System -GEOS3  -developed inside the Google Earth Engine, was used to generate the multispectral bare soil composite from 1982 to 2018 using the Landsat collections of 30-m-resolution surface reflectance images (USGS, 2018a,b). The GEOS3 is a data-mining algorithm that extracts soil features from the historical collection of satellite images and aggregates the spatially sparse bare soil fragments into a denominated as Synthetic Soil Image (SYSI). The SYSI is the reflectance image, composed of six spectral bands: blue (450-520 nm), green (520-600 nm), red (630-690 nm), NIR (760-900 nm), SWIR1 (1550-1750 nm) and SWIR2 (2080-2350 nm). A set of rules was used to identify bare soil pixels in satellite images, based on spectral indices coupled with quality assessment bands to remove cloud, cloud shadow, inland water, snow, photosynthetic vegetation and non-photosynthetic vegetation (crop residues). Bare soils were flagged when the pixel value had the Normalized Difference Vegetation Index (NDVI) values falling between the range of −0.05 and 0.30 (masking out green vegetation), and Normalized Burn Ratio 2 index (NBR2) values between the range of −0.15 and 0.15 (masking outcrop residues). The pixels identified as bare soil were marked as valid for aggregating the multitemporal collection by their median value. Thus, GEOS3 produces an almost-continuous representation of surface reflectance, increasing the mapped area of the soil surface by averaging the multitemporal measurements. Detailed information about GEOS3 and SYSI, as well as spectral indices and sensitivity analysis, can be found in previous studies Fongaro et al., 2018;Gallo et al., 2018). In addition, the Digital Elevation Model (DEM) derived from the Shuttle Radar Topography Mission (SRTM) with a resolution of 30 m was used to calculate the slope, northness, eastness, horizontal curvature, vertical curvature and shape index of terrain (Safanelli et al., 2020a). This terrain information was used as additional covariates in prediction models.
In this study, a few key soil properties explored by the Brazilian Soil Classification System -SiBCS (Santos et al., 2018) were selected for further analysis: clay content, total iron content (Fe 2 O 3 ), base (V%) and aluminum saturation (m%). Clay content is used, for example, to evaluate the textural gradient between the surface and subsurface horizon, Rev Bras Cienc Solo 2021;45:e0210080 which is a diagnostic attribute of Lixisols. The iron oxides content (total iron) is a diagnostic attribute for classifying soils at the third categorical level, which is related to color and mineralogical features. The saturation of aluminum or bases (Ca 2+ , Mg 2+ , K + , Na + ) in the sorption complex of soils are used to distinguish the soil chemical fertility at the third categorical level of SiBCS.
Soil samples had their chemical and particle size analysis following Brazilian standards. The densimeter method was used to obtain the soil particle size distribution (clay, silt and sand), whereas the chemical properties (V% and m%) and Fe 2 O 3 were determined by ion-exchange resin method and sulfuric acid extraction, respectively, according to Raij et al. (2001). Although the soil data is composed of several legacy surveys performed in the last years, which can affect the performance of prediction models, they still provide a baseline of the soil spatial distribution across the São Paulo State.

Soil properties prediction models
Prediction models of clay, Fe 2 O 3 , V% and m% for the 0.00-0.20 and 0.80-1.00 m layers were applied in the study area. The Random Forests (RF) algorithm, which is an ensemble of many regression trees built with bootstrapped samples and randomized predictors (Breiman, 2001), was chosen to calibrate the soil prediction models. For this, we performed a grid search for defining the best hyperparameters for each soil layer and soil property. The number of trees of the forest (FS), the number of random predictors tested at splits of each tree (nRP), and also the minimum number of samples at the tree leaves (minSL) were tested.
The training set was determined by collecting bootstrapped samples of each regression tree, whereas the testing set was defined by the remaining samples (out of the bag) that were not selected at each bootstrapping. Bootstrapping is a method that samples observations with replacement to the same size of the dataset, covering, in general, around 67 % of the entire dataset. We applied the bootstrapping method up to 500 times, which is equivalent to the maximum number of trees tested for the random forest. The performance metrics were determined by the mean value calculated across the collection of bootstrapped trees for both the training and the test sets. The minimum Root Mean Squared Error (RMSE) of the training set was used for defining the best hyperparameters (FS, nRP, and minSL). The range of values tested for FS was 30, 60, 100, 200 and 500 trees. The values of 3, 5, 8, 11 and 13 predictors were investigated for defining the nRP: SYSI blue, SYSI green, SYSI red, SYSI NIR, SYSI SWIR1, SYSI SWIR2, elevation, slope, northness, eastness, horizontal curvature, vertical curvature, and terrain shape index (Table 1). Further, the values of 10, 20, 30, 40, 50, 100, 200 and 500 observations were tested for minSL.
The final model performance was evaluated by applying the optimized models to the test set. The prediction accuracy was evaluated using the RMSE. The Coefficient of Determination (R 2 ) was calculated to assess the explained variance of the prediction models, and the Ratio of Performance to Interquartile range (RPIQ) was calculated to assess the consistency between the predicted values with the variability of soil observations (Safanelli et al., 2020a). The evaluation performance was defined by averaging the metrics from the out-of-the-bootstrapped estimates.

Landscape units at the regional level
At the second stage of multilevel analysis, a smaller reference site was selected near Piracicaba municipality due to the availability of a semi-detailed soil map and expressive soil variability. The Piracicaba region is characterized by a complex geology, diverse topography, and the occurrence of different soil types (Figure 1). The region is classified as humid subtropical by the Koppen's classification system, characterized by dry winter and hot rainy summers. The average annual rainfall is approximately 1273 mm yr −1 and the average annual temperature is 24 °C. The landscape is gently undulating and rolling uplands, with slopes between 0 and 42 % (Gallo et al., 2018).
The geology includes fine to medium, reddish, silty-clay sandstone of the Botucatu and Pirambóia Formation; dike elements of basic intrusive diabase rocks of the Serra Geral Formation; sandstones from the Irati Formation (Passa Dois Group), where the relief is gently undulating to flat; and siltstones, that are the predominant lithology of the Itararé and Tatuí Formation (Tubarão group) . Thus, the soil variability is mainly due to the complexity and diversity of the parent material. Soil derived from diabase rocks, such as Ferralsols, are situated in the highest position of the relief, while Nitisols occur on smooth to gentle undulating relief. Soils derived from sedimentary rocks include Leptsols, Arenosols, Lixisols, which may be associated or not. On the flat relief near to the drainage system, Planosols and Gleysols are found (Gallo et al., 2018). In Piracicaba region (Figure 1), a smaller reference site (area of ~350 km 2 ) was selected to study the relationship of bare soil composite with soil mapping units of a semi-detailed pedological chart of the São Paulo state that corresponds to Piracicaba region. This map is a result of several surveys carried out in the area, providing the boundaries of soil units as a vector layer with a 1:100,000 scale (Rossi, 2017).

Principal components and k-means clustering analysis
The SYSI of the reference site was transformed by a principal component analysis (PCA) to reduce the spectral dimension and produce uncorrelated components. The first three components that accounted for 99 % of the spectral variability were employed to k-means clustering analysis to produce spatial units defined by their multispectral reflectance. The optimal number of clusters was defined when a decrease of the Akaike Information Criteria (AIC) was equal to or lower than 10 % from the range of 1 and 20 clusters. The AIC criteria was selected to define a parsimonious model for clustering the reference site, i.e., it was used to define the simplest landscape delineation with the least assumptions and number of clusters, but with the greatest explanatory capacity . The mean reflectance of each cluster was calculated for spectral characterization.
The landscape units defined by k-means clustering were compared to the legacy soil mapping units using the correspondence analysis, which evaluates the association of two categorical groups. The Chi-squared test was applied to assess the association strength between clusters and soil mapping units, and the relationship between the levels of both groups was evaluated from a contingency table and the respective diagram of principal coordinates. Thus, the relationships can be statistically and visually assessed .
For this analysis, the soil mapping units were interpreted according to their predominant soil class since a few units were composed of soil associations with ranked abundance. The dominant taxonomic class was decomposed in interpreted information representing the soil type, color, fertility, iron oxides content, and subsurface texture. The decomposition followed the first, second, and third levels of SiBCS, with the subsurface texture being used as an additional soil property. The levels from the SiBCS classification are based on the presence or absence of soil characteristics, diagnostic horizons or properties that can be identified at the field, which are resulted from a set of dominant processes that acted during soil formation. In table 2, we present the original and translated categorical levels decomposed from the dominant soil taxonomic class of soil units.

Farm-level assessment
Farm-level assessment was performed by linking the aforementioned products (SYSI, terrain attributes, and soil properties maps) with field observations. We went to eight sites distributed across the São Paulo State to qualitatively assess the bare soil image and its derived products. First, we wanted to evaluate the performance of the traditional toposequence sampling strategy for depicting the soil variation across the landscape. Without using the products derived from EO data, we distributed and made field observations with an auger in three different depths (0.00-0.20, 0.40-0.60 and 0.80-1.00 m) with subsequent soil analysis. This evaluation allowed the comparison of the toposequence sampling strategy with the SYSI and its derived maps.
In another stage, a different approach was tested. In this case, SYSI and its predicted maps were used to allocate observation points based on their spatial patterns. This would make it possible to detect features related to the drainage network detected by SYSI, which could be linked to redoximorphic features of Gleysols. Other sites with high variation of clay content at the farm level were also observed in situ to assess the discrimination potential of SYSI for smart sample allocation and soil survey. In the end, two detailed maps were delineated using all information available. This final analysis was decisive to complete the multilevel analysis of products derived from EO data.

Spectral features from EO data
The SYSI with 30 m resolution produced over the São Paulo State reveals different patterns that are related to soil and rock types ( Figure 2). In the true color composition (Figure 2a), SYSI is presented with brown shades that resemble the true color of soils. The false color of SYSI (Figure 2b), composed in part by infrared bands (R: SWIR1 G: NIR, and B: red), allows better visualization of the soil variability at the state level ( Figure 2c). It is also worthy to note the Southeastern region of the state is barely mapped with bare soils, possibly related to permanent land uses, such as natural forests, and places where metamorphic rocks prevail (Figure 2d).

Soil properties prediction at the state level
When SYSI was employed with terrain attributes for calibrating prediction models of soil properties, the visual aspects confirmed its association to clay and Fe 2 O 3 content ( Table 3). The aluminum and base saturation of the sorption complex of soils presented unsatisfactory performance (either R 2 <0.4 or RPIQ <1.4, and RMSE around 20 %) for 0.00-0.20 and 0.80-1.00 m layers. For clay and Fe 2 O 3 , the surface layer had the best performance metrics: the lowest RSME and highest R 2 values between the layers, reaching an explained variance (R 2 ) between 0.41 and 0.69, and RMSE of less than 10 %.
The visual aspect of the predicted maps of clay and Fe 2 O 3 (Figure 3) followed the same patterns of SYSI and rock types (Figure 2). The dark shades of both SYSIs had higher estimates for both clay and Fe 2 O 3 contents in both predicted soil layers. The places that predominate igneous rocks are also associated with regions of high clay and Fe 2 O 3 contents ( Figure 2). The distinct predicted layers also reveal that subsurface had higher Table 3. Performance evaluation of soil attributes prediction using out-of-the-bag validation (1) Fe 2 O 3 (%): total iron oxides content; m (%): aluminum saturation; V (%): bases saturation.
(3) n: number of available samples per soil attributes and layer.
(4) RMSE: root mean squared error. estimates for both soil attributes (Figures 3b and 3d), as demonstrated by the slightly darker shades of subsurface maps (0.80-1.00 m).

Landscape discrimination at the regional level
From the visualization of the state level, we went further in detail to understand the spatial patterns at regional level. However, it is worth mentioning that the spatial resolution of the bare soil image and its products is 30 m, changing only the visualization level. The SYSI of the reference site was firstly processed by PCA to reduce the spectral dimension and produce uncorrelated features. The first three components accounted for almost 99 % of the spectral variability, which were influenced by the overall intensity, the effect of the infrared region, and the contrasting effects of correlated bands.
In the clustering analysis, the first three components promoted the identification of seven spectral groups across the reference site. The spatial distribution of cluster revealed a relationship with some soil taxonomic classes found in the reference site (Figure 4). The regions dominated by Ferralsols (LV) and Nitisol (NV) coincided with the ones with darker shades in the true color composite of SYSI (Figures 4a and 4b) and clusters with low reflectance intensity (Figure 4c). Conversely, the high predominance of Regosols (RL) and Lixisols (PVA) in the opposite side of Ferralsols coincided with brighter colors of SYSI and the remaining clusters. The soil types of high reflectance intensity are situated in the lower parts of the reference site, while the Ferralsols and dark-colored soils prevail on the higher portions of the landscape (Figure 4a).
The spatial association of the spectral clusters with the pedological map is confirmed (p<0.01) by the correspondence analysis ( Figure 5). For the first classification level, the soil units composed of Nitisol, Ferralsol and Gleysol had a significant association to clusters number 2 and 3 (Figure 5a), which presented a lower reflectance intensity and are situated in the darker shaded area of SYSI (Figure 4b). Soil units composed of Cambisol, Regosol, Lixisol and Phaeozem are associated with clusters 4, 6 and 7. A similar association was found for the second categorical level, where the soil color and other general properties are defined (Figure 5b). Reddish soils are linked to cluster 2 and 3, while yellowish-red soils are linked to groups 5 and 7. The lithic and general qualifiers (Haplic) were grouped in the middle of the diagram close to clusters 1, 4 and 6 ( Figure 5b). For the qualifiers of soil fertility and iron content (Figure 5c), the third categorical level, it is possible to find an association of soils with high Fe 2 O 3 content to clusters of numbers 2 and 3, which have a lower reflectance. Finally, the subsurface texture also reveals the grouping of high clay content soils linked to the spectral groups 1, 2 and 3 ( Figure 5d).

Farm-level assessment
The illustrations and observations at the farm level revealed that SYSI and its related products provided great assistance to soil survey and mapping, despite its limited resolution ( Figure 6). Figure 6a demonstrates that regular satellite images or aerial photographs present vegetation patterns, restricting the visualization of the soil surface. In a first example, after SYSI generation (Figure 6b), a drainage pattern was observed using the true color composition, since a pale color was detected within a reddish soil zone. The field observation performed at that site confirmed the presence of redoximorphic features (Figure 6c). Figure 6d represents an apparently homogeneous area by the regular satellite image. However, the SYSI false-color composite (Figure 6e) depicted the spatial variation across the entire area and the clay content could be quantified by a prediction model, which followed the same patterns of SYSI ( Figure 6f). As ratified by field observations, the area with light magenta color were classified with low clay content, while the darker ones were estimated with higher clay content. It is important to note that there Rev Bras Cienc Solo 2021;45:e0210080 was no continuous trend of these patterns along the terrain, and thus, the traditional inference of conventional soil mapping could fail in detecting those soil transitions. Indeed, as observed from figures 6d and 6e, the classical delineation following the terrain patterns is limited, while the information derived from EO data allows qualitative and quantitative analysis.
A more impacting change about the terrain variation is indicated in figures 6g and 6h.
As it represents a smaller area, the clayey soil does not follow the terrain trend, as the continuum is broken by a sandy soil. If a pedologist demarked the toposequences 1 and 2, he/she could surely detect the soil variation, but could not detect the same transition in position 3. Another good example is provided from figures 6i to 6l.
A strong relief characterizes the site represented in figures 6i to 6l with variable lithology (shale, sandstone, and basalt). With a regular aerial photograph or satellite image (Figure 6i), it is impossible to detect most of the landscape variations due to vegetation  and residues cover. When the elevation gradient is used (Figure 6j), a contrasting relief is revealed. However, only with the bare soil composite that a synoptic view of the soil landscape can be generated (Figure 6k), which brings an additional comprehension about the soil types. Integrating terrain information with SYSI patterns make possible the delineation of soil types with higher precision, as expressed in figure 6l with the resulting detailed map.
Figures 6m, 6n and 6o illustrate the differentiation of soils in a relatively homogenous area with flatter terrain. In this situation, the soil surface is depicted only by the bare soil reflectance, making it difficult for a pedologist to work with only elevation gradients. In fact, the locations with question marks reveal the places with patterns that seem to be related to the same terrain positions.

Predictions at the state level
The darker shades from both the true color and false composites of SYSI (Figure 2) illustrated the soils with a lower reflectance intensity in the visible channels (blue, green and red), which are related to clay, iron oxides and organic carbon . In this sense, the spatial variation revealed by the pedological map ( Figure 2c) presented a certain degree of association with the true and false-color compositions of SYSI, where the darker shades were linked to Ferralsols and the brighter ones to Acrisols, Lixisols, or Arenosols. Besides that, it is evident the correlation of SYSI darker shades to igneous rocks, in agreement with study of Bonfatti et al. (2020).
When SYSI was employed in machine learning, the performance metrics (Table 3) confirmed its association to clay and Fe 2 O 3 for both soil layers. These satisfactory results for clay and Fe 2 O 3 are in accordance with previous studies that used bare soil images and other environmental information for DSM (Fongaro et Safanelli et al., 2020a). In addition, the visual correspondence of the predicted maps of clay and Fe 2 O 3 with rock types (Figure 2) confirmed the association of soil reflectance with soil minerals and its granulometric composition (Dalmolin et al., 2005), which are important characteristics used for soil taxonomic classification. The poor prediction performance of chemical properties (m% and V%) found in this study is also present in other studies (Poppiel et al., 2019a;Rizzo et al., 2020;Safanelli et al., 2021). This can be linked to several factors, such as an insufficient calibration dataset, the spatial-temporal variation of these soil attributes, and the weak relationship of the environmental predictors tested in this study with the chemical soil properties.

Landscape patterns at the regional level
Principal component analysis was used before the clustering analysis to compact information and produce uncorrelated features, similar to what other studies have explored (ten Caten et al., 2011;Viscarra Rossel et al., 2011). Overall, the results of PCA followed the expected effects of iron oxides at the NIR and red bands, while soils with higher sand content were similarly grouped in high reflectance clusters (Dalmolin et al., 2005). In addition, the optimal number of spectral clusters was close to previous studies that similarly explored spectral segmentation methods (Viscarra Rossel et al., 2016;Demattê et al., 2019;Dotto et al., 2020). The clustering analysis applied to the regional level confirmed that reflectance patterns is a valuable proxy for soil discrimination and quantitative analysis (Nanni et al., 2004;Bellinaso et al., 2010).

Soil landscape patterns at the farm level
Field observations confirmed that SYSI can express the irregular soil variations across the landscape ( Figure 6). It is worth noting that there is a gain of information from the regular satellite images or aerial photographs, which is increased by the terrain attributes and completed by a bare soil composite. However, the bare soil images do not give the last word, as the final representations of soil types are better achieved when merging all the available information. The detailed soil maps represented in figure 6 were delineated by integrating field observations, terrain attributes, and SYSI.
If the bare soil technique were not used, as demonstrated in figure 6, many places could probably not be detected by the pedologist in the field. This happens because the pedologist uses his/her tacit knowledge with terrain inference, while the bare soil image can produce a synoptic view about soil surface reflectance. I turn, the main limitation is that SYSI only represents the surface variation, and for some specific soil types, the subsurface can be contrasting. Thus, soils with textural gradient, such as Lixisols or Acrisols, could not be detected only with surface data.
Another point that can be discussed is that the bare soil images can provide information about soil variability before going to the field, assisting in this case, to the allocation of more assertive observations. This can save time, resources and improve the representation of soil variability of the final maps. Some soil spots revealed in figure 6 demonstrated that the surface variations are irregular and not detected by terrain patterns. Thus, the illustrations of figure 6 and figure 3 of SD confirm that smarter sampling strategies using bare soil images or derived products could improve the soil survey and soil mapping.

Some considerations regarding the PronaSolos
Brazil, as a continental country, requires detailed soil information for its sustainable development. In this aspect, PronaSolos has gained attention and can become one of the most outstanding projects of soil survey and mapping, which will impact the future of Brazilian agriculture and other environmental sectors (Polidoro et al., 2016). Detailed and updated information on soils brings a myriad of benefits to our society. Our results reveal the potential of an effective method to depict the soil spatial variability using computational systems and EO data by producing a synthetic soil image (SYSI). In this sense, capacitation courses (e.g., Universidade PronaSolos), free and open access datasets and products (e.g., febr) and software (e.g., Python, R), collaborative work and others, will be necessary to increase the assimilation of these technologies in future works Samuel-Rosa et al., 2020;Costa et al., 2021).
Another point that must be addressed is the integration of new technologies and products with the legacy methods of soil survey and mapping. Since the soil reflectance comes from EO satellite sensors, the observation of the soil surface across the landscape becomes synoptic. In the 1960s, EO was performed by aerial photographs that indicated the differences between soils without looking at the subsurface but making inferences from the relief. In the 1970s, pedologists from Radam project used radar sensor on board of airplanes to relate the image results to soil variability (IBGE, 2015). Thus, the present technique based on multispectral reflectance is an evolution of these legacy methods and can be assimilated or integrated. The multitemporal reflectance of SYSI, which is closely related to several soil properties and attributes, is an effective proxy that brings direct information of clay, iron and other relatively stable soil properties.
In turn, one also shall take into consideration the subsurface layers, for sure the most challenging target for EO satellite sensors. However, it is absolutely clear that EO products associated with field information can improve soil survey and mapping. In a recent study, Rossiter et al. (2021) performed a study to evaluate the predictive soil mapping (PSM) with geographic distribution of soil properties, where some limitations were identified. Although there still exist some discrepancies between PSM and field soil survey, the authors clearly addressed that PSM is still a valuable tool. They claimed PSM is more reproducible and objective, integrating field data with environmental covariates and learning algorithms, which can at least provide a useful pre-map for planning sampling and field survey. In this sense, the pedologist can go directly to key locations previously identified by SYSI or other landscape information, gaining time and reducing potential costs. At the end, the final decision is still made by the pedologists or soil surveyors by examining the soil at the landscape.

CONCLUSIONS
A direct representation of bare soil reflectance of São Paulo State was produced, representing most of the historical and recent cropped area. The relationship of this information was confirmed with clay and Fe 2 O 3 spatial distribution (surface and subsurface layer) by a satisfactory prediction performance when employed to machine learning with auxiliary spatial information (terrain). We also demonstrated the potential of bare soil image for clustering landscape units linked to soil taxonomic classes over a reference site in São Paulo. The predicted spatial clusters had a significant association with soil classes and their qualifiers, such as the main pedological class, subsurface color, soil mineralogy and soil texture. The potential of SYSI for soil survey and soil mapping was illustrated with several examples at the farm level. Therefore, this study demonstrates that bare soil images are valuable information derived from optical Earth Observation data, which can contribute to the future of soil survey and mapping in Brazil (i.e., PronaSolos).