Spatial multivariate optimization for a sampling redesign with a reduced sample size of soil chemical properties

ABSTRACT Precision agriculture can improve the decision-making process in agricultural production, as it gathers, processes and analyzes spatial data, allowing, for example, specific fertilizer application in each location. One of the proposals to deal with spatial heterogeneity of the soil or the distribution of chemical properties is to define application zones (homogeneous subareas). These zones allow reducing both spatial variability of the yield of the crop under study and of the environmental impacts. Considering the soil data, application zones can also represent strata or indicators to direct future soil sampling, thus seeking sample size reduction, for example. This study aimed to obtain an optimized sampling redesign using application zones generated from the assessment of five clustering methods (Fuzzy C-means, Fanny, K-means, McQuitty and Ward). Soil samples were collected in an agricultural area located in the city of Cascavel-Paraná-Brazil, and analyzed in the laboratory to determine the soil chemical properties, referring to four soybean harvest years (2013-2014, 2014-2015, 2015-2016 and 2016-2017). The application zones were obtained through a dissimilarity matrix that aggregates information about the Euclidean distance between the sample elements and the spatial dependence structure of the properties. Subsequently, an optimized sampling redesign, with reduction of the initial sample points, was obtained in these application zones. For the harvest years under study, the K-means and Ward clustering methods efficiently defined the application zones, dividing the study area into two or three application zones. Among the reduced sample configurations obtained by the optimization process, when comparing the initial sample configuration, the one optimized by 25 % (selecting 75 % of the initial configuration points, which corresponds to 76 sample points) was the most effective in terms of the accuracy indices (overall accuracy, Kappa, Tau). This fact indicates greater similarity between the thematic maps of these sample configurations. In this way, the reduced sample configurations could be used to generate the application zones and reduce the costs regarding the laboratory analyses involved in the study.


INTRODUCTION
Precision Agriculture (PA) can improve the decision-making process in agricultural production.Differently from traditional agriculture, PA allows specific fertilizer application, irrigation or amendments in each location, that is, at a variable rate.Consequently, its use can contribute to improving yield efficiency and reduce environmental impacts (Ortega and Santibáñez, 2007;Bottega et al., 2017).
In addition to enabling a reduction of contaminants and maximization of agricultural productivity, proper soil handling is directly related to knowledge of the soil attributes' spatial variability (Barbosa et al., 2019).This spatial variability of georeferenced variables can be studied by means of Geostatistics techniques, which also make it possible to determine the degree of spatial dependence between the sample elements in the region and describe the spatial dependence structure of the georeferenced variable in the entire area, thus elaborating the thematic maps (Cressie, 2015;Uribe-Opazo et al., 2021).
One of the proposed ways to deal with the soil spatial heterogeneity and soil chemical properties of the agricultural area is defining management zones (MZs), which is nothing more than delimiting the study area into subareas with similar characteristics, i.e., spatially homogeneous subareas according to certain variables/attributes.With this, it is possible to manage each subarea uniformly and with a similar amount of fertilizers, enabling a more viable strategy for localized management (Rodrigues Jr et al., 2011;Galambošová et al., 2014).As the MZs are generated to be used in several harvest years, it is recommended to use soil variables that do not vary significantly over time, such as topographic data (elevation and slope) and soil physical data (bulk density, soil texture, soil penetration resistance -SPR) (Aikes Jr et al., 2021).When the farmer only has data on soil chemical properties, application zones (AZs) for variable rate fertilizer application recommendations can be generated (Molin, 2006;Aikes Jr et al., 2021).
The difference between MZs and AZs is related to the available variables (stable or unstable) and to the intended use time of the zones (long-term use or merely for future fertilizer application).On the other hand, spatial statistics and multivariate cluster analysis are methods that can be used by both zones.In addition, these zones can represent indicators or strata for soil samplings, reducing the number of samples that need to be collected to perform the soil and crop analyses (Gavioli et al., 2019).Another methodology that allows reducing the number of sample points in a spatial variability study, i.e., spatially characterizing a property by studying its distribution and dispersion in the agricultural area and, consequently, raising the crop productivity level (Landim and Yamamoto, 2013), is the one related to the optimization processes, such as the Genetic Algorithm (GA).
The GA can be used to solve optimization problems found in the real world; it consists of an optimization technique based on the evolution and adaptation process of individuals in a population and intends that only those fit remain in the population, constituting a solution to the problem, i.e., it consists of an iterative process, starting with a population of individuals, which are the possible solutions to the problem.The individuals are evaluated to select the fittest, according to an objective function to be maximized or minimized.The selected individuals are recombined based on the genetic operators and, thus, a new population is generated.This process is carried out until finding the optimal solution or until reaching a stop criterion pre-established by the researcher; more details can be seen in Guedes et al. (2011) and Maltauro et al. (2019).
In the context of obtaining a reduced sample configuration, with a size previously fixed by the researcher and considering univariate optimization processes, the hybrid GA showed that a 50 % sample size reduction produces effective results for the classification of the potassium fertilizer in the area (Guedes et al., 2011).Optimizing the efficiency of the geostatistical model estimation and based on Fisher's information matrix (objective function) as well as the estimation of the values predicted by kriging, maximizing Overall Accuracy (OA, objective function) measure, Maltauro et al. (2019;2021), respectively obtained that the GA was efficient in sample size reduction, determining a sample size with 29.41 to 39.22 % of the initial sampling points for the soil chemical properties.
In this study, we seek a joint definition of AZs and to determine an optimized sampling redesign (a new reduced configuration) with a reduced size using the GA.Thus, AZs allow to collect more samples in areas with greater variability (heterogeneous areas) and reduce such numbers in more homogeneous areas (Rodrigues Junior et al., 2011).This study aimed to obtain a spatial multivariate optimized sampling redesign with reduced sample size of an agricultural area using application zones as a way to stratify the agricultural area, as well as the optimized process called GA.

Study area and soil data
Soil samples were collected in an agricultural area located in the city of Cascavel, Paraná State, and the following chemical properties were determined: Al, Base Saturation [V], Ca 2+ , C, Cu, Fe, K + , H+Al, Mg 2+ , Mn, organic matter (OM), P, pH, sum of bases (SB), and Zn; and the Shoemaker, Mac lean and Pratt (SMP) index was calculated (Table 1).The agricultural area has 167.35 ha and is a commercial grain production area located at Fazenda Três Meninas at approximately 24.95°South and 53.37°West and with a mean altitude of 650 m above sea level.The soil is classified as Latossolo Vermelho Distroférrico típico (Santos et al., 2018) ou Oxisols (Soil Taxonomy), with a clayey texture.The region's climate is classified as mesothermal and super-humid temperate, climate type Cfa (Köeppen classification system), and the mean annual temperature is 21 °C (Aparecido et al., 2016).
The sample configuration or sampling design (arrangement of sampling points) used in this area is lattice plus close pairs (Chipeta et al., 2017), with 102 sampling points.This sampling was chosen because regular sampling allows for a uniform distribution of sampling points throughout the study area.This sampling design consists of a regular grid with a minimum distance of 141 m between the points.In some randomly chosen places, the sampling points were arranged at smaller distances (75 and 50 m between point pairs) (Figure 1).Adding nearby points minimizes estimation errors at smaller scales.Samples were located and georeferenced using a GNSS receiver (GeoExplorer, Trimble Navigation Limited, Sunnyvale, CA, USA) in a Datum WGS84 coordinate reference system, UTM (Universal Transverse Mercator) projection.
Soil sampling was performed in each point indicated (Figure 1).According to the recommendations found in the literature, four soil subsamples were collected at these points, from 0.0 to 0.2 m depth in the vicinity of the points (Arruda et al., 2014), mixed and placed in plastic bags with samples of approximately 500 g, thus comprising the representative sample of the plot.The chemical analyses were performed using the Walkley-Black method (Walkley and Black, 1934).
Considering the database from the Laboratory of Applied and Spatial Statistics (LEA and LEE) at UNIOESTE, the last consecutive harvest years used in this research (2013-2014, 2014-2015, 2015-2016, and 2016-2017) for which soil samples were collected and analyzed in the laboratory to determine the chemical properties.Only the soil chemical properties were used because the database does not have soil physical properties for all years.

Initial analysis
Considering all harvest years, descriptive and geostatistical analyses were performed for each soil chemical property to verify the existence of directional trends, spatial dependence and anisotropy (Figure 2a).Rev Bras Cienc Solo 2023;47:e0220072 2003).Anisotropy was assessed by analyzing the directional semivariograms (Guedes et al., 2018) and the non-parametric Maity and Sherman (2012), considering 5 % significance.
To understand and describe spatial dependence, the parameters of the geostatistical models were estimated: exponential, Gaussian and Matérn family with shape parameter k m = 2.5 using the maximum likelihood method (Uribe-Opazo et al., 2012), with choice of the best-adjusted model performed following the cross-validation method (Faraco et al., 2008) 1).
Subsequently, thematic maps corresponding to each soil chemical property were prepared considering the spatial prediction of each property at locations not sampled in the agricultural area under study, through ordinary kriging (considered an optimal interpolator, due to the way in which the weights are distributed, so that the estimator cannot be biased and must have minimum variance) and with pixels representing 10 × 10 m areas (maximum number of interpolated points that made it possible to implement cluster analyses) (Cressie, 2015).

Acquisition of the multivariate spatial dissimilarity matrix
Considering the soil chemical properties selected in each harvest year (Table 1), a dissimilarity matrix was used to generate the AZs, which aggregates information on the Euclidean distance between the sample elements and the spatial dependence structure of the properties.Consequently, all the locations were compared in pairs to obtain the spatial and multivariate dissimilarity matrix.For this, in each pair of i and j locations (i,j = 1, …, n) in which the p properties had already been measured (Table 1), the similarity coefficient proposed by Gower (1971) was calculated (Equation 1; Figure 2b).
To turn this similarity coefficient into a distance measure (Gower, 1971)  Rev Bras Cienc Solo 2023;47:e0220072 The dissimilarity matrix was obtained based on the methodological sequence described by Oliver and Webster (1989) (Figure 2b).
To modify equation 2 of this methodology (Figure 2b), in order to consider the geographic distances between the observations sampled and the spatial variability of the properties, a Principal Component Analysis (PCA) was performed in the original data to reduce dimensionality without losing the information contained in all properties.In this PCA, the first principal component (PC1) was selected, as it explains most of the data variation.
Considering the PC1 scores, the geostatistical models were adjusted analogously to the methodology used for the soil chemical properties in order to obtain an estimation of the parameters of the geostatistical model for the PC1 scores.With these parameters, the dissimilarity matrix D* was obtained (Equation 3; Figure 2b).In this way, the matrix adds information about the Euclidean distance between the sample elements (Landim and Yamamoto, 2013), as well as the spatial dependence structure of the properties (Uribe- Opazo et al., 2012).
Based on the calculations (Equations 5 to 9; Figure 2b) performed in the D* matrix elements, the G matrix columns were obtained (Equation 10; Figure 2b), which are the new variables (chemical properties of the soil after using the dissimilarity matrix).
In this way, one selects the number of ρ columns corresponding to the number of original attributes.Subsequently, a geostatistical model was adjusted and data interpolation of the values of the soil chemical properties was performed through kriging, with pixels representing 10 × 10 m areas.The interpolated values of the soil chemical properties were used to obtain the AZs in the agricultural area (Gavioli et al., 2016).

Spatial clustering of the agricultural area
Considering the multivariate spatial dissimilarity matrix and the most cited clustering methods in the literature, five methods were evaluated for the agricultural area's clustering, three of them hierarchical and two partitioned, namely: Fuzzy analysis clustering (Fanny), Fuzzy C-means, K-means, McQuitty and Ward (Ward Jr, 1963;McQuitty, 1966;MacQueen, 1967;Bezdek, 1981;Kaufman and Rousseeuw, 1990), respectively.Five evaluation criteria were used to select the method that provided the best data clustering, namely: Dunn Index (D), Davies Bouldin Index (DB), C Index, SD Index and Variance Reduction Index (VR) (Dunn, 1974;Hubert and Levin, 1976;Davies and Bouldin, 1979;Halkidi et al., 2000;Gavioli et al., 2016).
To define the adequate number of clusters for each harvest year, the scatter plots of the Sum of Squares of Errors (SSE) against the number of clusters (knee graph) were used in each clustering method, as well as the silhouette scatter plot against the number of clusters.These methods were used for their stabilization and for presenting satisfactory results in the literature (Shi and Zeng, 2013;Yi et al., 2013;Martarelli and Nagano, 2016).In the SSE graph, the mean distance decreases as the number of clusters increases.To find the optimal number of clusters, it is necessary to find the cluster with a sharp drop; therefore, this will be the sweet spot of the clusters.For the silhouette graph, the cluster that presents the highest value or peak of the graph is observed (Tan et al., 2009;Yi et al., 2013).Thus, with the optimal number of clusters and clustering method selected, AZ maps were generated for all harvest years considering the soil chemical properties that showed spatial dependence.

Optimized sampling redesign (sampling points selected by the GA with sampling reduction)
After dividing the agricultural area into AZs, a sample reduction process was carried out, selecting sampling points within each AZ for all the harvest years.Obtaining a new reduced sample configuration was considered an optimization problem (Guedes et al., 2014;Maltauro et al., 2019), as the objective of optimization is to seek the best solutions Rev Bras Cienc Solo 2023;47:e0220072 to achieve the objective of the problem; in the proposal of this article, the research was to obtain the best sample configurations with reduced sizes.The intention was to reduce the initial sample configuration by 25 and 50 % (Figure 2b), selecting the sampling points within each AZ.
The methodology developed by the GA to obtain the optimized sample configuration was similar to the one developed by Maltauro et al. (2021), only changing the objective function.To obtain an optimized sample configuration for each harvest years, considering all soil properties, it was decided to work with multi-objective optimization, in which it is possible to find viable solutions that simultaneously optimize several objective functions (Deb and Kalyanmoy, 2001).To such end, the Sum of Weights (SW) method was used, which consists of the sum of the objective function corresponding to each property, adjusted by a weight (Branke et al., 2008;Pantuza Junior, 2016).
In this study, optimization efficiency was evaluated based on spatial prediction.Subsequently, we considered a multi-objective function to be minimized using the SW method (Equation 11), based on a measure of similarity between the initial and optimized sample configurations of each of the soil chemical properties, called OA (Guedes et al., 2014;Maltauro et al., 2021) methodology.
Eq. 11 in which: x i is a possible sample configuration for the problem, with sample size i (i=1,2,⋯,n), in which n is the number of sampling points; w m = 1/p is the weight for each objective function the k-th soil properties (Equation 12) Eq. 12 with k = 1,…, p, in which p is the number of soil attributes, so that w k ∈ [0, 1], ∑ p k=1 w k = 1 and OA k (x) ∈ [0,1].Consequently, when minimizing F(x i ), which is the linear combination of the k objective functions, lower f k (x i ) values will be obtained, which corresponds to getting a higher OAk (x i ) values.
With each optimized sample configuration, the descriptive and geostatistical data analyses were performed again.Finally, the initial and optimized sample configurations were compared using metrics that express the similarity of the thematic maps obtained through kriging, namely: OA (Anderson et al., 2001) and the Kappa (Kp) and Tau (T) agreement indices (Krippendorff, 2013) (Figure 2b).

Computational resources
The routines for calculating the spatial and multivariate dissimilarity matrix, clustering, sample configuration, optimization and other statistical and geostatistical analyses were developed in the R software (R Development Core Team, 2022), considering the ClassInt, cluster, clusterCrit, data.table,e1071, fastcluster, geoR, psych, Splancs and vegan packages.

Initial sample configuration
For all the harvest years, the soil chemical properties presented dispersion of their values around the mean, or even homogeneous data.The Ca 2+ , C, Cu, Fe, H+Al, Mn, K + , Mg 2+ , OM and P contents, as well as pH and SB, had mean values considered average, high or very high for land-use.In turn, the Zn mean value can be classified as low or average, and the Al and V values were classified as very low or low.
Rev Bras Cienc Solo 2023;47:e0220072 For the directional trend, only the Zn content in the 2014-2015 harvest year presented a moderate linear association of its respective values with the X axis coordinates, with a Pearson's linear correlation coefficient value greater than 0.30 in a module.For all the harvest years, the soil chemical properties presented moderate (0.25< RNE <0.75) or strong (0.25≤ RNE) spatial dependence (Table 2).
Regarding the estimated value for the spatial dependence radius (practical range), the 2013-2014 and 2015-2016 harvest years presented greater variation, from 157.70 to 707.86 m and from 149.73 to 855.10 m, respectively, whereas the 2014-2015 and 2016-2017 harvest years exhibited less variation in the practical range, from 128.61 to 453.07 m and from 126.32 to 385.62 m, respectively.This variation in the practical ranges can be influenced by the chosen model and the sample size reduction (Table 2).

Spatial clustering of the agricultural area
For the 2013-2014, 2015-2016 and 2016-2017 harvest years, considering the scatter plots of the number of clusters versus the SSE and silhouette ones, the best number of clusters for all the clustering methods was k c = 2 since, with this number of clusters, the highest value of the Silhouette coefficient and the lowest SSE value were obtained.
In turn, for the 2014-2015 harvest year and for most of the clustering methods, the ideal number of clusters was k c = 3 (Figure 3).As for the interpretation of the chemical properties available in the soil within each AZ, it was noticed that all the soil chemical properties presented average, high or very high values for the soil in the state of Paraná, except for the Al and pH soil chemical properties, which were classified as low or very low.
Considering the evaluation criteria, K-means was the best clustering method for the 2013-2014, 2014-2015 and 2015-2016 harvest years since, in this method, the lowest values of the DB, C and SD indices were obtained, as well as the highest values of the D and VR indices (Table 3).Regarding the 2016-2017 harvest year, a tie was observed between the Ward and Fuzzy C-means clustering methods (Table 3).In addition to that, certain similarity was verified in the maps of the clusterings in relation to the definition of the AZs.Consequently, the Ward clustering method was selected for the 2016-2017 harvest year, as its execution is simpler and requires less computational time.
Differences are observed when comparing the AZs generated for each harvest year, which can be explained by the fact that different soil chemical properties are used in each harvest year.However, it is noted that a larger AZ was created in all harvest years (red color, Figure 4).In addition to that, it was observed that there is at least one AZ in the Southwest region in all the harvest years (Figure 4).And, except for the 2015-2016 harvest year, it was also possible to find at least one AZ in the North region (Figure 4).

Optimized sample configuration
Considering all harvest years, the sampling points were distributed across all AZs, with the largest zones presenting the highest number of sampling points, thus collecting a higher number of samples in areas with greater spatial variability, as well as reducing this number in more homogeneous areas.This result was also simultaneously influenced by the size of the AZs and by the uniform distribution of sample points over the agricultural area based on the original design (lattice plus close pairs).Thus, AZ1 covered 60, 32, 65 and 73 sampling points, which correspond to 59, 31, 64 and 72 % of the total points in the study area, respectively, for the 2013-2014, 2014-2015, 2015-2016 and 2016-2017 harvest years.In turn, AZ2 comprised 42 sampling points (41 % of the  total points of the study area), 26 sampling points (26 % of the total points of the study area), 37 sampling points (36 % of the total points of the study area) and 29 sampling points (28 % of the total points in the study area) (Figure 5).In addition, in the 2014-2015 harvest year, AZ3 included 44 sampling points (43 % of the total points in the study area).The optimized sample configurations that removed 50 % of the initial sampling points (O50) obtained 51 sampling points, and those that removed 25 % (O25) of the initial sampling points had 76 sampling points distributed in the agricultural area (Figure 5).Greater reductions were not possible, as the number of sampling points would not meet the geostatistical analysis criteria, that is, having at least 30 pairs for the calculation of semivariances.
For all the harvest years, similarity in the descriptive statistics was observed between the O50 and O25 sample configurations and the initial sample configurations (Tables 2, 4, and 5).For the 2015-2016 harvest year, both optimized sample configurations to obtain the Cu content presented a directional trend in the Y direction (North-South) (r >30).For the 2014-2015 harvest year, the Zn content presented a directional trend in the X direction (East-West) (r >30) for the O50 and O25 sample configurations, unlike the initial sample configuration, which presented a directional trend in the Y direction (North-South) (r >30).
For the 2013-2014 harvest year, only the H+Al and Mn contents presented a change in the classification of the spatial dependence intensity, from moderate (0.25< RNE <0.75) to weak (RNE ≥0.75) or to strong (0.25≤ RNE) spatial dependence for the optimized sample configurations (Table 4).Regarding the 2014-2015 harvest year, the Al and Zn soil chemical properties presented a nugget effect for at least one optimized sample configuration; and the Ca content had strong spatial dependence (0.25≤ RNE) in the O50 sample configuration (Table 4).
For the 2015-2016 harvest year, all soil chemical properties presented moderate spatial dependence (0.25< RNE <0.75) in the initial sample configuration and in the O25 sample configuration, whereas in the O50 sample configurations, the Ca and Mn contents showed a pure nugget effect, SB indicated weak spatial dependence (RNE ≥0.75) and Zn evidenced strong spatial dependence (0.25≤ RNE) (Table 5).Finally, for the 2016-2017 harvest year, in the O50 sample configuration, the Ca 2+ and V contents had strong (0.25≤ RNE) and moderate (0.25< RNE <0.75) spatial dependence, respectively; in turn, the SB presented moderate spatial dependence (0.25< RNE <0.75) in the O25 sample configuration, and the pH had strong spatial dependence (0.25≤ RNE) in both optimized sample configurations (Table 5).
Comparing the thematic maps of the Ca 2+ , Mn, MO and Zn contents and that of V generated considering the initial and the O25 configurations, estimated OA values above 85 % were found, which indicates that the maps are similar; in other words, the maps prepared considering both configurations are similar in terms of distribution of the properties contents in the study area (OA ≥85 %) (Figures 6 to 9).Therefore, the O25 sample configuration could also be used to delimit the AZs, similarly to those generated with the initial sample configuration.
According to the estimated values of the Kp and T agreement indices, most of the soil chemical properties presented low or average accuracy, with values between 0.001 and 79.18 % (low accuracy if Kp and T <67 %, average accuracy if 67 % ≤ Kp and T <80 %); whereas the Ca 2+ , Cu, Mn, MO and Zn contents and pH, SB and V presented high accuracy, with values between 80.01 and 91.56 %, mainly for the T index (high accuracy if Kp and T ≥80 %) with the initial and the O25 configurations.In turn, the Al presented high accuracy when comparing O25 and O50 to the initial configuration (Figures 6 to 9).This shows no relevant differences in the spatial prediction of these properties in the area under study, described by the thematic maps.

Clustering of the agricultural area
When comparing the clustering methods, it was observed that the Fanny method was not selected in any of the criteria, and requires more computational time compared to other partitioned methods (Rajkumar et al., 2019).Among the hierarchical methods, the only one that was chosen for a given harvest year was Ward's, which is in line with the CV: coefficient of variation; μ̂ = β 0 : estimated mean, φ1: estimated nugget effect; φ2: estimated contribution; â: estimated practical range; R_ NE: estimated relative nugget effect (R_ NE = φ1/φ1 + φ2) (%); for attributes that showed a directional trend μ̂ = β 0 + β 1 Y 1 , where β ̂0 (first value of the mean column), β ̂1 (second value of the mean column): estimated values of the parameters of the regression model and Y 1 represents the directional trend identified; Exp.: Exponential model; Gaus.: Gaussian model; M. km=2.5:Matérn model with smoothness parameter km=2.5; the units of measure of soil chemical properties are found in Table 1.
result obtained by Ossani et al. (2020); by analyzing clusterings in a coffee plantation, the authors verified that regardless of the distance considered, this method stood out among the other hierarchical methods.Dobermann et al. (2003) showed that the Ward method is one of the algorithms that provides the best results, analyzing various configurations of the input data.In addition, Freitas et al. (2014) and Santos et al. (2015) showed that the Ward method was also efficient in verifying similarities or differences based on the chemical and physical properties of the soil; in addition, integrated with the characterization of the soil properties' spatial variability, this method was effective in defining MZs.
Comparing the K-means and Fuzzy C-means methods to analyze the performance of various segmentation techniques for color images, Jipkate and Gohokar (2012) concluded that K-means clustering produces better results and computational times.The K-means method was also efficient for delimiting MZs from the interpolated variability maps and for coffee, based on determinations carried out with a chlorophyll sensor and by leaf analysis (Rodrigues Jr et al., 2011;Alves et al., 2013).The study only considered chemical properties for the generation of zones.Ortega and Santibáñez (2007) also used soil chemical properties to evaluate three zoning methods and quantitatively determine the relationships between the methods evaluated.
The same result regarding the generation of two or three zones was also obtained by Barbosa et al. (2019) and by Breunig et al. (2020) when analyzing different numbers of AZs for various harvest years in grain production areas.In the definition of AZs, obtaining a small number of zones renders the application of localized management practices more economically viable, mainly due to greater simplicity in the subdivision of the production field (Carvalho et al., 2016).Then, as the number of AZs increases, they end up becoming increasingly irregular, which leads to difficulties managing them due to technical and economic limitations, as small zones can become unmanageable.

Optimized sample configuration
Similarity in the descriptive statistics was observed between the O50 and O25 sample configurations and the initial sample configuration; this result was also found by Maltauro et al. (2019;2021) and by Dal'Canton et al. (2021), obtaining similar sample reductions even when working with different methodologies to obtain a sample reduction; in addition, these research studies were developed in the same agricultural area, considering different soil chemical properties.Thus, it can be stated that the reduced sample configurations are representative due to the similarity obtained.
The few cases in the optimized configurations that showed weak spatial dependence and a pure nugget effect might be influenced by the sample size reduction, as low concentrations  in samples might lead to an overestimation of the nugget effect (Hofmann et al., 2010).Most of the optimized configurations remained spatially dependent.One of the factors that may have contributed to this is the maintenance of the close pairs of points that lattice plus close pairs sampling provides an optimized sampling.In fact, these pairs of close points make it possible to more accurately estimate the nugget effect and minimize sampling error on a small scale (Hofmann et al., 2010).
Disregarding the soil chemical properties that presented weak spatial dependence and/or a pure nugget effect, the spatial dependence radium of the initial and optimized sample configurations was compared.A smaller variation in the practical ranges (upwards or downwards) considering the initial and O25 sample configurations (Tables 4 and 5) was observed, with this variation in the practical ranges influenced by the model chosen (Diggle and Ribeiro Jr, 2007).Such being the case, the O25 sample configuration presented the best estimate for the spatial dependence radium values when compared to the initial sample configuration (Tables 4 and 5).
Furthermore, for most of the soil chemical properties, when compared to the initial sample configurations, O25 presented higher values for the accuracy indices (Figures 6  to 9).This fact was already expected, as this optimized sampling contains more sampling points compared to the one with a 50 % sample reduction.In the practice, there is a certain similarity between the thematic maps generated with the initial and optimized sample configurations; therefore, the initial sample configuration or the optimized sample configuration could be used for the localized application of inputs.Furthermore, in both optimized sample configurations, it was possible to observe that the spatial variability pattern is maintained in most classes of the thematic map of the soil chemical properties, a fact also observed by Maltauro et al. (2019) and by Dal' Canton et al. (2021), working with sample reductions in an area of grain plantations.
As the soil chemical properties (macronutrients and micronutrients) are important for the development and growth of plants, it is necessary to know their availability in the soil; however, macronutrients (Ca, K, Mg, and P) are elements that plants need in high amounts, while micronutrients (Cu, Fe, Mn, and Zn) are the ones they need in smaller amounts and are absorbed in the form of cations (Mendes, 2007;Oliveira, 2007).Carbon plays numerous roles in the formation of biomass and plant metabolism, being necessary for plant growth as well as OM, which makes the soil richer in nutrients (Lopes, 1998;Ferreira et al., 2014;Assad et al., 2019).As for V, dystrophic soils (V <50 %) apparently have a lower ability to yield nutrients to plants when compared to eutrophic soils (V ≥50 %) (Mendes, 2007).
As for the macronutrients, it is estimated that the P utilization rate by the plant is from 20 to 40 %, with 80 to 95 % being fixed to the soil (Oliveira, 2007).It is absorbed in anionic forms, presenting a strong covalent bond with the O atom, maintained even after incorporation into plant tissues (Mendes, 2007).As for the K attribute, it is estimated that its utilization rate by the plant varies from 50 to 70 %, with K losses occurring in the soil due to leaching and water erosion (Oliveira, 2007).It is absorbed by plants in the ionic form of K + , and absorption depends on the diffusion of the element through the soil solution and mass flow (Villar, 2007).In the same way as P, during assimilation, its redox state does not change, remaining in the same ionic form in which it was absorbed (Mendes, 2007).Calcium and Mg are absorbed by plants as Ca 2+ and Mg 2+ and are found at high levels in the soil solution, as root interception attends to a considerable absorption percentage, while the mass flow supplies the rest.The Cu addresses the same process.All three processes supply the Fe and Zn: root interception, mass flow and diffusion (Lopes, 1998;Villar, 2007).
Therefore, by interpreting the chemical properties available in the soil within each AZ described in this paper, it was possible to observe that almost all the soil chemical properties presented high or very high average values for the soil in the state of Paraná.However, the Al and pH were classified as low or very low (Oliveira 2007;Pavinato et al., 2017).The Al content in the soil exerts a beneficial effect on the plant when it is supplied in low concentrations (Mendes, 2007).Soil pH is one of the most important factors influencing the availability of nutrients to plants; however, low pH values in the soil indicate greater soil acidity, which affects plant growth (Lopes, 1998;Oliveira, 2007).
One solution to reduce soil acidity is to apply lime in the study area (Lopes, 1998).Therefore, the SMP index is used to correct the soil with liming recommendations: the method consists in adding a volume of buffer solution to the soil sample, with the pH Rev Bras Cienc Solo 2023;47:e0220072 reading in the suspension of the sample representing the SMP index (Shoemaker et al., 1961;Lopes, 1998;Villar, 2007).For most soil chemical properties, these results agree with those obtained for the original sample configurations.
Therefore, at these sampling points, localized application of inputs can be carried out according to the need for each soil chemical property.As the amounts of macronutrients and micronutrients contained in the soil and their availability depend on several factors, and mainly on the interaction between them, it is known that the nutrients present in the soil are not always available (or easy to be absorbed -due to strong chemical bonds between nutrients and soil) to the plant.Thus, the fertilizer is in a format of easy availability of nutrients to the plant.In this way, fertilization is also carried out in a minimal amount (for nutrients that are in excess), only applying the amounts that the plant needs to absorb to develop (Santos and Silva, 2010).The methodology presented in this study can only be used to redefine a sample configuration that already exists in the study area and, therefore, cannot be used for new samples.

CONCLUSION
For all the harvest years, the clustering methods were efficient for defining the application zones (AZs) and, for the 2013-2014, 2015-2016 and 2016-2017 harvest years, the best number of clusters for all the clustering methods was k c = 2.For the 2014-2015 harvest year and for most of the clustering methods, the ideal number of clusters was k c = 3. Considering the evaluation criteria, K-means and Ward were the best clustering methods.Therefore, from a practical point of view, it is concluded that the AZs allow for localized application of inputs in the agricultural area.
Optimized sample configurations can be obtained by 50 and 25 % with the Genetic Algorithm (GA).However, among the sample configurations, when compared to the initial sample configuration, the O25 optimized sample configurations presented the best estimates for the spatial dependence radius values and the highest values for the accuracy indices, indicating greater similarity between the thematic maps of these sample configurations.The AZs effectively obtained an optimized sample configuration with 75 % of the sampling points of the agricultural area.
Therefore, this study showed that the reduced sample configurations allowed reducing the number of soil samples required and, consequently, the costs inherent to carrying out laboratory analyses, in addition to achieving efficiency in the analysis of the special variability of properties and yields.

Figure 1 .
Figure 1.Agricultural area and sampling points.

Figure 2 .
Figure 2. Methodology to obtain the optimized sample configurations (a) and the new matrix of properties from a dissimilarity matrix (b).

Figure 4 .
Figure 4. Thematic maps with the best number of application zones and clustering method chosen for each harvest year.Application zone 1 A pplication zone 2 A pplication zone 3

Figure 6 .
Figure 6.Thematic maps of the soil chemical properties considering the initial and optimized sample configurations and estimated values of the OA, Kp, and T similarity measures for the 2013-2014 harvest year.

Figure 7 .
Figure 7. Thematic maps of the soil chemical properties considering the initial and optimized sample configurations and estimated values of the OA, Kp, and T similarity measures for the 2014-2015 harvest year.

Table 1 .
Soil chemical properties and SMP index used in the study, indicated with an X

Table 2 .
Descriptive statistics and estimated values of the geostatistical model parameters for the soil chemical properties and SMP index, referring to each harvest year and considering the initial sample configuration CV: coefficient of variation; μ̂ = β 0 : estimated mean; φ̂1: estimated nugget effect; φ̂2: estimated contribution; â: estimated practical range; R_ NE: estimated relative nugget effect (R_ NE = φ1/φ1 + φ2) (%) for properties that showed a directional trend μ̂ = β 0 + β 1 Y 1 , in which β ̂0 (first value of the mean column), β ̂1 (second value of the mean column): estimated values of the parameters of the regression model and Y 1 represents the directional trend identified; Exp.: exponential model; Gaus.: Gaussian model; M. km=2.5:Matérnmodel with smoothness parameter km = 2.5; the units of measure of soil chemical properties are found in table 1.Rev Bras Cienc Solo 2023;47:e0220072

Table 3 .
Evaluation measures according to the clustering method used to generate application zones D: Dunn Index; DB: Davies Bouldin Index; C: C Index; SD: SD Index; and VR: Variance Reduction Index.The best results of the indices are highlighted in bold type.

Table 4 .
Descriptive statistics and estimated values of the adjusted geostatistical model parameters for the soil chemical properties and with the sample configurations optimized by 50 and 25 %, referring to the 2013-2014 and 2014-2015 harvest years CV: coefficient of variation; μ̂ = β 0 : estimated mean; φ1: estimated nugget effect; φ2: estimated contribution; â: estimated practical range; R_ NE: estimated relative nugget effect (R_ NE = φ1/φ1 + φ2) (%) for properties that showed a directional trend μ̂ = β 0 + β 1 Y 1 ; in which β ̂0 (first value of the mean column), β ̂1 (second value of the mean column): estimated values of the parameters of the regression model and Y 1 represents the directional trend identified; Exp.: exponential model; Gaus.: Gaussian model; M. km=2.5:Matérnmodel with smoothness parameter km=2.5; the units of measured soil chemical properties are found in Table1.Rev Bras Cienc Solo 2023;47:e0220072

Table 5 .
Descriptive statistics and estimated values of the adjusted geostatistical model parameters, for the soil chemical properties and SMP index with the sample configurations optimized by 50 and 25 %, referring to the 2015-2016 and 2016-2017 harvest years