Information‐Based Machine Learning for Tracer Signature Prediction in Karstic Environments

Karstic groundwater systems are often investigated by a combination of environmental or artificial tracers. One of the major downsides of tracer-based methods is the limited availability of tracer measurements, especially in data-sparse regions. This study presents an approach to systematically evaluate the information content of the available data, to interpret predictions of tracer concentration from machine learning algorithms, and to compare different machine learning algorithms to obtain an objective assessment of their applicability for predicting environmental tracers. There is a large variety of machine learning approaches, but no clear rules exist on which of them to use for this specific problem. In this study, we formulated a framework to choose the appropriate algorithm for this purpose. We compared four different well-established machine learning algorithms (Support Vector Machines, Extreme Learning Machines, Decision Trees, and Artificial Neural Networks) in seven different karst springs in France for their capability to predict tracer concentrations, in this case SO₄²⁻ and NO₃⁻, from discharge. Our study reveals that the machine learning algorithms are able to predict some characteristics of the tracer concentration, but not the whole variance, which is caused by the limited information content in the discharge data. Nevertheless, discharge is often the only information available for a catchment. The ability to predict at least some characteristics of the tracer concentrations from discharge time series, for example, to fill gaps or to enlarge the database for subsequent analyses, is therefore a helpful application of machine learning in data-sparse regions or for historic databases.


Introduction
Tracer-based methods are often the only way to separate stream flow components and to determine the origin of water (Kirchner, 2003; Klaus & McDonnell, 2013; Mei & Anagnostou, 2015; Mewes & Oppel, 2019; Rimmer & Hartmann, 2014; Weiler et al., 2017). Especially in karstic environments, tracer investigations allow a deeper understanding of the underlying karstic system and the interdependencies of discharge and the current state of the subterraneous processes or storages (Aquilina et al., 2005; Gur et al., 2003; Lee & Krothe, 2001).
The joint analysis of tracer data and discharge measurements is a common tool to derive information about hydrological systems, for example, the identification of the origin of water within a catchment. Despite their advantages, these approaches demand long time series of tracer measurements covering a wide range of hydrological system dynamics (Garvelmann et al., 2017; Lee & Krothe, 2001). To describe catchments with hydrological models, the link between tracer signatures and the system's hydrological state is of interest to set up suitable calibration strategies. Although the dependency on tracer data in model studies is high, the information content of tracer measurements has rarely been analyzed. Furthermore, the information-to-noise ratio in the data has to be high to derive the desired information about the system (Kelleher et al., 2015). Another problem is the lack of available tracer databases, which hinders many applications, especially in data-sparse regions. Here, machine learning could be useful because its core concept is to predict values that are difficult to measure from input data that are straightforward to measure. If the algorithms are able to predict tracer concentrations from discharge time series, data-driven interpolations of continuous tracer concentration time series can be obtained.
With the rise of machine learning technologies and further improvements in information technology, the application of new approaches for data analysis and the interplay of data, information content, and results have increased (Goodfellow et al., 2016; Kelleher et al., 2015). Machine learning is the umbrella term for processes that extract patterns from data automatically (Goodfellow et al., 2016). Machine learning-based algorithms are used in many hydrological applications (Raghavendra & Deka, 2014), like rainfall-runoff modeling with artificial neural networks (Hu et al., 2011; Nourani et al., 2009), precipitation forecasting (Yu et al., 2017), evapotranspiration prediction (Tabari et al., 2012), baseflow separation (Corzo & Solomatine, 2007), measurement setup design (Chacon-Hurtado et al., 2017), streamflow forecasting (Shortridge et al., 2016; Shrestha & Solomatine, 2006; Taormina et al., 2015; Yaseen et al., 2016), the separation of flood events from time series of discharge (Mewes & Oppel, 2019), water resource management (Fotovatikhah et al., 2018), and many more. In these studies, machine learning algorithms were mostly used to replicate a system and to project a certain variable into the future. Machine learning was found to be a useful tool to process data in complex systems, like catchments, where the rules leading from input to output are not completely describable. For example, the dispersion of a tracer in a small river was evaluated in a 1-D profile using a Multi-Layer-Perceptron neural network (Piotrowski et al., 2007).
For machine learning algorithms, the information content of training data is important (Han & Kamber, 2010; Kelleher et al., 2015; Vapnik, 2013). The Shannon entropy is a common concept in information theory to analyze the information content of given data (Shannon, 1948; see also Fernando et al., 2009). Until now, no study has tried to predict natural tracer concentrations in karstic environments from discharge dynamics by applying machine learning algorithms to fill gaps between point measurements of tracer concentrations. This strategy was chosen because discharge is often the only available data source with an appropriate temporal resolution for hydrological modeling at an event scale. In the database we used, some infrequent tracer concentration measurements were available as point measurements. A machine learning tool capable of filling these gaps would allow the use of databases that combine frequent discharge measurements with infrequently measured tracer concentrations. Additionally, an already trained algorithm could predict tracer concentrations for catchments in which only a limited number of discharge measurements is available. Furthermore, it would qualify historic data for application in modeling approaches that require a higher temporal resolution of tracer measurements. In karstic environments, the joint analysis of tracer data is often the key for a deeper understanding of system states and behavior (Mudarra et al., 2019). Therefore, we assume a high information content in the measured tracer data because they describe the complex interaction of subterraneous processes. Machine learning algorithms depend on information provided in the data. Consequently, the available data sets of discharge and tracer measurements have to be analyzed for their explanatory power, which has not been done before for a database of karst springs.
Furthermore, an information content-based analysis of the interpolated tracer measurements can be conducted by comparing the prediction results with the information content of the input data.
In this study, we analyzed observed discharge and natural tracer data (sulfate, SO₄²⁻, and nitrate, NO₃⁻) from seven different karst springs in France regarding their information content. We took natural tracers because they exist in varying concentrations and are measurable without any induced injection. We chose nitrate and sulfate because they represent different residence times in the system. While nitrate represents shallow, fast-flowing water, sulfate represents the opposite origin: slow phreatic processes. We applied four machine learning algorithms, namely Support Vector Machines (SVM), Classification and Regression Trees (CART), Extreme Learning Machines (ELM), and Artificial Neural Networks (ANN), to estimate tracer concentrations from discharge dynamics. We selected these four machine learning approaches because they (a) are well established in hydrology, (b) are used for pattern recognition in structured data sets, and (c) deliver structures that are, to a certain degree, interpretable for the researcher. Furthermore, we compared different concepts of prediction, including the univariate prediction that separately estimates each tracer with a specialized machine and the multivariate estimation that tries to predict a set of tracers with a combined machine. We tested the prediction capability of each of the chosen approaches in seven different catchments and created a strategy to build a data-driven tool set for the interpolation of continuous time series of tracer measurements. Finally, we linked the prediction results with the observed information content in the data as well as with the mutual information between the chosen tracers.

Methods and Data
Sound results from machine learning approaches require data with a high information-to-noise ratio. Moreover, the choice of the appropriate machine learning algorithm for this task is difficult to justify without understanding the internal structure of the problem. Following the No-Free-Lunch theorem, all available approaches should be equally suitable to solve this problem but with a different performance and different demands on the data in terms of amount and quality (Wolpert & Macready, 1997). Accordingly, without information for an a priori selection of the best machine learning approach to use, we chose four structurally different approaches to estimate tracer concentrations in seven catchments. To quantify the information content within the data set, we introduce concepts like continuous entropy and mutual information. After defining these basic concepts, we explain the choice of machine learning algorithms in this study and outline the remaining workflow of this application.
10.1029/2018WR024558 Water Resources Research

Entropy and Mutual Information
Shannon's model of entropy makes it possible to quantify the information gained by adding new data to the analysis (Shannon, 1948). The entropy H is defined by the chance of a sample X_d to belong to one of the given classes {x_1, …, x_{N_d}}, with P(x_n) as the probability that X_d = x_n, with a sample length N:
$$H(X_d) = -\sum_{n=1}^{N_d} P(x_n) \log P(x_n) \tag{1}$$
Because Shannon's entropy is only valid for discrete data, the concept was extended to the continuous entropy for a continuous variable X_c, which is in our case discharge:
$$H(X_c) = -\int_{\Omega} f(x) \log f(x) \, dx \tag{2}$$
where f(x) is the probability density function (PDF) of X_c and Ω is the defined domain of X_c (Gong et al., 2014). To determine the explanatory power of data concerning a variable, for example, how much of the information of NO₃⁻ is explained by discharge only, we further extend the concept of continuous entropy to conditional entropy (Thomas & Cover, 2006), where y is the tracer concentration and x is the discharge sequence:
$$H(y \mid x) = -\iint f_{x,y}(x, y) \log \frac{f_{x,y}(x, y)}{f_x(x)} \, dx \, dy \tag{3}$$
where f_{x,y}(x, y) is the joint PDF of x and y and f_x(x) is the marginal PDF of x. Conditional entropy quantifies the uncertainty that remains in variable y once variable x is known and thus describes how much of y can be explained by x. To describe the shared information between two data points given as x and y, we apply the mutual information (Shannon, 1948; Sharma, 2000). In our case, we investigate the shared information between the two chosen tracers NO₃⁻ and SO₄²⁻. The mutual information between two measurements is defined as
$$MI(x, y) = \iint f_{x,y}(x, y) \log \frac{f_{x,y}(x, y)}{f_x(x) \, f_y(y)} \, dx \, dy \tag{4}$$
where f_x(x) and f_y(y) are the marginal PDFs of x and y, and f_{x,y}(x, y) is the joint PDF of x and y (Sharma, 2000). Following Sharma (2000), the mutual information score from equation (4) can be approximated by
$$MI(x, y) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{f_{x,y}(x_i, y_i)}{f_x(x_i) \, f_y(y_i)} \tag{5}$$
In this approximation, f_x(x_i), f_y(y_i), and f_{x,y}(x_i, y_i) are the marginal and joint densities at the same point of the same sample (Fernando et al., 2009; Sharma, 2000). To estimate the density, we apply a kernel estimator (Fernando et al., 2009). Without the kernel estimator, a theoretical distribution of the MI has to be assumed, which adds more bias to the approach.
As nearly all models rely on the interplay of input and output data, the shared information expressed by the mutual information has to be weighted more strongly than the internal information represented by the continuous entropy.
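The sample approximation of the mutual information with kernel density estimates can be illustrated with a short sketch. This is a minimal standard-library example, not the implementation used in the study: the Gaussian product kernel, the rule-of-thumb bandwidth, and all function names are assumptions made for illustration, and the synthetic data only serve to show that dependent series score higher than independent ones.

```python
import math
import random


def bandwidth(sample):
    """Normal-reference rule-of-thumb bandwidth (an assumption, not from the paper)."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 1.06 * std * n ** (-1.0 / 5.0)


def kde_1d(sample, h):
    """Return a function that evaluates a 1-D Gaussian kernel density estimate."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return density


def kde_2d(xs, ys, hx, hy):
    """Return a function that evaluates a 2-D Gaussian product-kernel density estimate."""
    norm = 1.0 / (len(xs) * hx * hy * 2.0 * math.pi)

    def density(x, y):
        return norm * sum(
            math.exp(-0.5 * (((x - a) / hx) ** 2 + ((y - b) / hy) ** 2))
            for a, b in zip(xs, ys)
        )

    return density


def mutual_information(xs, ys):
    """Average log ratio of joint to marginal densities at the sample points."""
    hx, hy = bandwidth(xs), bandwidth(ys)
    fx, fy = kde_1d(xs, hx), kde_1d(ys, hy)
    fxy = kde_2d(xs, ys, hx, hy)
    total = sum(math.log(fxy(x, y) / (fx(x) * fy(y))) for x, y in zip(xs, ys))
    return total / len(xs)


# Synthetic demonstration data (not measurements from the study).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(150)]
ys_dependent = [x + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
ys_independent = [rng.gauss(0.0, 1.0) for _ in range(150)]

mi_dependent = mutual_information(xs, ys_dependent)
mi_independent = mutual_information(xs, ys_independent)
print(mi_dependent, mi_independent)
```

In this sketch the strongly coupled pair yields a clearly larger score than the independent pair, which is the behavior the mutual information is meant to capture.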

Machine Learning Algorithms
The main aim of the paper is to use discharge data as a predictor for tracer concentrations because discharge in streams and rivers is more commonly measured than tracer concentrations, especially in regions where access to the site is limited and research relies on public databases. Therefore, we train machine learning algorithms using time series of runoff to predict time series of tracer concentrations.
The discharge dynamics are captured by a window of discharge data that is clipped from the original time series, contains the time t′ of a tracer measurement, and serves as input for the machine learning algorithms. The machine learning algorithms predict the tracer concentrations based on information from the discharge pattern (Figure 1). For training and validation, the predicted tracer concentrations are compared to the measured data (which are considered to represent reality). To reduce overfitting due to complex input data, an optimal length for the window of discharge data has to be identified, which is discussed in detail in section 2.3. Without defining a window, a Long Short-Term Memory network can be applied, which requires a continuous time series of input and training data. Due to a lack of continuous time series of tracer measurements, this approach was discarded.
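The clipping of discharge windows around tracer measurements can be sketched as follows. The sketch assumes a daily discharge list, a dictionary of tracer measurements keyed by day index, and a window that is symmetric around the measurement day; these data structures and the function names are hypothetical and chosen only for illustration.

```python
def discharge_window(discharge, t_index, half_width):
    """Return the discharge values from t_index - half_width to t_index + half_width.

    Returns None when the window would extend beyond the time series.
    """
    start = t_index - half_width
    stop = t_index + half_width + 1
    if start < 0 or stop > len(discharge):
        return None
    return discharge[start:stop]


def build_samples(discharge, tracer_obs, half_width):
    """Pair every tracer measurement with its discharge window.

    tracer_obs maps the day index of a measurement to the measured concentration.
    Measurements without a complete window are skipped.
    """
    samples = []
    for t_index, concentration in sorted(tracer_obs.items()):
        window = discharge_window(discharge, t_index, half_width)
        if window is not None:
            samples.append((window, concentration))
    return samples


# Synthetic example: ten days of discharge and three tracer measurements.
discharge = [float(q) for q in range(10)]
tracer_obs = {0: 1.0, 4: 2.0, 9: 3.0}
samples = build_samples(discharge, tracer_obs, half_width=2)
print(samples)
```

Only the measurement on day 4 has a complete five-day window in this toy series, so it is the only sample that reaches the learning step.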
Four structurally different machine learning algorithms are used in this study: SVM, CART, ELM, and Multi-Layer-Perceptron ANN. These algorithms were chosen because of their suitability for regression problems and their origin in two of the four main machine learning families: error-based learning and information-based learning (Kelleher et al., 2015). Moreover, they are commonly applied in hydrology and deliver, to a certain degree, structures that can be interpreted by the researcher. SVM and CART are not known to capture temporal patterns in time series data. By the reduction from a complete time series to a window with a variable length, temporal dependencies are reduced to dependencies on the relative position within the window. Thus, the problem is reduced to a pattern recognition problem (Nasrabadi, 2007).
An SVM is an error-based machine learning algorithm that tries to set up a regression to estimate the unknown tracer concentration from the input discharge sequence (solid line in Figure 2(a)). This regression is depicted through a hyperplane, for which the distance to the margin (dashed line in Figure 2(a)) and the most distant feature, the so-called support vector, is maximized (Cortes & Vapnik, 1995; Raghavendra & Deka, 2014). For a linear problem, this fitting of a regression can easily be done, but most machine learning problems, like the one presented here, are highly nonlinear. Therefore, we use a kernel function to transfer the existing problem to a higher dimension in which the problem becomes linear (Chang et al., 2010; Kelleher et al., 2015). As the choice of the mapping kernel is highly problem specific, a selection of several kernel functions (radial basis function, linear, polynomial, and sigmoid) was tested and the best kernel in terms of numerical stability and computational demands was chosen, in our case the radial basis function kernel. For more information on the choice of the kernel, see Vapnik (2013). The created boundary layer is used to predict the unknown tracer concentration C in the feature space from the input discharge dynamics, represented as a green dot (Figure 2). Accordingly, the SVM tries to solve the regression problem by transferring the discharge data into either a single tracer concentration or a set of tracer concentrations in the multivariate output. Hence, the hyperplane represents the regression function to estimate the respective tracer concentration from the discharge sequence.
CART builds decision trees that serve as guidebooks to estimate the tracer concentration from the discharge values. The tree shows the ramifications of decisions leading to the final regression result (Breiman et al., 1984; Kelleher et al., 2015; Quinlan, 1986). To build the tree, all discharge values are analyzed for their ability to maximize the decrease of the residual of the regression between observed and estimated tracer concentration at each branch. The branching occurs in descending order of error reduction. As a result, the structure of the decision tree can be used as a guidebook for unknown values in order to get the desired tracer concentration C (Figure 2(b)). In the given example, the discharge value at position 0 has the highest influence on error reduction and results in the decision between the major branches, which are themselves split by further discharge values, resulting in the final leaves with the target value C represented as a green dot. The error reduction within the tree for each node is calculated with the root-mean-square error (RMSE) of the regression (see error metrics section). The regression tree analyzes the discharge values to find the values that have the highest influence on the regression problem to determine the predicted tracer concentration. The depth of the CART tree was limited to the number of input values from the time series of runoff in order to capture all details of the variability of discharge.
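The search for the discharge position that reduces the regression error most can be illustrated with a single split of a regression tree. This is a minimal sketch under stated assumptions: each leaf predicts the mean of its targets, candidate thresholds are midpoints between neighboring discharge values, and the function names are hypothetical. It shows one branching step only, not the full tree used in the study.

```python
import math


def sum_squared_error(values):
    """Squared deviation of the values from their mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(windows, targets):
    """Return (position, threshold, rmse) of the split with the lowest RMSE.

    position is the index within the discharge window, threshold is the
    discharge value separating the two branches.
    """
    best = None
    n = len(targets)
    for position in range(len(windows[0])):
        column = sorted(set(w[position] for w in windows))
        for low, high in zip(column, column[1:]):
            threshold = 0.5 * (low + high)
            left = [t for w, t in zip(windows, targets) if w[position] <= threshold]
            right = [t for w, t in zip(windows, targets) if w[position] > threshold]
            rmse = math.sqrt((sum_squared_error(left) + sum_squared_error(right)) / n)
            if best is None or rmse < best[2]:
                best = (position, threshold, rmse)
    return best


# Synthetic example: position 0 separates low from high concentrations perfectly.
windows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
targets = [0.0, 0.0, 10.0, 10.0]
split = best_split(windows, targets)
print(split)
```

Applying the same search recursively to each branch would yield the ordered sequence of decisions that the tree represents.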
An ANN and an ELM (Figures 2(c) and 2(d)) are both variations of neural networks that try to solve the regression or classification problem by imitating the structure of the human brain and by guiding the training data through a network of hidden layers equipped with neurons (Haykin, 1999). Here, the input nodes are the discharge values from the window of discharge values used to estimate the desired tracer concentration. The hidden layers and nodes represent the underlying system, in this case the karst subsurface system. The connection between nodes and layers is trained by the optimization of weights in order to minimize the regression error. An ELM is a special case of an ANN: The nodes on the hidden layer receive their weights only once and keep them constant during the adaptation of the network. Only the weights from the hidden layer to the output node are updated, which is called a feedforward network because of the direction in which the weights are updated (Huang et al., 2004). Here, the discharge values are sent through the network of nodes and hidden layers to identify the pattern and estimate the tracer amount. The network can be trained to estimate either a single tracer or a set of tracers. Generally, the network is restricted to a single hidden layer with half of the input window length as hidden nodes (and a minimum of three hidden nodes for stability reasons).
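The ELM idea of fixing the hidden weights once and adapting only the output weights can be sketched in a few lines. The following example is an illustration under assumptions, not the configuration of the study: the tanh activation, the uniform random initialization, the small ridge term, and the least-squares solution of the output weights are common choices for ELMs, and the class and function names are hypothetical.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


class TinyELM:
    """Single-hidden-layer network with fixed random hidden weights."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Hidden weights and biases are drawn once and never updated.
        self.w = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.beta = [0.0] * n_hidden

    def hidden(self, x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def fit(self, xs, ys, ridge=1e-6):
        """Fit only the output weights: (H^T H + ridge I) beta = H^T y."""
        h = [self.hidden(x) for x in xs]
        k = len(self.beta)
        a = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
              for j in range(k)] for i in range(k)]
        rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
        self.beta = solve_linear(a, rhs)

    def predict(self, x):
        return sum(b * hj for b, hj in zip(self.beta, self.hidden(x)))


# Synthetic normalized discharge windows and tracer concentrations.
windows = [[0.5, 0.8, 1.2, 1.0], [1.5, 2.0, 1.1, 0.9], [0.7, 0.6, 0.6, 0.5],
           [1.0, 1.4, 1.9, 1.3], [2.0, 1.2, 0.9, 0.8]]
targets = [1.1, 0.7, 1.3, 0.9, 0.8]

# Half of the window length as hidden nodes, with a minimum of three.
elm = TinyELM(n_inputs=4, n_hidden=max(3, 4 // 2))
frozen_hidden_weights = [row[:] for row in elm.w]
elm.fit(windows, targets)
print([round(elm.predict(w), 3) for w in windows])
```

Because only the output weights are solved for, training reduces to one linear system, which is the main practical difference to an iteratively trained ANN.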
To avoid overfitting of the data, the number of input data was reduced to a maximum of half of the available runoff data in the window with a minimum of three remaining runoff values as input data. Furthermore, the random selection of input values was shuffled 10 times and the mean prediction was taken to be representative for the specified window length.
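The reduction of the inputs and the averaging over repeated random selections might look like the following sketch. It assumes that a new random subset of window positions is drawn in each of the 10 repetitions and that a model is trained and evaluated per subset; the callback `train_and_predict` and all other names are hypothetical placeholders for any of the four algorithms.

```python
import random


def n_inputs_for_window(window_length):
    """Half of the window length, with a minimum of three input values."""
    return max(3, window_length // 2)


def subsample_positions(window_length, rng):
    """Draw a sorted random subset of positions within the window."""
    k = min(window_length, n_inputs_for_window(window_length))
    return sorted(rng.sample(range(window_length), k))


def mean_prediction(train_windows, train_targets, test_window,
                    train_and_predict, n_repeats=10, seed=0):
    """Average the predictions obtained from n_repeats random input subsets."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        positions = subsample_positions(len(test_window), rng)
        sub_train = [[w[p] for p in positions] for w in train_windows]
        sub_test = [test_window[p] for p in positions]
        predictions.append(train_and_predict(sub_train, train_targets, sub_test))
    return sum(predictions) / len(predictions)


def mean_model(sub_train, train_targets, sub_test):
    """Placeholder model that ignores the inputs and predicts the training mean."""
    return sum(train_targets) / len(train_targets)


train_windows = [[1.0, 1.2, 0.9, 1.1, 1.0, 0.8], [0.5, 0.7, 0.6, 0.9, 1.3, 1.6]]
train_targets = [1.0, 3.0]
test_window = [0.9, 1.0, 1.1, 1.0, 0.9, 0.8]
result = mean_prediction(train_windows, train_targets, test_window, mean_model)
print(result)
```

Any regression routine with the same three-argument interface could replace `mean_model` in this sketch.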
Machine learning algorithms depend on the information content of the data (Goodfellow et al., 2016; Kelleher et al., 2015). Consequently, we assume a link between the performance of the algorithm and the information content of the data (defined in section 2.1). We train the algorithms in two different ways: (1) with a univariate strategy that estimates each tracer individually and (2) with a multivariate strategy that trains one algorithm to estimate both tracers simultaneously. We expect the multivariate strategy to perform better than the univariate one because the combination (i.e., interaction) of data should incorporate more information than the information content of a single data set. A globally trained algorithm to predict a set of natural tracers would lower the interpretability of the results. Thus, we discarded the idea of a universal machine for tracer concentration prediction and focused on the two mentioned natural tracers.

Training
Figure 1. Workflow of the analysis including the clipping of the window for the discharge data, the prediction of tracer signatures by the machine learning algorithms, and the following comparison with measured tracer concentrations.
The discharge data have to be reduced to a window with an unknown length. Whether a given time span covers all information on the system's behavior is difficult to judge, so the choice of the optimal length can be highly subjective. The window to be selected contains the tracer measurement and the discharge values depicting the discharge dynamics. As we do not know whether the window length depends on the chosen approach or region, we varied it from 1 to 180 days in steps of [1, 3, 6, 30, 60, 90, 180] with equally sized borders to account for the unknown optimal length. The window lengths chosen here represent natural breaks within the classification of time to describe a system. We chose these different lengths of the window to include short-, medium-, and long-term processes in the discharge data and to minimize the number of data sets analyzed. Therefore, we focused on time spans like a month, two months, and half a year. The discharge in the sequence is normalized by the catchment-specific average discharge to reduce the influence of the peak. The measured tracer concentrations are also normalized by the catchment-specific mean of the respective tracer. The share of the training data is increased gradually to understand how simulation performance is influenced by the size of the training data. Therefore, we varied the amount of data used for training from 10% to 90% of the available time series for the catchment. Using the length of the covered time series instead would be insufficient because the input data include runoff sequences that might overlap. Hence, the number of tracer measurements is important.
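The normalization by the catchment mean and the stepwise increase of the training share can be sketched as follows. The sketch assumes that the first part of the sample list is used for training; whether the study split the data chronologically or randomly is not stated here, so this choice and the function names are assumptions.

```python
# Window lengths in days and training shares as described in the text.
WINDOW_LENGTHS_DAYS = [1, 3, 6, 30, 60, 90, 180]
TRAINING_SHARES = [s / 10 for s in range(1, 10)]


def normalize_by_mean(values):
    """Divide every value by the mean of the series."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def split_by_share(samples, share):
    """Use the first share of the samples for training, the rest for validation."""
    n_train = max(1, int(round(share * len(samples))))
    return samples[:n_train], samples[n_train:]


# Synthetic example values (not data from the study).
discharge = [1.0, 2.0, 3.0]
normalized = normalize_by_mean(discharge)
samples = list(range(10))
train, validation = split_by_share(samples, 0.6)
print(normalized, len(train), len(validation))
```

Counting samples rather than days in `split_by_share` mirrors the point made above that the number of tracer measurements, not the covered time span, determines the amount of training data.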
We train the algorithms with both a univariate and a multivariate strategy. We compare the results from the two learning strategies to quantify the potential improvement from shared information and joint learning. Furthermore, we discuss the influence of the window length on prediction quality. This is relevant as the length of the input sequences can create a bias in the learning process. If we choose the length too short, we might not cover all relevant processes, whereas sequences that are too long might hinder the algorithms in finding a suitable representation of the system. In the last step, we elaborate on the transferability of the algorithms to be used as predictors at catchments for which they were not trained. That way, we can test whether machine learning tools and their results reveal hidden similarities in catchment responses or, even more interesting, whether machine learning is suitable for the prediction of missing tracer measurement data.

Evaluation and Error Measures
To compare the different machine learning approaches, training strategies and window lengths, quantitative performance measures were used.
In order to show the general prediction performance, the RMSE between observed and estimated tracer concentrations was applied, which becomes 0 for a perfect prediction. To calculate the RMSE for the tracer content, we differentiated between measured c_{T,obs} and predicted c_{T,est}, with N being the number of samples in the validation:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( c_{T,obs,i} - c_{T,est,i} \right)^2}$$
We apply the RMSE for both tracers individually and calculate the mean of both as an indication of the combined error. Because of the variable window length, individual RMSEs are calculated for each approach and each region. Because the RMSE, unlike the mean error, does not show the direction of the error, and because the mean error is less robust against outliers, we also analyze the average concentration ratio c_T, the average ratio between predicted and measured concentrations, which provides information about the general strength and direction of the prediction error. c_T shows the direction and the strength of the error by the sign and the magnitude of its difference from one, respectively. Again, because of the multitude of different window lengths, a range of c_T values is calculated for each region and approach.
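Both quantitative measures are simple to compute, as the following sketch shows. The RMSE follows its standard definition. The exact formula of the average concentration ratio is not reproduced in this text, so the form used here, the mean of the ratios of estimated to observed concentration, is an assumption, as are the function names.

```python
import math


def rmse(observed, estimated):
    """Root-mean-square error between observed and estimated concentrations."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


def mean_concentration_ratio(observed, estimated):
    """Assumed form of the average concentration ratio: mean of estimated / observed."""
    return sum(e / o for o, e in zip(observed, estimated)) / len(observed)


# Synthetic normalized concentrations (not data from the study).
observed = [1.0, 2.0, 0.5]
estimated = [1.2, 1.6, 0.5]
print(rmse(observed, estimated))
print(mean_concentration_ratio(observed, estimated))
```

Under this assumed form, a value above one indicates overestimation on average and a value below one indicates underestimation.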
As all the measures presented so far are merely measures of quantitative performance, the qualitative performance is measured with the accuracy of the internal ranking of the two tracer signatures. Therefore, we calculated the accuracy from an error matrix of true and false combinations of ranking. The derived measure of accuracy acc describes the qualitative information between the two tracers as the share of correctly reproduced rankings (Han & Kamber, 2010):
$$acc = \frac{pos_{True} + neg_{True}}{N}$$
where pos_True and neg_True count the samples in which the ranking of the pair of tracer concentrations is reproduced correctly and N is the number of samples in the validation. For example, c_{TA,obs} > c_{TB,obs} but c_{TA,est} < c_{TB,est} results in a false prediction, whereas c_{TA,obs} > c_{TB,obs} and c_{TA,est} > c_{TB,est} counts as pos_True. Accuracy shows the ability of the machine learning method to replicate the ranking of the tracer concentrations in order to replicate changing tracer dynamics.
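The ranking accuracy can be illustrated with a short sketch. It counts a sample as correct when the estimated concentrations of tracers A and B have the same order as the observed ones. The treatment of ties (equal concentrations count as "A not larger than B") and the function name are assumptions made for this example.

```python
def ranking_accuracy(obs_a, obs_b, est_a, est_b):
    """Share of samples in which the ranking of tracers A and B is reproduced."""
    correct = 0
    for oa, ob, ea, eb in zip(obs_a, obs_b, est_a, est_b):
        # Correct if both observation and estimate agree on whether A exceeds B.
        if (oa > ob) == (ea > eb):
            correct += 1
    return correct / len(obs_a)


# Synthetic normalized concentrations (not data from the study).
obs_a = [2.0, 1.0, 3.0, 1.0]
obs_b = [1.0, 2.0, 1.0, 3.0]
est_a = [3.0, 2.0, 1.0, 0.0]
est_b = [1.0, 3.0, 2.0, 1.0]
acc = ranking_accuracy(obs_a, obs_b, est_a, est_b)
print(acc)
```

In this toy example three of the four rankings are reproduced, although none of the estimated concentrations matches its observation exactly, which is the qualitative information the measure is meant to capture.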
The three measures considered here to judge the performance represent the key characteristics of the prediction results: the overall goodness represented by the RMSE, the deviation from the mean, and the ranking between both tracers. With a correct ranking, the qualitative information on which tracer concentration dominates is still captured, even if the variance of the prediction is not high enough.

Data
The target variables of the machine learning prediction are the concentrations of SO₄²⁻ and NO₃⁻ that act as a combined tracer signature. While NO₃⁻ is known as an indicator for fast water fluxes from the soil or epikarst, that is, the shallow subsurface (Hartmann et al., 2016; Mahler & Garner, 2009), SO₄²⁻ in karst systems is usually derived from geogenic processes that dissolve evaporites in the phreatic subsurface that sustains base flow (Hartmann et al., 2017; Mudarra & Andreo, 2011). We chose these two tracers as an example for any tracer combination. Due to their different origins, the shallow subsurface (NO₃⁻) and the phreatic zone (SO₄²⁻), we expect their observations to carry different information.
The data for our analyses originate from seven different karst springs in France (Table 1 and Figure 3). Tracer measurements were normalized by individual mean values, leading to seven different means (Eaufrance, 2018a). The tracers analyzed in this study are natural tracers; no human-induced injections were made. The tracer concentrations were measured repeatedly, but not at fixed intervals. There was a strong linear correlation between both tracers SO₄²⁻ and NO₃⁻ with r = 0.67. Measured discharge values were obtained from Banque Hydrologique and have a daily resolution (Eaufrance, 2018b). Banque Hydrologique publicly provides daily discharge data of continuously measured rivers and springs collected by French state agencies.
The two springs Baget and Fontestorbes are located in the Pyrénées Mountains (Ariège department) at a median altitude of 1,000 m. The recharge areas are 13 and 80 km² for the Baget and Fontestorbes spring, respectively. Mean daily discharge is 2.1 m³/s at the Fontestorbes spring, which is one of the largest intermittent karst springs in the world, and 0.5 m³/s at the Baget spring. Due to the similarity of the two midaltitude basins (Labat et al., 2002), a mean annual precipitation of 1,178 mm (Bailly-Comte et al., 2018) can be assumed for both locations. The Durzon spring is located on the Larzac Plateau in the Grands Causses area in the Massif Central (Aveyron department). It is a perennial, vauclusian-type spring with a mean daily discharge of 1.5 m³/s. The recharge area has been determined to be >100 km² (Jacob et al., 2008). The Fontaine de Vaucluse spring is a well-described and famous karst spring and the largest karstic outlet in France (Vaucluse department). The mean daily discharge is over 20 m³/s, and the low flow discharge is always higher than 4 m³/s. The recharge area is about 1,115 km² (Fleury et al., 2009). The Fontbelle spring is part of the Ouysse karst system (Lot department) (Kavouri et al., 2011). The Source de la Touvre is the second largest karst spring in France and the sole outlet of the Rochefoucault karst system (Charente department). The spring, fed by the losses of three large rivers, has a mean daily discharge of 13 m³/s and a recharge area of about 126 km². The water resources are used for the water supply of Angouleme city. The Source du Lez is the main perennial outlet of the Lez karst system (Montpellier department) with a mean daily discharge of 2 m³/s. Pumping for the water supply of Montpellier city puts the aquifer under high anthropogenic pressure (Bicalho et al., 2017).
More details about the springs are provided in Table 1 and Figure S1 (see supporting information) or on the database webpage (hydro.eaufrance.fr).

Entropy and Mutual Information of Available Data Sets
Following the principle of continuous entropy, the information content of discharge and the mutual information of the joint data sets (tracer signatures) were calculated. We resampled the complete set of sequences ten times and looked at the mean entropy of each individual data set and the mutual information of the two different tracer signatures, SO₄²⁻ and NO₃⁻. Missing or erroneous results are labeled NA, which leads to the gaps shown in the information contents of springs like Fontaine de Vaucluse (see supporting information).
The Baget example shows that the entropy of discharge decreases when more data are used for training (Figure 4). The mutual information between the two tracers exceeds the continuous entropy of discharge by far. The information content shared between those two tracers is 35 times higher than the continuous entropy of the discharge. That means that we need a lot of information to fully describe the variability of the interplay between those two tracers, and we might not successfully describe this variability with the discharge data alone. When more than 60% of the available tracer data sets are used, the mutual information reaches a plateau where no further information is needed to describe the dynamics. The behavior of MI is similar for all other catchments: The information content is by far higher than the continuous entropy of discharge, and a plateau is reached using at least 60% of the data. Therefore, we assume that we need at least 60% of the available tracer measurements to cover the variability of the system's dynamics in the training. For more details, we refer to the supplement (Figure S2), where the entropy and the mutual information for all catchments are shown in detail.

Validation of Prediction Accuracy
For the validation of the prediction accuracy, we compared two different learning strategies: the univariate strategy, focusing on only one tracer at a time, and the multivariate strategy, considering both tracers in the same learning phase. The results shown here represent all considered sizes of the discharge window. The prediction results are presented as a boxplot to show the variability and the influence of the different window lengths without going into detail on the specific influence of the window (Figure 5). The average tracer concentration ratio c_T indicates that the tracer signatures can be predicted better at some springs than at others. Furthermore, the results show a preference toward certain prediction techniques with a c_T value close to the optimum value. For the Fontaine de Vaucluse, Fontbelle, Source de Fontestorbes, and Source du Lez, c_T converged to the optimal value of 1.0. The differences between the machines were marginal, although ELM and ANN results were less variable and thus less influenced by the amount of training data. For the Baget catchment, we could not predict the concentrations with any machine as the variability is high for all applied approaches and amounts of training data. For the catchments Durzon and Source de la Touvre, either NO₃⁻ or SO₄²⁻ was overestimated or underestimated, although CART delivered acceptable results for the Source de la Touvre.
The RMSE of the prediction from all investigated window lengths is presented as a boxplot in Figure 6. The RMSE of the tracer concentrations shows results similar to those for c_T. While for some catchments RMSEs were low regardless of the chosen machine, for catchments like Baget the results are worse than for catchments like Fontbelle and Source de la Touvre. If the c_T of the catchment does not converge to 1.0 (like the SVM in Source du Lez), the RMSE is higher than in regions like Fontaine de Vaucluse and Fontbelle, where c_T is close to the optimum. The choice of the machine has only a small influence on the RMSE, apart from Source du Lez, where the SVM delivers worse results than any other method. Generally, an RMSE lower than 1.0 is an acceptable value for the prediction of the normalized concentration. This limit is reached for all machines in the catchments Fontaine de Vaucluse, Fontbelle, Source de Fontestorbes, and Source de la Touvre, while at Baget, Durzon, and Source du Lez the RMSE remains highly variable. Whether a univariate or a multivariate approach results in a lower mean RMSE cannot be stated with certainty from these results, but in most cases the mean RMSE of the multivariate approach was lower than the mean RMSE of the respective univariate approach.

The Acc value describing the correct ranking of tracer concentrations shows for all catchments that at least 40% of the rankings are estimated correctly (Figure 7). None of the machines reached mean Acc values >70%.
Here, the choice of machines has an influence on the dynamics of the tracer concentrations. The Acc values were highest for catchment Baget compared to all other catchments, while showing the highest variability of c_T. The multivariate prediction does not automatically improve the results in terms of Acc at all catchments, and the improvement or deterioration varies among the applied approaches (e.g., SVM and ELM in Durzon). The reason behind this might be found in the interplay of information content, regional aspects of the catchment, and the quality of the input data. Therefore, it is beyond the scope of this paper to check the causality of the preferred choice. Nevertheless, in most catchments, the multivariate machines improve Acc. Again, the choice of the machine has less impact on the results and is merely a catchment-specific question.
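The three performance measures used in this section can be sketched in a few lines of Python. The formal definitions are given in the methods and are not repeated here, so the formulas below are assumptions for illustration: c_T is taken as the ratio of mean predicted to mean observed concentration (optimum 1.0), RMSE is the usual root-mean-square error on normalized concentrations, and Acc is taken as the share of time steps at which the predicted ordering of the two tracers matches the observed ordering. All function names are hypothetical.

```python
import math


def concentration_ratio(pred, obs):
    """Assumed c_T: mean predicted over mean observed concentration."""
    return (sum(pred) / len(pred)) / (sum(obs) / len(obs))


def rmse(pred, obs):
    """Root-mean-square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def ranking_accuracy(pred_a, pred_b, obs_a, obs_b):
    """Assumed Acc: share of time steps where the predicted ordering of
    tracer a and tracer b agrees with the observed ordering."""
    hits = sum(
        (pa > pb) == (oa > ob)
        for pa, pb, oa, ob in zip(pred_a, pred_b, obs_a, obs_b)
    )
    return hits / len(obs_a)
```

Under these assumed definitions, a machine can reach a c_T near 1.0 while still ranking the tracers poorly, which is consistent with the pattern reported for Baget.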
The influence of the chosen window length on the prediction capability of NO3− and SO42− is exemplified by the c_T values of all four (univariate) machines in catchment Source de Fontestorbes (Figure 8). Generally, either very short windows (1-4 days) or long windows (>60 days) lead to good results, while window lengths in between worsen the results for SVM, CART, and ELM. For further information on the window dependency of the other catchments, which are very similar to the information we derived from our example, we refer to the supporting information.
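To make the role of the window length concrete, the following Python sketch shows one way the input samples could be assembled: for every day with a tracer measurement, the preceding discharge values of a given window length form the input vector. The data layout (a daily discharge list and a dictionary mapping day index to concentration) and the function name are assumptions for illustration, not the preprocessing used in this study.

```python
def build_samples(discharge, tracer, window):
    """Pair each tracer measurement with the discharge of the last `window` days.

    discharge: list of daily discharge values, index = day
    tracer:    dict {day index: measured concentration} (assumed layout)
    window:    number of days of discharge used as input
    """
    inputs, targets = [], []
    for day in sorted(tracer):
        # Skip measurements without a complete discharge window before them.
        if day < window - 1 or day >= len(discharge):
            continue
        inputs.append(discharge[day - window + 1 : day + 1])
        targets.append(tracer[day])
    return inputs, targets
```

Repeating the training for several values of `window`, for example a few days and more than 60 days, would yield the kind of comparison summarized in Figure 8.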
As a good example for choosing an approach with the required number of training data for a catchment, we elaborate the case of Fontbelle (Figure 9). Here, ANN and SVM obtain c_T values close to the optimum of 1.0, but the ANN results in lower RMSE values than the SVM. Therefore, we chose the ANN to predict tracer concentrations in this catchment. The resulting time series (predicted by an ANN trained with 70% of the available measurements) reveals that the measured tracer concentrations and the predicted time series show an acceptable agreement, with the mean value of concentration captured as well as the general ability to predict concentrations at all levels measured.
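The selection logic described for Fontbelle can be written as a small rule: keep the machines whose c_T is close to 1.0 and take the one with the lowest RMSE among them. The sketch below is only one possible formalization; the tolerance of 0.1 and the function name are assumptions, and any scores passed to it are the user's own.

```python
def select_machine(scores, ct_tolerance=0.1):
    """Pick a machine from {name: (c_T, rmse)}.

    Machines with |c_T - 1| <= ct_tolerance are preferred (assumed threshold);
    among those, the lowest RMSE wins. If none qualifies, all are considered.
    """
    candidates = {
        name: s for name, s in scores.items() if abs(s[0] - 1.0) <= ct_tolerance
    }
    if not candidates:
        candidates = scores
    return min(
        candidates,
        key=lambda name: (candidates[name][1], abs(candidates[name][0] - 1.0)),
    )
```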
Taking a closer look at the prediction capability for SO42−, we can see that the multivariate approach interpolates concentration in the same range, even close to a concentration of 0.0 mg/L (red marked area in Figure 9). The multivariate approach is able to cover the peaks, while the univariate approach predicts values close to the mean concentration. Interestingly, the mean tracer concentration rises over time using the univariate approach. However, the behavior of NO3− is different: The univariate prediction shows a variability that reflects the measured tracer concentrations better, while the multivariate machine predictions show too low a variability around the observed mean concentration. As shown by the red marked area of Figure 8, the univariate approach allows interpolating NO3− concentrations from Day 2,000 to Day 3,200. The following decreasing trend cannot be interpolated, and thus, the approach does not perform well from Day 3,200 until the end.

Discussion
Missing tracer measurements in terms of gaps or irregular measurement campaigns are the major downside in using these data to develop models for system characterization. In many cases, it is not possible to repeat the measurements for the desired tracers, for instance, when data are obtained from online databases like that of the U.S. Geological Survey. Furthermore, only limited knowledge is available on the information content of the data used in tracer-aided modeling (Hartmann et al., 2017; Kelleher et al., 2019). Our results indicate that machine learning algorithms represent a valuable technique to predict some characteristics of tracer concentrations in karstic environments. Even though none of the machine learning methods were able to describe the complete dynamics between the two tracers with high precision, our comparative approach of using different machine learning methods allows us to choose the most appropriate method describing a specific characteristic at a specific site. Hence, we are able to predict key characteristics like the mean concentration and the relative ranking of tracers in a joint tracer analysis. The reason that tracer concentration dynamics could not entirely be predicted by discharge alone is the low information content of discharge compared to the shared information of the tracers. The use of ancillary data or more sophisticated approaches to improve our prediction is hampered by data limitations or uncertain measurement quality. Consequently, the prediction capability of the algorithms is lowered by the limitation to discharge data and the low temporal resolution of concentration measurements. Thus, results have to be interpreted carefully and with special regard to the information content of the underlying data.

As in other machine learning applications in hydrology, the most promising algorithm has to be found through trial and error (He et al., 2014; Raghavendra & Deka, 2014). Hence, we adapted the research design to the No-Free-Lunch-Theorem (Wolpert & Macready, 1997) and compared four different algorithms from two of the main machine learning families (Kelleher et al., 2015). We assumed that discharge data are able to provide enough information to describe the interplay between tracer measurements and to predict the concentrations. However, the continuous entropy of discharge and the mutual information between NO3− and SO42− emphasized that the information needed to describe the interplay between this pair of tracers is far higher than the continuous entropy of the discharge data alone. Although the algorithms were able to predict certain aspects like the mean concentration and peaks quite well, the complete variability could not be predicted. In contrast to concentration-discharge relations that require distinct knowledge on the measured data and the catchment, our study shows that machine learning algorithms can be trained from databases with few discontinuous measurements to provide continuous reconstructions of tracer concentrations.
With knowledge on the required information content and the delivered information content, we were not able to distinguish properly among the different approaches and a further choice would depend strongly on the focus of the task: Would we like to predict the tracer concentration, or is the ranking of tracer methods for the dynamic description more important? This lack of a clear preference of the chosen machine learning methods can also be observed in other comparative machine learning studies in hydrology, for example, in flood event separation (Mewes & Oppel, 2019) and the simulation of streamflow (Shortridge et al., 2016). Similarly to their results, there might not be a single machine for all purposes that works with our data set, but a set of machines that work together to deliver the desired results, which was shown to be useful for hydrological modeling in general (Clark et al., 2008). We assume that the interplay of the information content of the tracers and discharge determines the choice of the best working algorithm. This assumed link between information content of data, prediction performance, and method preference might be a way to regionalize karst catchments by a data-driven approach (Abdollahi et al., 2017).

Consequently, our comparative analysis of algorithms and learning approaches allowed setting up a strategy to use the aforementioned algorithms to predict tracer signatures. Interestingly, the preferred lengths of the input sequence of discharge fall into two groups: a group that prefers short windows and a group that prefers long windows. This might be related to different processes linked to the transit time of the karst spring, which means that we use the information of the time spent by the water in the karst system (Hartmann et al., 2016). While SO42− requires long times to dissolve from the karstic rock to the water, NO3− dissolves faster. This is the reason that the two tracers are investigated: to separate slow from fast water. Here, SO42− could be predicted better by long windows of input data, while NO3− was predicted better with short input windows.
Apparently, the information that we use right now is sufficient for peak concentrations and the mean values, but concentrations of SO42− close to 0.0 mg/L lead to errors (Figure 9). Hence, processes that lead to low SO42− concentrations in the discharge are not yet covered by the discharge data and should be included with ancillary data. Such multi-input machine learning applications are widely used in remote sensing and other applications but underrepresented in hydrology, because knowledge on the information content of the input data is crucial for their application and remains unknown in many hydrological studies (Mountrakis et al., 2011; Piotrowski et al., 2007; Zheng et al., 2015).
Overall, our investigations show that we cannot state a clear preference toward a single approach. However, the introduction of a comparative framework helps to identify the most appropriate solution to predict tracer concentrations for a specific catchment. In the following parts of the discussion, we adapt our concept of entropy and present a preliminary framework that could be used to predict tracer concentrations.

Improvements for Concept of Entropy
Due to the mixed results of the multivariate approach, we analyzed the results of both approaches, univariate and multivariate, as an example and learned that we need one tracer to predict the other. As we can see from the interpolated time series of catchment Fontbelle, the multivariate approach performed better for SO42− than for NO3−. Therefore, the additional information from NO3− helped the algorithm to find the pattern in SO42−. Hence, a framework should consist of a univariate ANN to predict NO3−, which acts as additional information to predict SO42−.
To reveal the relationship of explanatory power between predictors and variables, we transfer the concept of mutual information to conditional, or relative, entropy (Chacon-Hurtado et al., 2017; Corzo & Solomantine, 2007; Keum & Coulibaly, 2017). The conditional entropy shows that NO3− has a higher conditional explanatory power than SO42− to be predicted by discharge (Figure 10). This means that a univariate approach is  . Consequently, we can use the concept of conditional entropy to decide whether a univariate or a multivariate approach should be preferred and which tracer measurement can be used as ancillary data for the prediction of other tracer concentrations.
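The decision rule based on conditional entropy can be illustrated with a short Python sketch. It assumes that discharge and both tracers have already been discretized into class labels (for instance by histogram binning) and uses the identity H(Y|X) = H(X,Y) − H(X). The function names and the idea of quantifying the benefit of a second tracer as a reduction in conditional entropy are assumptions for illustration, not the exact procedure behind Figure 10.

```python
import math
from collections import Counter


def entropy(labels):
    """Discrete Shannon entropy in bits of a sequence of hashable labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) = H(target, given) - H(given), in bits."""
    return entropy(list(zip(target, given))) - entropy(given)


def gain_from_ancillary(target, discharge, ancillary):
    """Reduction of H(target | discharge) when a second tracer is also known.

    A clearly positive value would favor a multivariate approach for `target`;
    a value near zero would suggest that a univariate approach is enough.
    """
    both = list(zip(discharge, ancillary))
    return conditional_entropy(target, discharge) - conditional_entropy(target, both)
```

A low H(tracer | discharge) indicates that discharge alone already explains much of that tracer, while a large gain indicates that the other tracer is useful ancillary input.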

Application of Machine Learning in Interpolation of Tracer Time Series
Discharge separation by tracers relies on tracer observations which are often limited in availability (Birkel & Soulsby, 2015;Klaus & McDonnell, 2013). We assumed that machine learning is a tool to interpolate time series of tracers by discharge observations. Keeping the aforementioned downsides of machine learning in mind, the shown interpolation capability of the algorithms is a valuable addition to discharge separation applications (Garvelmann et al., 2017;Klaus & McDonnell, 2013).
As the explanatory power of discharge alone is too low to describe the interplay between the tracers in all its variations, the question toward the filling of the gaps by machine learning tools has to be precise. In our framework, an extensive preanalysis was conducted to show the general applicability in terms of RMSE and c_T for all considered algorithms and amounts of available training data. The length of the input sequence again is a source of uncertainty in our approach, but we were able to link good prediction results with the geochemical residence time of the tracer in the system. So, for hypothesis testing on transit times, the machine learning approach can be utilized. To describe the uncertainty of the prediction, both lengths of input sequences should be used: a short window length of discharge to catch short residence time processes and a long window of discharge to catch slow processes. Nevertheless, the definition of short and long windows is catchment-specific and has to be determined either by a data-driven preanalysis or detailed knowledge of the respective catchment, which would be identical to the calibration of a hydrological model (Wu & Chau, 2011).
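One simple way to express the resulting prediction uncertainty is to combine the time series predicted with a short and a long discharge window into an envelope. The Python sketch below takes any number of prediction series and returns the minimum, mean, and maximum per time step; the function name and the use of a min/max envelope as the uncertainty measure are assumptions for illustration.

```python
def ensemble_envelope(members):
    """Combine equally long prediction series into (min, mean, max) per step.

    members: list of prediction series, e.g. one from a short-window machine
             and one from a long-window machine (assumed setup).
    """
    envelope = []
    for values in zip(*members):
        envelope.append((min(values), sum(values) / len(values), max(values)))
    return envelope
```

A wide envelope at a given time step would indicate that the prediction depends strongly on the chosen window length there.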

Conclusions
Our study explored the use of machine learning algorithms for the prediction of tracer measurements. Since time series of tracer measurements are often too sparse for modeling, machine learning tools can potentially be useful for researchers with limited access to environmental tracer data or limited resources to obtain additional measurements. We could show that our selected machine learning tools were able to identify some characteristics of the observed tracer concentrations like average concentrations or the appropriate constellations of tracer concentrations at the selected test sites. Our analysis also revealed that the information content of discharge alone is not sufficient to predict tracer concentrations in their entire variability, as the mutual information between the pairs of tracers is higher than the continuous entropy of the discharge data. For that reason, the prediction capability of the machine learning algorithms is lowered substantially. The interpretation of the predicted time series has to be done with care, because the predicted time series lack extreme concentrations that are abundant in the observations. Moreover, we were able to build a preliminary framework that creates an ensemble of predictions addressing the uncertainty of a machine learning-based approach by eliminating the bias of the chosen input sequence length and the learning approach of the algorithms. All methods considered in this paper deliver acceptable results in comparison, but the choice of the most suitable algorithm remains catchment-specific and should be based on site-specific knowledge (e.g., residence time estimations) or extensive data-driven preanalysis. We found that the amount of required training data is high, as the mutual information between the pair of tracers requires at least 60% of the available data to reach a plateau. Hence, the training of the machines is not likely to be successful in data-poor regions.
We conclude from our investigations that the setup of a framework to predict tracer concentrations with machine learning tools remains challenging. Nevertheless, we show that the process of setting up the machine learning-based ensemble framework can be facilitated by information-based analyses like the concept of entropy, conditional entropy, and mutual information. Knowledge on the information content of the data helps to justify the nonobvious choice of methods when facing "black-box" machine learning approaches. Moreover, such analyses could be the basis for future regionalization of catchments and the transfer of trained machines to data-poor regions, in case the machine learning approaches were trained in information-rich environments. By training with information-rich data, linkages between processes that are hidden in data, like discharge data, become transferable and quantifiable. Hydrological models, on the other hand, require the same amount of data regardless of their information content. So measurements too few for traditional hydrological models may still contain sufficient information to improve machine learning models. Overall, we are just at the threshold of using data-driven approaches in hydrology, especially in complex environments like karst. Despite the problems that we still have to face in the future, advanced data-driven machine learning approaches may allow further improvements of data analysis, model calibration, and model development.
Although there is no silver bullet in predicting tracer concentrations, we could show by the input window analysis that the characteristics of the assumed transit time of tracers become visible in the most suitable input window lengths for the prediction. Moreover, by analyzing how a machine learns data patterns and investigating the results of the prediction, our study highlights the importance of an information content analysis. This opens the field of further entropy-based approaches of data mining in hydrological contexts, especially in often data-sparse applications like karst hydrology.