Diagnosis of Model Errors With a Sliding Time‐Window Bayesian Analysis

Deterministic hydrological models with uncertain, but inferred‐to‐be‐time‐invariant parameters typically show time‐dependent model errors. Such errors can occur if a hydrological process is active in certain time periods in nature, but is not resolved by the model or by its input. Such missing processes could become visible during calibration as time‐dependent best‐fit values of model parameters. We propose a formal time‐windowed Bayesian analysis to diagnose this type of model error, formalizing the question "In which period of the calibration time‐series does the model statistically disqualify itself as quasi‐true?" Using Bayesian model evidence (BME) as a model performance metric, we determine how much the data in time windows of the calibration time‐series support or refute the model. Then, we track BME over sliding time windows to obtain a dynamic, time‐windowed BME (tBME) and search for sudden decreases that indicate an onset of model error. tBME also allows us to perform a formal, sliding likelihood‐ratio test of the model against the data. Our proposed approach is designed to detect error occurrence on various temporal scales, which is especially useful in hydrological modeling. We illustrate this by applying our proposed method to soil moisture modeling. We test tBME as a model error indicator on several synthetic and real‐world test cases that we designed to vary in error sources (structure and input) and error time scales. Results demonstrate the successful detection of errors in dynamic models. Moreover, the time sequence of posterior parameter distributions helps to investigate the reasons for model error and provides guidance for model improvement.


Introduction
Water resources research and management is increasingly relying on mathematical models to gain a deeper system understanding, to predict a system's current state (now-casting), and to predict a system's state under future conditions (forecasting). Unfortunately, from a philosophical point of view, all models are wrong because they are simplifications of the real system (G. E. P. Box, 1976). Model-structural errors generally arise when the dominant processes are not comprehensively and correctly represented in a model. It is important to investigate the occurrence of structural errors for several reasons: (1) Model errors lead to biased predictions (Brynjarsdóttir & O'Hagan, 2014; Ajami et al., 2007) and hence compromise the usefulness of a model for practically all modelling purposes (process identification, nowcasting, forecasting) and subsequent management decisions (Sargsyan et al., 2019; Steinschneider et al., 2015). (2) Model errors lead to biased parameter identification through calibration or Bayesian parameter inference (Xu & Valocchi, 2015) or cause non-identifiability of parameters (Menberg et al., 2019; Reichert & Mieleitner, 2009), because parameters are optimized in the wrong context; they are adjusted regardless of their physically feasible ranges to obtain a satisfying agreement with observed data despite the model error (Oreskes et al., 1994).
Examples of misspecified mechanisms or missing mechanisms in hydrological models are given by Gupta et al. (2012). Model uncertainties and errors typically exist in model inputs (forcings) (Kuczera et al., 2006; Del Giudice et al., 2015; Del Giudice et al., 2013), model building (Clark et al., 2008), and observations of model output (Renard et al., 2010). Working with a model with structural errors becomes a problem if we are interested in true parameter values that relate to property or process identification, or if we make predictions by extrapolating a physically-based model beyond calibration conditions. When parameters compensate for model structural errors, they lose their physical meaning (Xu et al., 2017; Reichert & Schuwirth, 2012). Such errors can also show more subtly, e.g., through the fact that dynamic parameters or a nonstationary model status improve model performance (Wagner et al., 2016; Pianosi & Wagener, 2016; Getahun & Van Lanen, 2015; Morán-Tejeda et al., 2015). Examples of processes that often cause time-dependent model structures in hydrological systems are soil cracking-shrinking and freezing-thawing cycles, complex vegetation dynamics, or transient changes of river morphology. Model-structural errors can also be caused by neglecting major processes (e.g., macro-pore flow) or by the application of ill-suited model concepts.
Model-structural errors are currently addressed by methods that typically follow one of two primary directions: error detection or error mitigation. To detect error occurrence, the temporal change of model performance is often used as an indicator. Klemeš (1986) presented a generalization of the routine split-sample test. That study shows varied model performance in the calibration period, which is associated with varied model conditions or model inadequacy. Following this concept, Coron et al. (2012) proposed the generalised split-sample test that uses a sliding window to calibrate the model with subsets of the time series data and to study model deficiencies by analysing the parameter transferability. Motavita et al. (2019) proposed the comprehensive differential split-sample test (CDSST) to study (non-)stationarity of calibrated parameters. This research even suggests that, for a model with errors, the choice of hydrological conditions in calibration is more important for model reliability than the length of the calibration dataset. Westra et al. (2014) proposed to use parameter nonstationarity as a hint to diagnose model-structural errors. Pathiraja et al. (2016) allow parameters to vary in time and use data assimilation to detect temporal patterns. While the mentioned approaches rely on common model evaluation metrics, Ruddell et al. (2019) propose to diagnose structural error by quantifying the trade-off between functional and predictive performance with information-theoretic measures.
Another approach to detect structural errors is to compare the performance of competing model candidates in calibration on varied datasets. The underlying assumption is that structural errors show up as a significant decrease in prediction quality of any specific data type when calibrated on a different data type: parameter compensation prevents the model from finding a robust best-fit parameter set for all relevant data types. For example, Wöhling et al. (2013) use multiobjective calibration as a diagnostic tool to identify structural errors in coupled soil-plant models. A later study uses Bayesian model selection (Raftery, 1995) to evaluate the worth of different calibration data types. Bayesian model selection (BMS) works under the assumption that one of the models in the set is true; if the identified (pseudo-)true model changes depending on the chosen calibration dataset, this is an indication of varying expression of model-structural errors with varied calibration strategy. A review of BMS in a more general model context is given by Höge et al. (2019). Del Giudice et al. (2015) assess the decrease of model-structural errors with increasing model complexity and draw conclusions about the relative importance of model-structural errors vs. input errors for the quality of model predictions.
To mitigate the impact of model-structural errors on model performance, the majority of approaches treat the apparent bias in model output, i.e., these approaches weaken the commonly violated assumption of uncorrelated errors in the likelihood function/objective function during calibration. For example, Ammann et al. (2019) investigate various error models and report that including non-stationary autocorrelation improves model performance. In contrast, however, Barkle et al. (2011) reported unacceptably large prediction errors when using the AR(1) model in the likelihood formulation. Smith et al. (2015) summarize the performance of eight residual-error models and report a significant change in the posterior mode of their key model parameter as a function of the error model. Generalized likelihood uncertainty estimation (GLUE) (K. Beven & Binley, 1992) is a variation of the likelihood formulation that explicitly expresses the autocorrelation in errors and widens the formal likelihood function to subjective limits of acceptability based on expert knowledge (Ragab et al., 2020; Mirzaei et al., 2015). Instead of correcting the likelihood function, Demissie et al. (2009) propose a data-driven error model to reduce the bias in groundwater model predictions.
Another branch of studies aims to tackle structural deficits within the model, not only "at the end of the pipe". For example, Xu and Valocchi (2015) infer a Gaussian process error model to improve parameter estimation and model prediction quality. Reichert and Mieleitner (2009) introduce stochastic time-dependent parameters to correct for structural errors within the model. Their goal is to achieve more realistic prediction uncertainty intervals.
In this work, we are focusing on model error detection, because we believe that effective model improvement (making the model do the right things for the right reasons) has to rely on solid error detection as a foundation. This sequential procedure seems scientifically more satisfying than aiming for better model performance without knowing the reasons for the model's misbehavior. We therefore present a method for error detection here, and leave an extension toward error mitigation for future work.
Despite the increasing scientific interest in model-structural errors, existing methods still suffer from major limitations. Current methods for diagnosing model-structural errors require substantial a priori expert knowledge about model errors (Westra et al., 2014; Hrachowitz et al., 2014), trade statistical rigor for easy applicability, and/or suffer from numerical inefficiency (Reichert & Mieleitner, 2009). The role and limitations of hypothesis testing for identifying model-structural errors have been discussed recently by, e.g., Nearing et al. (2016); Neuweiler and Helmig (2017); K. J. Beven (2018); Nearing et al. (2020). Nevertheless, these studies agree that, until now, the hydrological community lacks a rigorous framework specifically designed for the diagnosis of model-structural errors.
We intend to fill this gap by proposing a time-windowed Bayesian analysis framework to detect model-structural errors. We target time-dependent errors in dynamic models, but do not assume any a-priori knowledge about error sources or error patterns. In the spirit of Bayesian hypothesis testing, we identify periods in the calibration time series that feature statistically significant model error.
We start from the assumption that the model can describe the real-world system sufficiently well for most of the time, except when some unresolved mechanism occasionally occurs. This means that we can claim the model to be quasi-true for specific periods in "time windows", in the sense that it fits the observation data sufficiently well within, e.g., one or two days or a week. Only when we increase that time window do we put the model into potential trouble: if the increased time window now contains data impacted by any unresolved mechanism, the calibration goodness-of-fit will deteriorate, and parameter identification will become visibly ambiguous. At that stage, the model parameters are forced to move away from their previous (allegedly) well-calibrated state, in order to compensate for the now occurring structural error.
We detect this problematic phase by calibrating the model within a pre-defined shorter time window and moving this time window through the whole data time series (hence the name of our method, "moving time-window analysis"). While moving the time window, we track model performance and search for sudden decreases that would indicate an onset of model error.
We build our framework on the mathematically rigorous grounds of Bayesian probability theory and use Bayesian model evidence (BME) as the model performance metric. BME is perfectly suited to judge a model's performance under parameter and data uncertainty, because it evaluates the model's likelihood to have generated the data, averaged over the model's parameter space. We use a brute-force Monte Carlo implementation of BME (Schöniger et al., 2014) as a basis for our proposed routine. This numerical implementation is computationally heavy, but often practically feasible. The additional effort for model error detection is moderate compared to the baseline effort of setting up a predictive ensemble for brute-force Monte Carlo Bayesian updating, because it does not require any additional model runs.
To identify statistically significant drops in performance that are triggered by an onset of model error, we introduce a tBME reference distribution as sampling distribution. We start by asking: what should the tBME value be if the model were true (i.e., error-free)? We randomly sample a realization of model output as a synthetic dataset to generate a series of theoretical tBME values. We repeat this random sampling procedure many times to produce a random sample of the tBME series that accounts for the full output distribution of the model. This procedure is similar to the bootstrapping analysis of BMS results introduced by Schöniger et al. (2015), with the synthetic data here being drawn from the model's output distribution instead of being resampled from noisy observed data. Sampling the distribution of model states to perform "hypothetical calibration" is known from Bayesian pre-posterior analysis (see, e.g., Leube et al., 2012; Nowak & Guthke, 2016). With that tBME sampling distribution, we have a reference for each considered time window. If the tBME value given real data in a specific time window falls outside of the sampled range, it is very unlikely that the model is true, and hence it is very likely that we spotted the onset of model error. If, in contrast, the tBME value given a field dataset falls within the reference interval, we conclude that the model fulfills the statistical requirements of a true model at this moment. This model can be considered a quasi-true model with (statistically) negligible structural error. Technically, this is a fully data-driven likelihood-ratio test, i.e., free of parametric assumptions about the distributional shape of model errors.
Our Bayesian approach does not presume any model error pattern, and it can handle any choice of formal likelihood function. The method provides an informative and intuitive visualization of model error occurrence on varied temporal scales, simply by changing the size of the sliding window. We will further show that this method is able to detect superposed errors, if they occur on different temporal scales (e.g., daily and seasonal patterns). In contrast to most existing methods, our approach follows a data-driven bottom-up perspective: there is no need to first make parametric assumptions about the type of model error and then to formulate and perform a parametric hypothesis test as in Neuweiler and Helmig (2017); rather, we let the observed data speak to identify the onset (and offset) of error in a data-driven hypothesis test. Only in subsequent steps, the modeller may investigate the error period more closely and try to formulate hypotheses about the reasons for model error.
By detecting model errors with our proposed analysis, we set the grounds for improved system understanding and potential model improvement in further model building. Detection of errors on different temporal scales is especially useful in hydrology, since it helps improve our conceptual understanding in general and because it improves hydrological predictions and their usefulness in all practical aspects. Nevertheless, the application of this method can be generalized to any dynamic system with long time series available for calibration.
In summary, the main contributions of this study are:
1. We develop a data-driven method for model-structural error detection in a statistically rigorous Bayesian framework, with no prior assumptions about error sources or patterns.
2. By introducing a reference range (sampling distribution) for the model performance metric (BME), we formalize the question "In which periods does the model statistically disqualify itself as quasi-true?" into a data-driven likelihood-ratio test.
3. Our proposed approach is designed to detect error occurrence on various temporal scales, with the choice of temporal scales to be investigated left to the modeller. Hence, our framework generalizes well to arbitrary dynamic systems with long calibration time series, but allows for individual adaptation to specific applications and scientific research questions.
The only restriction on our procedure is that the time scale resolved by the available data time series must be smaller, and the total length of the available data must be larger, than the time scale of the errors to be diagnosed. The article is structured as follows: Section 2 introduces the proposed method, especially the time-windowed Bayesian analysis and the reference range. Section 3 presents the model and the field data used in our case study, and defines several synthetic test cases to validate our method. We discuss the implications of the synthetic cases and interpret the time-windowed analysis for the real data in Section 4. Finally, we conclude about the strengths and limitations of this method in Section 5.

Methods
To investigate the temporal occurrence of model error, we propose to perform a time-window-based analysis of model performance, with the performance metric being BME. Assuming that the model generally fits the calibration data well, it should receive a high BME score most of the time; only in periods when missing or misspecified mechanisms are active will the model's performance deteriorate significantly. This sudden misfit will cause BME to decline, and hence we can use the time-windowed BME (tBME) as a model error indicator.
The challenge is now to distinguish "noisy fluctuations" in BME from statistically significant declines that actually represent model error. Since there is no meaningful interpretation of BME values on an absolute scale, we need a reference to compare the obtained BME value against. We therefore determine a reference interval which contains hypothetical BME values under the assumption that the model is in fact true and error-free (the fact that we still see an interval and not just a single value is due to parameter uncertainty in our model). We can then perform a time series of hypothesis tests: if the model is error-free within a given window, the corresponding tBME value given observed data should fall within the hypothetical interval. This means that, at this time scale (the time window size), the model can be considered "pseudo-true", i.e., the model is representative of the true system in a statistical sense.
If the tBME value drops outside the interval, we can conclude that the current model setup does not accurately describe the observed system state in a given time window, i.e. that model error occurs. This procedure is schematically illustrated in Fig. 1.
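To make this procedure concrete, the reference-interval test can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions of ours: a Gaussian likelihood with constant measurement-error standard deviation `sigma`, and hypothetical datasets drawn as one random ensemble member plus measurement noise. All function and variable names are ours, not from the paper, and log-tBME is used instead of tBME for numerical stability.

```python
import numpy as np

def tbme_reference_test(D_obs, Y, sigma, tau, n_rep=200, alpha=0.05, seed=0):
    """Flag windows whose observed log-tBME falls below the sampling
    distribution of log-tBME generated under 'the model is true'.
    Y is the (n_mc, n_o) prior predictive ensemble of model outputs."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, float)
    n_mc, n_o = Y.shape

    def log_tbme(D):
        # log-tBME per window, averaged over the Monte Carlo ensemble
        out = []
        for j in range(tau, n_o + 1):
            r = D[j - tau:j] - Y[:, j - tau:j]
            logL = (-0.5 * np.sum(r**2, axis=1) / sigma**2
                    - tau * np.log(np.sqrt(2 * np.pi) * sigma))
            m = logL.max()  # log-sum-exp trick for numerical stability
            out.append(m + np.log(np.mean(np.exp(logL - m))))
        return np.array(out)

    # hypothetical datasets: a random ensemble member plus measurement noise
    ref = np.array([log_tbme(Y[rng.integers(n_mc)] + rng.normal(0, sigma, n_o))
                    for _ in range(n_rep)])
    lower = np.quantile(ref, alpha, axis=0)   # lower bound of reference range
    obs = log_tbme(np.asarray(D_obs, float))
    return obs, lower, obs < lower            # True where model error indicated
```

The ensemble `Y` is reused both for the reference distribution and for the observed-data evaluation, which mirrors the point made later that no additional model runs are required.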
In the following, we will explain the theoretical development of our proposed method. Since the method is settled in a Bayesian framework, we start with fundamentals of Bayesian model evaluation in Section 2.1. Then, we explain the moving time-window Bayesian analysis in Section 2.2, mention relevant time scales in Section 2.3, and discuss the theoretical behavior of the tBME curve for varied time scales in Section 2.4. Section 2.5 describes the construction of the reference interval for tBME interpretation. As a side product of time-windowed calibration, we obtain dynamic posterior parameter distributions introduced in Section 2.6.

Bayesian Model Evidence (BME)
The Bayesian framework allows us to quantify an uncertain state of knowledge, which can be updated in the light of observed data. An uncertain quantity of interest y predicted by a model M can be expressed as a random variable with a prior PDF p(y|M), where PDF stands for probability density function. By determining the likelihood of the model prediction to have generated the data, p(D|y, M), this prior PDF is updated to a posterior PDF p(y|D, M):

p(y|D, M) = p(D|y, M) p(y|M) / p(D|M).    (1)

Figure 1. Schematic illustration of model error detection with time-windowed BME (tBME). If the tBME value given data falls within the sampling distribution of tBME, the model can be considered "pseudo-true" during this period (blue crosses); if the value falls outside the sampling distribution, this is an indication of model error (red cross).
The term in the denominator, p(D|M), is a model-specific normalizing constant of the posterior PDF and represents the model's marginal likelihood. This term describes how likely it is that the model M has generated the observed data D. It is therefore also called Bayesian model evidence (BME), which can be expressed as

BME = p(D|M) = ∫ p(D|ω, M) p(ω|M) dω
             ≈ (1/N_MC) Σ_{i=1}^{N_MC} p(D|y_i, M),    (2)

with the last row showing a brute-force Monte Carlo numerical approximation of BME (Schöniger et al., 2014). A large number N_MC of random parameter realizations ω_i is drawn from the parameters' prior PDF p(ω|M) to form a Monte Carlo ensemble, the corresponding model predictions y_i are determined, and their likelihoods are averaged to approximate the Bayesian integral. This approximation is assumption- and bias-free, but computationally expensive. For other possible approximations of BME, readers are referred to Schöniger et al. (2014); P. Liu et al. (2016) and Oladyshkin and Nowak (2019). BME is perfectly suited as a model performance metric in the face of uncertainty (in parameters, forcings, and output data), because it balances a model's goodness-of-fit with a penalty for overcomplexity; see Höge et al. (2018) for a review on this topic. To give a simplistic explanation, think of a 1D prediction: BME is exactly the PDF value of the model's predictive distribution read off at the observed data value. The numerical approximation in Eq. 2 avoids the problem of estimating high-dimensional PDFs (as would be necessary for long time series of calibration data) by sampling the low-dimensional parameter space instead. Thus, it obtains a sample of the predictive PDF by running the forward model on the parameter sample.
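As an illustration, the brute-force Monte Carlo approximation in Eq. 2 can be sketched in a few lines. This is a toy example, not the authors' implementation; `prior_sampler`, `model`, and `likelihood` are placeholder callables we define only for the demonstration.

```python
import numpy as np

def bme_monte_carlo(likelihood, prior_sampler, model, n_mc=20_000, seed=0):
    """Brute-force Monte Carlo estimate of BME (Eq. 2): draw parameters
    from the prior, run the forward model, and average the likelihoods."""
    rng = np.random.default_rng(seed)
    omega = prior_sampler(rng, n_mc)                  # prior parameter sample
    likelihoods = [likelihood(model(w)) for w in omega]
    return float(np.mean(likelihoods))
```

For a 1D toy setting with prior ω ~ N(0, 1), forward model y = ω, datum D = 0, and unit measurement error, the predictive distribution is N(0, 2), so the estimate should approach 1/√(4π) ≈ 0.282, matching the "PDF value of the predictive distribution at the data" interpretation above.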
As likelihood function to represent measurement error, we choose a Gaussian distribution centered about the observed data D:

p(D|y_i, M) = (2π)^(−N_o/2) |R|^(−1/2) exp( −(1/2) (D − y_i)^T R^(−1) (D − y_i) ),    (3)

with N_o representing the number of observations (i.e., the length of D). The model misfits (residuals) D − y_i are attributed to uncorrelated measurement errors, i.e., the error covariance matrix R of size N_o × N_o has non-zero elements only on its main diagonal. It is clear that model errors will show a complicated pattern of correlation (see, e.g., Evin et al., 2013), and many authors have used the likelihood function to also represent model errors (Ammann et al., 2019; Smith et al., 2015). We are not doing so, because we do not want to prescribe any specific model error pattern a priori. Instead, we let the data speak and identify the model errors for us. We therefore propose to start with the simplified assumption of uncorrelated errors in the model error detection phase, and recommend using the result of this analysis as input for more realistic modelling of model errors (or, even better, improvement of the underlying conceptual or physics-based model) in a subsequent phase of model error mitigation. A modeler could also treat random measurement error and known systematic model error in a lumped fashion, in order to detect further, yet unknown errors. Besides, the choice of likelihood function is not limited to the one we show here, and it does not influence the functioning of our framework.
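For the special case of a diagonal R with a single constant variance σ² on the diagonal, the likelihood in Eq. 3 factorizes and can be written compactly. This is a minimal sketch; the constant-variance assumption is ours for brevity, and heteroscedastic diagonals would only change the normalization term.

```python
import numpy as np

def gaussian_likelihood(D, y, sigma):
    """Gaussian likelihood (Eq. 3) with uncorrelated measurement errors:
    R = sigma**2 * I, so |R| = sigma**(2*N_o) and the density factorizes."""
    resid = np.asarray(D, float) - np.asarray(y, float)
    n_o = resid.size
    norm = (2 * np.pi) ** (-n_o / 2) * sigma ** (-n_o)
    return norm * np.exp(-0.5 * np.sum(resid**2) / sigma**2)
```

Note that for long time windows the product of many small densities underflows in floating-point arithmetic, which is why practical implementations work with log-likelihoods instead.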

BME with a Moving Time-Window (tBME)
Recall that we assume that the model under investigation is generally correct except for some unresolved processes that occasionally occur. Under this assumption, the goodness-of-fit of a model given a short dataset should be high, unless this subset of the calibration time series includes the occurrence of unresolved processes. Based on this concept, we propose to calibrate the model with a moving window, using BME as a model performance metric.
The analysis starts with the selection of a fixed window size. This window size determines the length of the time-series subset chosen for BME evaluation. With a time-window size of τ, BME is evaluated on a data subset D_Tj(τ) of τ time steps. Eq. 1 then rewrites as

p(y|D_Tj(τ), M) = p(D_Tj(τ)|y, M) p(y|M) / p(D_Tj(τ)|M),    (4)

which transforms BME (Eq. 2) into the time-windowed BME (tBME):

tBME_j(τ) = p(D_Tj(τ)|M) ≈ (1/N_MC) Σ_{i=1}^{N_MC} p(D_Tj(τ)|y_i, M).    (5)

Note that evaluating p(D_Tj(τ)|M) according to Eq. 2 means applying the likelihood function (Eq. 3) to the data subset D_Tj(τ) instead of the full dataset D.
This time window then slides through the observed time series, producing one BME value per window position; the result as a function of time is what we call the tBME curve (see Figure 2). We obtain N_w individual tBME values that make up this curve, with N_w = N_o − τ + 1: the first time window "eats up" τ data points of the original data time series, and then this window is shifted time step by time step (j = τ, τ + 1, τ + 2, . . . , N_o) until the final window contains the last data point with index N_o (cf. Figure 2). Shifting by a single time step (typically a day in hydrology) yields the maximum number of possible windows; computational time constraints could justify shifting by more than one time step at once, thus reducing the number of tBME values to be evaluated. The length of the time step can be arbitrary, depending on the frequency of the measurements. In our application (Section 3.4), we choose one day as the time step.

Figure 2. Illustration of tBME evaluation with a moving time window. A series of BME evaluations constructs the tBME curve, which is later used as an indicator for model-structural error occurrence.
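The sliding-window evaluation of Eq. 5 can be sketched as follows. We return log-tBME for numerical stability; `Y` is the precomputed N_MC × N_o ensemble of model outputs, and the Gaussian likelihood with constant measurement-error standard deviation `sigma` is a simplifying assumption of ours.

```python
import numpy as np

def tbme_curve(D, Y, sigma, tau):
    """Compute the log-tBME curve (Eq. 5): slide a window of length tau
    over the data and average the per-window likelihoods over the Monte
    Carlo ensemble Y (shape n_mc x n_o). No extra model runs are needed;
    the same ensemble is reused for every window position."""
    D = np.asarray(D, float)
    Y = np.asarray(Y, float)
    n_mc, n_o = Y.shape
    curve = []
    for j in range(tau, n_o + 1):              # windows end at steps tau..N_o
        r = D[j - tau:j] - Y[:, j - tau:j]     # residuals, shape (n_mc, tau)
        logL = (-0.5 * np.sum(r**2, axis=1) / sigma**2
                - tau * np.log(np.sqrt(2 * np.pi) * sigma))
        m = logL.max()                         # log-sum-exp for stability
        curve.append(m + np.log(np.mean(np.exp(logL - m))))
    return np.array(curve)                     # N_w = N_o - tau + 1 values
```

The returned curve has exactly N_w = N_o − τ + 1 entries, matching the window count stated in the text.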

Relevant Time Scales for tBME Analysis
Before discussing the different time scales that are relevant for the tBME analysis, we wish to differentiate between the terms error period and residual period: With error period, we refer to the time span during which the model error occurs (e.g., the model lacks a snow-melt routine), while we define the residual period as the time span during which the effect of model error can be observed in the form of model residuals (e.g., modeled soil conditions are too dry because input from snow-melt is missing). Therefore, the length of residual periods does not necessarily coincide with the length of the error period itself; instead, residual periods are typically longer in hydrological modeling because errors tend to accumulate over time. Since we do not know the source of model error in real-world settings, we can only detect residual periods with the proposed method and hypothesize about the responsible error periods based on our expert knowledge.
Besides the residual period with length L_e, three additional time scales are relevant in our proposed method: the window size τ, the residual recurrence interval R, and the total dataset length N_o (see Figure 3). The term "residual recurrence interval" stands for the residual-free time interval between two distinguishable residual periods. Note that the two residual periods in Fig. 3 do not have to be caused by identical sources. The only condition for errors to be detectable is that the residuals are significant.
The observed dataset should be long enough to contain at least one complete residual period of length L_e. Further, the residual recurrence interval R should be larger than the window size τ, so that the residuals caused by separate events/hydrological conditions are not included in the same time window at any point in time. This ensures identifiability of multiple error segments as individual residual periods. Nonetheless, we will demonstrate the usefulness of our proposed method even for superimposed errors (Section 3.4), since this is a phenomenon often encountered in real-world applications.
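These two conditions are simple to check before running the analysis. A trivial helper of ours (names are assumptions, not from the paper):

```python
def time_scales_ok(tau, L_e, R, n_o):
    """Check the detectability conditions from the text: the dataset must
    contain at least one full residual period (n_o >= L_e), and the
    residual recurrence interval must exceed the window size (R > tau)."""
    return n_o >= L_e and R > tau
```

For example, a one-week window (τ = 7) applied to a year of daily data with month-long residual periods recurring every two months satisfies both conditions, whereas a three-month window would not.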

Theoretical Behavior of tBME Curves
Fig. 4 schematically illustrates the theoretically expected behavior of tBME for different window sizes τ and residual period lengths L_e. The sketch at the top depicts how tBME curves behave generally with and without parameter compensation. In each of the four panels below, the top row identifies the respective case as a function of the length of the time window τ (blue box) relative to the residual period length L_e (red-shaded box). The plots illustrate the corresponding theoretical behavior of tBME curves. The bottom row summarizes the relationship between the total tBME signal length, denoted as L_s, the time-window length τ, and the residual-period length L_e. We provide exemplary numbers for L_s, with one unit corresponding to half a gridline increment. In each case, the blue window slides from the residual-free period (high tBME) through the residual period of length L_e into the next residual-free period (high tBME again). We assume here that residuals are pronounced enough to produce a clear signal, and that conditions are exactly the same before and after the residual period, so that the tBME curve will climb back to its previous level.
In each panel of Fig. 4, four major states of the theoretical tBME curve can be identified as the blue time window slides along the solid red residual period:
• State I: The time window is still outside of a residual period and produces a high tBME value.
• State II: The time window starts to contain the residual period and tBME drops to a low value.
• State III: For the time during which the time window completely captures the residual period, the tBME curve will remain at a low plateau. Potentially, if parameters can successfully compensate model error, tBME temporarily returns to its previous high value (state III*).
• State IV: The time window starts to move into a residual-free period again and the tBME curve rises back to the pre-residual value (state I).
Error signals are defined in State II, when a tBME curve drop can be recognized. It occurs when the time window partially covers the residual period. At this state, the observed data points inside the window are obtained from two different system conditions. Even if there are realizations that fit segments of the data well, there still exists another part of the dataset that cannot be fitted with the same parameter values. As a consequence, goodness-of-fit is reduced for that window, and tBME drops.
In the following, we discuss the differences between the four cases distinguished in Fig. 4. If we assume that parameters cannot compensate for model error at all (there is no parameter realization in the model ensemble that can fit the data well during the residual period), we see a single drop to the low tBME value, no matter whether the window is longer or shorter than the residual period (following the red lines to the low tBME level in Fig. 4). This is clearly visible in Cases 1, 2, and 4. The relationship between two specific time scales defines how long the low tBME value will persist: it can be a valley of length L_e − τ for τ < L_e (Case 1), it can be a single inflection point in the case of τ = L_e (Case 2), or it can be a valley of length τ − L_e in the case of τ > L_e (Cases 3 and 4).
To some degree, modified parameter values might be able to push the model's predictions back into the right direction (for the wrong reasons): parameters have a tendency to "absorb" model error. The degree to which this effect occurs depends on the flexibility of the model structure and the type of error. If the model parameters are able to compensate for model error, tBME may temporarily return to its previous high value (state III*). In this state, the observed data cannot be fitted well by the model with the original parameter set, but the model can still describe the data well with other parameter values. This hypothesis suggests that the tBME curve will recover within a residual period, if the data before and during the residual period have been generated under the same "model setup". Such a type of model error could be treated well with, e.g., time-dependent parameters. Generally, this state is somewhat dangerous, because it might be interpreted as the end of the residual period, although in fact conditions have not changed. It is therefore important to keep an eye on the further development: Our basic assumption is that, during most of the time, there is no significant model error, so typically a stable residual-free period should follow after a tBME climb. If, instead, we observe a second dive right after, we should be warned and suspect state III* instead of state IV.
The timing of the temporary tBME rise can be determined as follows: when the residual period covers half of the time window, this is the worst state in terms of model performance, because parameters have to fit two conflicting states (error-free and error-prone model) simultaneously, which is impossible to achieve. As soon as the window slides further into the residual period, error conditions take up a larger fraction in the tBME calculation, and parameter compensation can start to take effect, so the tBME value has a chance to increase. It may rise until the window covers the full residual period, and it will stay at this high level until the window starts leaving the residual period. Thus, just like the lower valley, the upper intermediate peak (state III*) may also turn into a plateau of varying length, depending on the relationship between window size τ and residual period length L_e (see Fig. 4).
Finally, once the window has moved out of the residual period completely, the tBME curve fully recovers, i.e. it returns to its pre-residual-period high value and remains there in a stable fashion (state I).
We find that the total signal length L s equals L e +τ , no matter if the window size τ is longer or shorter than the residual period length L e . Hence, by observing the total signal length L s from the analysis with a specific τ , we can directly infer the residual period length to be L s −τ . This has a large practical value and will help our interpretation of tBME curves over varied window sizes. We will illustrate and discuss this further on our case studies in Section 3.
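The bookkeeping implied by L_s = L_e + τ can be sketched in a few lines of Python. This is a hypothetical helper, not part of the authors' MATLAB implementation; the function name, array layout, and the threshold that defines a "low" tBME value are our own illustrative choices (a principled threshold is discussed in Section 2.5):

```python
import numpy as np

def residual_period_length(t, tbme, threshold, tau):
    """Infer the residual-period length L_e from a tBME curve.

    Uses the relation L_s = L_e + tau: the total signal length L_s is
    the time the tBME curve spends below the chosen threshold,
    regardless of whether tau is longer or shorter than L_e.
    t and tbme are aligned arrays (one tBME value per window position).
    """
    t = np.asarray(t)
    below = np.asarray(tbme) < threshold
    if not below.any():
        return 0  # no signal, no detectable residual period
    L_s = t[below][-1] - t[below][0] + 1  # total signal length [days]
    return max(L_s - tau, 0)              # L_e = L_s - tau
```

Given, for example, a tBME curve that stays low for 30 consecutive days and a window size of τ = 20 days, this yields an inferred residual period length of 10 days.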
What we did not take into account when drawing this schematic figure is the fact that the tBME calculation will yield larger differences between high and low values for larger datasets, i.e., for larger τ. Therefore, the signals become more evident for larger τ. Finding an optimal window size τ is therefore a compromise between a long enough window to obtain a clear signal, a short enough window to only capture individual error events and avoid "smearing" of different types of model error, and a short enough window to enable fast convergence of the tBME sampling distribution.
In practice, we find that measurement noise and other overlaying effects may distort a clear error indication for window sizes smaller than the residual period length. Our investigations suggest that gradually increasing the window size enlarges true error signals and is hence advisable for reliable error detection (Section 4.1). This is apparent from Fig. 4 especially for τ/2 ≥ L_e (Case 4): the signal is enlarged to a wider valley, because effective error compensation by parameters is no longer possible, since the residual-free period always takes up a larger fraction within the moving window than the residual period. Shifting parameter values would lead to a lower performance in reproducing the error-free data, and hence the overall skill metric (here: tBME) would decrease. Naturally, there is an upper limit to increasing the time-window size: once we arrive at τ = L_d, only a single window is left, tBME equals BME, and we have no error signal at all. When moving from τ ≈ L_e toward this upper limit, the signal will start to weaken at some point, because the residuals within the residual period are diluted too much in the residual-free data. Additionally, the computational effort becomes increasingly challenging. Hence, we recommend testing window sizes starting from the anticipated temporal scale of noise and going up to roughly τ ≈ 2L_e.

Sampling Distribution of tBME as Reference
To distinguish a "low tBME value" from a "high tBME value" as simplistically labelled in Fig. 4, we wish to construct a reference for the obtained tBME values. If a tBME value falls into the model's "personal BME range", the model's behavior in this time window j is interpreted to be quasi-true; if the tBME value falls outside of that range, statistically significant model error is detected.
To obtain this range, we randomly sample tBME per time window by randomly picking a model realization as synthetic data D Tj (τ ) . We then evaluate the corresponding tBME value by brute-force integration with the complete ensemble minus the picked realization, i.e. with an ensemble of size N M C − 1. Repeating this leave-one-out procedure with the full ensemble (and hence representing the full predictive uncertainty in the synthetic data) will yield a tBME distribution that reflects the predictive variability in the model. This variability is typically due to measurement noise and parameter uncertainty (which is the source of uncertainty that we consider in our case study in Section 3), but could also reflect other sources such as uncertain drivers and boundary conditions.
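The leave-one-out sampling procedure can be sketched as follows. This is an illustrative NumPy implementation under our own assumptions (a Gaussian i.i.d. likelihood, an ensemble array of shape (N_MC, τ), and the function names are hypothetical), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_loglik(data, sim, sigma):
    # independent Gaussian measurement errors for each day in the window
    r = data - sim
    return -0.5 * np.sum((r / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=-1)

def tbme_sampling_distribution(sim_window, sigma, n_rep=500):
    """Leave-one-out reference distribution of tBME for one time window.

    sim_window: (N_mc, tau) ensemble predictions inside the window.
    Each repetition picks one realization, perturbs it with measurement
    noise to act as synthetic data D_Tj(tau), and integrates BME by
    brute force over the remaining N_mc - 1 realizations.
    """
    n_mc = sim_window.shape[0]
    samples = np.empty(n_rep)
    for k in range(n_rep):
        i = rng.integers(n_mc)
        synth = sim_window[i] + rng.normal(0.0, sigma, size=sim_window.shape[1])
        rest = np.delete(sim_window, i, axis=0)       # ensemble of size N_mc - 1
        loglik = gaussian_loglik(synth, rest, sigma)
        samples[k] = np.exp(loglik).mean()            # Monte Carlo BME estimate
    return samples
```

The spread of the returned samples then reflects the model's internal predictive variability (parameter uncertainty plus measurement noise) for that window.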
With the tBME sampling distribution as reference, we perform a series of hypothesis tests: for any specified position of the moving window, does the tBME value given real data fall within the high-density region of the sampling distribution, or is it at the low-BME, low-probability tail, or does it even fall outside of the sampled range? It is in the modeler's hands to decide on a significance threshold for rejecting the hypothesis that the model has generated the data (e.g., tBME lower than a chosen quantile of the sampling distribution, or tBME smaller than the smallest sampled BME, etc.). We do not give a general recommendation here, because this significance threshold should be chosen context- and study-specific. We will discuss this further during the interpretation of our test case results, see Section 4.
By observing the dynamic trend of the tBME curve compared with the tBME reference distribution, we can monitor how model performance varies and thus diagnose temporal model error occurrence as shown in Fig. 1. Note that the tBME curve and the reference distribution should be evaluated with time windows of the same length to ensure a consistent comparison.
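The sliding hypothesis test itself is then a simple per-window quantile comparison. The sketch below is a hypothetical helper (function name, array shapes, and the default significance level α = 0.05 are our own choices; as stated above, the threshold should be chosen context-specifically):

```python
import numpy as np

def flag_model_error(tbme_curve, reference, alpha=0.05):
    """Sliding hypothesis test of the model against the data.

    tbme_curve: (n_windows,) tBME values given the observed data.
    reference:  (n_windows, n_rep) leave-one-out tBME samples per
                window; both must use the same window length tau.
    Returns a boolean array: True where the observed tBME falls below
    the alpha-quantile of that window's sampling distribution, i.e.
    where the model statistically disqualifies itself as quasi-true.
    """
    q = np.quantile(reference, alpha, axis=1)  # per-window rejection threshold
    return tbme_curve < q
```

Because the threshold is computed per window, the test automatically adapts to a reference distribution whose width changes over time.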

Dynamic Posterior Parameter Distribution
When a model is calibrated with a sliding time-window, a series of posterior parameter PDF samples is obtained "on-the-fly". If varying conditions in different time windows favor significantly different parameter values, we expect to see significant shifts in the posterior parameter samples, e.g. visualized as histograms. Such shifts provide information about potential parameter compensation effects.
We recommend inspecting the series of posterior parameter distributions as a diagnostic tool to learn about the type of model error that was detected with the tBME analysis. If parameters are able to compensate for model error, the model structure and the prescribed parameter bounds seem to be adequate (enough) in principle, and the reason for this apparent time-dependence of parameter values should be investigated in detail. If, instead, parameters are not able to effectively compensate for model error, the model structure seems to be too rigid and inaccurate, which points the modeller to missing processes, wrong assumptions about mass balance components, or similar misspecifications. In this case, the posteriors either run up against bounds of their priors (as pa-rameters could in principle compensate at the cost of taking implausible values), or they do not react at all (as parameters simply cannot address the model error).

Demonstration on a Hydrological Case Study
We demonstrate our proposed approach to model error detection on a hydrological case study from the field of soil hydrological modelling. We first introduce the model and field data in Sections 3.1 and 3.2, then explain the numerical implementation in Section 3.3, define synthetic test cases for our analysis in Section 3.4, and finally present and discuss the results in Section 4.

Soil Hydraulic Model
We use HYDRUS-1D (Simunek et al., 2005) to numerically solve the Richards equation, which describes water movement in a one-dimensional vadose zone. It writes as

$$\frac{\partial \theta}{\partial t} = \frac{\partial}{\partial z}\left[K(h)\left(\frac{\partial h}{\partial z} + 1\right)\right] - R,$$

where θ is the volumetric water content [-], h is the soil water pressure head [cm], t is time [day], z is the depth [m], K is the unsaturated hydraulic conductivity function [cm day⁻¹], and R [cm³ cm⁻³ day⁻¹] is a sink term when root water uptake is considered. This equation has three unknowns, namely h, K, and θ, and thus requires two constitutive relations for closure: the water retention function θ(h) and the unsaturated conductivity function K(θ). A commonly used description of these hydraulic properties is the Mualem-van Genuchten (MVG) model (van Genuchten, 1980; Mualem, 1976):

$$\theta(h) = \theta_r + (\theta_s - \theta_r)\left[1 + |\alpha h|^{n}\right]^{-m} \quad (h < 0), \qquad K(S_e) = K_{sat}\, S_e^{\,l}\left[1 - \left(1 - S_e^{1/m}\right)^{m}\right]^{2},$$

with the effective saturation

$$S_e = \frac{\theta - \theta_r}{\theta_s - \theta_r},$$

where θ_r and θ_s refer to the residual and saturated water content [−], respectively; α [cm⁻¹] and n [−] are shape parameters, K_sat refers to the saturated hydraulic conductivity [cm·day⁻¹], m = 1 − 1/n is an empirical coefficient, and l [−] is the pore connectivity parameter by Mualem (1976). We use a modified form of the MVG model with an air-entry value of h_s = −2 cm (Vogel & Cislerova, 1988).
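For concreteness, the two MVG constitutive relations can be evaluated with a few lines of Python. This is a textbook-form sketch, not HYDRUS-1D code; the air-entry modification (h_s = −2 cm) is omitted, and the function names and argument order are our own:

```python
import numpy as np

def mvg_theta(h, theta_r, theta_s, alpha, n):
    """MVG water retention theta(h); h in cm, negative when unsaturated."""
    m = 1.0 - 1.0 / n
    Se = np.where(h < 0, (1.0 + np.abs(alpha * h) ** n) ** (-m), 1.0)
    return theta_r + (theta_s - theta_r) * Se

def mvg_K(Se, K_sat, n, l):
    """Unsaturated conductivity K(Se) [cm/day] after Mualem (1976)."""
    m = 1.0 - 1.0 / n
    return K_sat * Se ** l * (1.0 - (1.0 - Se ** (1.0 / m)) ** m) ** 2
```

At saturation (h = 0, S_e = 1) these reduce to θ = θ_s and K = K_sat, as required by the closed-form model.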
In one of the considered test cases (definition see Section 3.4), we alternatively use the dual-porosity model by Durner (1994), which describes the retention properties as a linear combination of van Genuchten curves for different pore spaces:

$$S_e(h) = \sum_i w_i \left[1 + |\alpha_i h|^{n_i}\right]^{-m_i},$$

where the w_i are weighting factors for the subclasses of the soil, with w_i ∈ (0, 1) and Σ_i w_i = 1. One of the advantages of the dual-porosity model is the similar interpretation of the parameters as in the van Genuchten model (Durner, 1994).

Study Area and Field Data
The field data used in this research are from the Spydia experiment at a site in the northern Lake Taupo catchment, New Zealand. Tensiometric pressure head was measured every 15 minutes with tensiometer probes (UMS T4e, Germany, accuracy 0.5 kPa) at five depths between 0.4 and 5.1 m, but in this research we only use the field data from three depths: 0.4, 1.0, and 2.6 m. The daily measurements of the top layer (0.4 m, the closest of the three layers to the surface) show the most interesting dynamics and are therefore used in this research. Meteorological data from the Waihora station, 500 m away, are used to estimate potential evaporation with the Penman-Monteith equation (Monteith, 1981). Detailed information about the field experiment is given in Wöhling et al. (2009) and Barkle et al. (2011) and is therefore not repeated here. An earlier study investigated 7 different soil hydraulic models and found that none of the models fitted the field data for all time periods, i.e., all models showed structural error for at least some time period. That study focused on improving predictive performance with Bayesian model averaging or Bayesian model combination (BMA/C) techniques rather than on the identification of structural errors. In the spirit of model diagnostics and process understanding, it is therefore interesting to investigate the model deficiencies in more detail.

Numerical Implementation
In our simulation, we use a previously established model ensemble, which consists of N_MC = 900,000 realizations to represent uncertainty in the five MVG parameters θ_s, K_sat, α, n, and l per soil layer, adding up to 15 uncertain parameters for three soil layers. θ_r is fixed to zero, because it is not sensitive in the considered calibration setup. The prior ranges for the uncertain parameters are given in Table 1, with all three soil layers sharing the same parameter bounds. As stated in Section 3.2, we focus on the simulations of pressure head in the top soil layer for model evaluation and model error detection.
For a given dataset D_Tj(τ) of a specific time-window size τ, we determine tBME (Eq. 5) by brute-force Monte Carlo integration (Eq. 2). For this step, we need to define the likelihood function p(D_Tj(τ)|θ_i, M) (Eq. 3). We here assume a measurement error standard deviation of 0.09 cm for the soil water pressure head data. This evaluation is exactly like a brute-force Monte Carlo implementation for traditional Bayesian model evaluation, except that the calibration dataset is deliberately chosen to be very short.
To assess the convergence of our numerical BME scheme, we determine the effective sample size (ESS) (J. S. Liu, 2004). The ESS counts how many parameter realizations contribute significantly to the BME estimate. We also perform a visual analysis of the BME convergence behavior. Both analyses together provide a good indicator of how stable the obtained numerical result is. Although we have monitored BME convergence in all investigated cases, we wish to emphasize that, for real-world applications, a (too) low ESS is not a problem, for two reasons: (1) A low ESS arises if only very few realizations show a non-zero likelihood, i.e., if most of the ensemble performs very badly. This is the situation that we are in fact most interested in: it reveals severe model error. In this case, we will obtain a low BME value; and we do not care whether it is low, very low, or extremely low, so the numerical BME estimate does not have to be that accurate.
(2) We do not use the obtained BME value for any further procedure (such as model-averaged prediction), but we are satisfied with BME serving as more of a qualitative model error detector. We are therefore not relying on an accurate estimate of BME for further processing.
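The ESS used above is commonly computed from the normalized likelihood weights; a minimal sketch (our own helper, consistent with the importance-sampling definition in J. S. Liu, 2004):

```python
import numpy as np

def effective_sample_size(loglik):
    """ESS = 1 / sum(w_i^2) with normalized likelihood weights w_i.

    Counts how many realizations effectively contribute to the Monte
    Carlo BME estimate: N for uniform weights, ~1 if a single
    realization dominates.
    """
    w = np.exp(loglik - loglik.max())  # shift for numerical stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```

With uniform log-likelihoods over N realizations this returns exactly N, while a single dominating realization drives it toward 1, which is the "severe model error" situation described above.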
To now obtain a complete tBME curve for a chosen time-window size τ, we start by evaluating the first tBME value (tBME_j with j = τ) for the period of simulation days 1 to τ. Then we slide this window by one day, determine tBME_j for j = τ + 1, and so on, until we hit the end of the full calibration dataset (cf. Section 2.2). In this way, we construct a tBME curve for a given dataset, e.g., for the observed time-series of pressure heads.
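The sliding loop can be sketched as follows (again an illustrative NumPy version with our own function name and array layout, assuming the Gaussian i.i.d. likelihood; heavy underflow handling is omitted for clarity):

```python
import numpy as np

def tbme_curve(data, sim, tau, sigma=0.09):
    """Slide a window of length tau day by day over the calibration
    period and evaluate tBME for each window position.

    data: (L_d,) observed series; sim: (N_mc, L_d) ensemble predictions.
    Returns the window end days j = tau, ..., L_d and the tBME values.
    """
    L_d = data.shape[0]
    ends = np.arange(tau, L_d + 1)
    curve = np.empty(ends.size)
    for k, j in enumerate(ends):
        r = data[j - tau:j] - sim[:, j - tau:j]
        loglik = -0.5 * np.sum((r / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
        curve[k] = np.exp(loglik).mean()  # brute-force MC integration per window
    return ends, curve
```

Plotting `curve` against `ends` reproduces the convention used in the figures, where the first tBME value appears at t = τ.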
We repeat this routine for 30,000 randomly selected output realizations used as synthetic data, in order to construct the tBME sampling distribution as reference (cf. Section 2.5). Convergence of the sampling distribution was checked visually and by monitoring ESS. Note that ESS does become relevant in this synthetic part of the procedure, because we want the reference distribution to be informative and reliable. We therefore ensure that, during these synthetic runs, ESS does not drop below the value of 200. This is usually not a limiting factor, because the synthetic analysis by definition assures that the model is true, so BME performance does not drop that dramatically, and neither does ESS.
Finally, we repeat this whole procedure (sliding time-window applied to real and synthetic data) for distinct increasing values of window size τ , i.e. τ = 5, 10, 15, 20 days.
For each time window that we use for calibration (and to determine tBME), we obtain a posterior parameter sample as a "side product". We simply weight our prior parameter ensemble with the corresponding vector of likelihoods (dimension 1 x N M C ) and approximate the posterior PDF by kernel density estimation on the weighted sample. As an analogue to the tBME curve, we obtain a series of posterior parameter distributions that we can inspect for further investigation of the potential source of model error.
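This likelihood weighting can be sketched as follows. We show plain weighted resampling of the prior ensemble (the text applies kernel density estimation to the weighted sample afterwards, which we omit here); the function name and shapes are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_sample(prior_params, loglik, n_out=1000):
    """Posterior parameter sample for one time window as a 'side product'.

    prior_params: (N_mc, n_par) prior parameter ensemble.
    loglik:       (N_mc,) log-likelihoods of the realizations for this
                  window (the same vector used for the tBME estimate).
    The prior ensemble is weighted by the likelihoods and resampled
    with replacement.
    """
    w = np.exp(loglik - loglik.max())  # unnormalized importance weights
    w /= w.sum()
    idx = rng.choice(prior_params.shape[0], size=n_out, p=w)
    return prior_params[idx]
```

Repeating this per window position yields the series of posterior samples whose shifts are inspected in Section 4; weighted resampling rather than direct KDE is used here only to keep the sketch dependency-free.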
All our analyses are implemented and performed in MATLAB R2019a on a standard desktop computer.

Synthetic Test Cases
Before applying our proposed method to the real dataset, we wish to validate the method under fully controlled conditions. We use exactly the same model setup as in the real case, but manually construct the dataset of "observations" to be used for the tBME analysis. To test different types of model error that could occur in reality, we define three synthetic scenarios in which we knowingly introduce model error. The goal is to confirm that our method successfully identifies these predefined error periods, and thereby to demonstrate how to interpret the model error indicator. Finally, we use the real data to test our approach under realistic conditions.
To generate the synthetic datasets for our analysis, we extract a single realization of model output (soil water pressure head) from the ensemble (base case with no model error, see definition below). In the next step, we perturb this realization with different types of error to generate synthetic datasets that deviate from the model behavior, such that model error should be detected. The perturbations aim to mimic the scenario in which a model fails to resolve a process that occurs only temporarily in the natural system.
While, philosophically, model error of course resides in the model predictions and not in the observed dataset, we here pragmatically introduce "errors" into the synthetic datasets. This approach is more straightforward to implement, easier to interpret, and allows us to modify and arbitrarily increase the complexity of the introduced error without having to set up a new model ensemble for each test case. Whether we introduce the error in the dataset or in the ensemble of predictions has no impact on the functioning of our proposed approach.

Base Case: No model error
To demonstrate how our method reacts to the ideal case of no model error, we pick one realization of the model ensemble as synthetic data for the tBME analysis. This realization lies within the high-density range of the predictive prior to represent "typical model behavior" (Fig. 5b). The model parameters of this case are listed in Table 2. Layers 1, 2, and 3 refer to the depths of 0.4, 1.0, and 2.6 m, respectively. This dataset serves as a reference case for comparison with the following synthetic cases 1 and 2. For synthetic case 3, we pick a realization that lies within the low-density range of the predictive prior to represent a "low-probability synthetic truth". Also, this parameter set (Table 3) produces pressure head data with dynamics similar to the field data.

Synthetic Case 1: Error in Model Structure
The synthetic dataset of the base case is now perturbed by structural error during specific time periods. We assume that "reality" behaves like a dual-porosity model during these periods (with strong precipitation events kick-starting this behavior), and approximately like a single-porosity model otherwise. Technically, we run HYDRUS-1D with the error-free setup described for the base case, and switch to the dual-porosity model setup during three selected error periods: days 31 to 40, 81 to 90, and 161 to 170 (cf. Fig. 6a).
The following points are considered for the selection of these error periods: 1. The error periods do not start before t = 20, which is the largest window size used in our method demonstration. 2. The error periods are not too long, so that the time window does not stay inside them for too long. 3. The error periods cover simulation periods of different spread in the predictive prior. 4. The largest resulting residual per day (i.e., each bar in Fig. 10b) is not larger than 0.2 m.

Table 4. Dual-porosity MVG parameter values applied to perturb the base-case dataset during error periods in the case of model structural error (case 1); parameters for the first pore subsystem are identical to those given in Table 2 (base case), with w_1 = 1 − w_2.

                 w_2    α_2    n_2
Layers 1 and 2   0.85   31     15
For constructing the perturbed dataset, the parameters K_sat, θ_s, and l remain the same as in the base case; parameter values for the first pore subsystem are chosen as in the base case, and parameter values for the second pore subsystem are given in Table 4. Here, the larger values of α and n than in the base case imply that a macropore system is considered as the second soil subsystem. Note that we assign the same dual-porosity parameter values to layers 1 and 2, while layer 3 is assumed not to show dual-porosity behavior at all.
The ensemble that we test remains the same throughout all test cases: it consists of single-porosity runs (MVG model) with prior parameter ranges given in Section 3.3. Hence, we see significant pressure head residuals during error periods and beyond (Fig. 6b), because the tensiometric pressure responses in a dual-porosity soil structure to heavy precipitation in "reality" are quite different from what our tested single-porosity model predicts.

Synthetic Case 2: Error in Model Forcing
As a second scenario, we consider error in the model forcing. We mimic local effects that are quite common in reality: the nearby meteorological station has recorded precipitation, but the event is so localized (e.g., a convective shower) that it actually did not rain at the experimental site. As a consequence, soil moisture measurements show no response (or rather, pressure head becomes even more negative due to the drying soil, see Fig. 10), while our model predicts increased soil moisture due to a heavy precipitation event (pressure head becomes less negative, see Fig. 10). Technically, the synthetic dataset is generated by running HYDRUS-1D in the setup of the error-free base case (Table 2), but the precipitation input is removed within the error periods (days 36 to 40, 89 to 90, and 166 to 170). This causes significant residuals during error periods and beyond due to the "storage memory" of the soil hydraulic model, see Fig. 10b. Note that, here, the error period (a forcing error of, e.g., just one day) is much shorter than the residual period (the time span during which the impact on simulations is obvious from the residuals, approximately 10 days).
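The forcing perturbation itself amounts to zeroing out precipitation during the error periods; a minimal sketch (our own hypothetical helper operating on a daily series, using the 1-based day numbering of the text):

```python
import numpy as np

def remove_precipitation(precip, error_periods):
    """Construct the perturbed forcing for synthetic case 2: recorded
    precipitation is deleted during the error periods, mimicking a
    localized shower that missed the experimental site.

    precip: (L_d,) daily precipitation, day 1 at index 0.
    error_periods: list of (start_day, end_day), inclusive, 1-based.
    """
    out = precip.copy()
    for start, end in error_periods:
        out[start - 1:end] = 0.0  # convert 1-based inclusive days to a slice
    return out
```

For the error periods above (days 36 to 40, 89 to 90, and 166 to 170), 12 daily precipitation values in total are removed from the forcing series.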

Synthetic Case 3: Superimposed Errors in Model Structure and Forcing
With the third test case, we aim to study the impact of realistic, superimposed errors. We run HYDRUS-1D in single-porosity mode with the parameter set in Table 3. During the error periods defined in case 1, we apply the dual-porosity model by switching on w_2, α_2, and n_2 as shown in Table 5. Forcing errors are then superimposed by removing precipitation input on simulation days 44 to 45, 81, 82, 84, 161, 163, 164, and 165. These superimposed error periods also cause superimposed residual periods. This leads to a combined impact on the residuals between the synthetic dataset and the model ensemble predictions on different time scales, see Fig. 14b. The two types of residuals tend to partially cancel out, such that net residuals are temporarily reduced: switching to dual-porosity mode means considering macro-pore flow, which compensates for the missing precipitation, since water moves quicker to greater depths just like it does when it rains.

Results and Discussion
We first present the results of the tBME analysis applied to the error-free base case, then use our findings for the interpretation of the results of synthetic test cases 1 to 3 (Section 4.1), and finally show the results of applying our proposed approach to the observed field data in Section 4.2.

Base Case: tBME Given a Correct Model
The base case illustrates what the tBME curve should look like if the model is correct. Fig. 5a records the precipitation (model input), and Fig. 5b presents the model's prior predictive distribution (with the gray shaded areas indicating the 68%, 95%, and 99% prediction intervals) and the dataset (blue) used for the evaluation of tBME values. Figs. 5c to 5f show the tBME curves (red curves) resulting for four different time-window sizes τ. Their corresponding tBME sampling distributions are shown as blue shaded areas indicating the 68% and 95% confidence intervals. Note that we label these as "confidence intervals", because the resampling procedure invokes a frequentist perspective, although our analysis is otherwise embedded in a Bayesian framework. The x-axis of the tBME curve and the tBME sampling distribution is shifted by τ days. For example, the first tBME value of Fig. 5c is plotted at t = τ = 5, which is evaluated with the observed data from the first day to the fifth day (cf. Fig. 2).

Model-Internal Variability of tBME
As shown by the red curves in Figures 5c to 5f, all tBME curves fall inside the range of their tBME sampling distributions during all time periods, which confirms that no significant model error occurs. This meets our expectation, since this dataset is generated by the model.
In Fig. 5c, the tBME curve of τ = 5 contains various drops. This is due to the model's parameter uncertainty: even though the base-case dataset is a model realization sampled from the high-density range of the predictive prior, other realizations in the ensemble do not necessarily follow the exact dynamic trend of this individual realization. This effect can be observed when the model is calibrated with a small window size. With increasing time-window size, those small deviations become less influential, because the bias from short-term effects is diluted over a larger window size. This can be observed in the tBME curve for τ = 20, which is smoother than the one for τ = 5. The important conclusion is: no matter how wiggly the tBME curve, if it stays within the tBME sampling range, there is no substantial (statistical) reason to believe in model error; it can just be due to parameter (or other sources of) uncertainty.

Figure 5. Results for tBME analysis if the model is correct (error-free). 68 % PPI denotes the prior prediction interval; 68/95 % tBME CI denotes the 68/95 % confidence interval.

Dynamics of tBME sampling distribution
It is worth noting that the effect of parameter uncertainty increases over time in our case study (Fig. 6a), leading to wider tBME sampling distributions at later times (Figs. 6c to 6f). This demonstrates that the criterion to trigger an error signal (falling below a certain quantile of the sampling distribution) is dynamic in time.

tBME Threshold for Error Detection
The challenge is hence to identify a meaningful lower threshold for tBME to label a drop as "model error occurrence". As noted in Section 2.5, there cannot be general guidance for all cases; rather, we recommend performing this base-case test before interpreting the results of real tBME curves, as done here: by choosing a realization from the high-density region of the predictive prior with a small window size, modellers can check to which quantile of the sampling distribution the tBME curve drops. This level is the highest one that should be used for model error detection (to avoid false-positive error detection); rather, we would expect to see much more pronounced drops with real data if severe model error occurs. Looking at increasing window sizes τ will help gain confidence in the choice of the threshold. This is discussed in more detail in the context of test case 1 in the following section.

Case 1: tBME Given Model Structural Error
Our first synthetic test case demonstrates the time-windowed BME analysis to detect model structural errors. Fig. 6 presents the results in the same order as for the base case (Fig. 5).
The red-shaded boxes in Fig. 6b represent the error periods during which the "truth" is simulated with the dual- instead of the single-porosity model (see Section 3.4). These time periods now feature an "observed" process that is not contained in the model ensemble.
The corresponding perturbed dataset is shown by the solid blue line. The base-case dataset is shown as a dotted line for comparison. The differences between those two datasets (i.e., the residuals between the model realization corresponding to the base-case dataset and the dataset now used for case 1) are plotted as gray bars. These residuals provide an impression of how wrong the model ensemble is compared to the given dataset (with variations among model realizations due to parameter uncertainty). Recall that we use the term residual period to clearly distinguish this time span of error-induced residuals from the actual error period.
It becomes apparent that the first two error periods only really have an effect when precipitation occurs at t = 32 days and t = 82 days, while during the third error period, residuals are accumulating and are intensified during moderate precipitation from t = 165 days. Note that the residuals persist longer than the actual time span of error (residual period longer than error period) due to memory effects caused by water storage.

Error Detection
Figures 6c to 6f present the tBME curves for the four chosen window sizes. The tBME curves for τ = 5 and τ = 10 show various small drops, which are still within the bounds of the corresponding tBME sampling distributions. This means that the model calibrated on periods of five or ten days still fulfills the statistical condition of being quasi-true. With increasing window size, the signal of error occurrence becomes more pronounced. At t around 40 and 100 days, the tBME curves of τ = 15 and τ = 20 both leave the high-density interval and even the complete range of the tBME sampling distribution, which indicates significant error occurrence. Ideally, a clear turning point can be observed when the time window enters a residual period, as shown by the immediate decline at t = 33 and t = 83 days in the tBME curve for τ = 20 (Fig. 6f).
Figure 6. tBME curves given model structural error (synthetic case 1: reality switches to dual-porosity behavior, model stays in single-porosity setup). (Manuscript submitted to Water Resources Research.)

By inspecting the tBME curve in more detail (Fig. 7, for the example of τ = 10), we identify the four major states of the tBME curve introduced in Section 2.4:
• State I: The time window is outside of a residual period and the tBME curve stays within the high-density interval of the tBME sampling distribution.
• State II: The time window starts to contain the residual period and the tBME curve declines.
• State III*: For the time during which the time window completely captures the residual period, tBME temporarily returns to the high-density interval of the sampling distribution. This shows that the parameters successfully compensate for model error.
• State IV: The time window starts to move out of the residual period (residual-free) and the tBME curve rises to the high-density interval of the sampling distribution.

For easier recognition, we have marked and color-coded the respective time windows and corresponding tBME values in Figures 7b and 7c. The same four states can be identified for window sizes larger than the perceived residual period length (e.g., for τ = 20, Fig. 8), with the only difference that state III* (compensation) is less pronounced. This is because the time window used for tBME calculation will never contain only the residual period, but will always partially contain a residual-free period. Hence, although compromise parameter sets are preferred, they will not achieve a high-density tBME value. With τ = 20 ≈ 2L_e, we would have expected a flat plateau of low tBME values according to our theoretical considerations in Section 2.4; instead, we see a slight intermediate peak of the tBME curve in Figure 8b (day 95). A possible explanation could be the wetting event at t = 90 days, which seems to alleviate the model's difficulty to fit the data; a similar effect can be observed for the next wetting event after t = 100 days. Since the residuals are due to a lack of macro-pore flow behavior in the model (which would make water reach greater depths faster), a wetting event helps to put the model into conditions that are more similar to the synthetic data.
This can also be seen from the residuals plotted as gray bars: their accumulation over time is abruptly stopped with the precipitation event at t = 90 days. If we look at the dip of the tBME curve, it starts decreasing at t = 82 and returns at t = 111. The total signal length is hence 29 days (111-82), which corresponds to the theoretically expected length of τ + L e .

Sensitivity of tBME to Window Size
In this test case, the misfit between the model ensemble and the data is around 0.1 m on average and spans a period of 10 to 15 days. The comparison of the four tBME curves reveals that a time-window size of 50 to 100 % of the residual period length is not yet conclusive here, because such a short period is not sufficient to disentangle the error signal from measurement noise (which is chosen to be of a similar order of magnitude as the residuals here) and model-internal variability due to parameter uncertainty. Also, under real-world conditions, the tBME curve for a narrow window is expected to be sensitive to short-term deviations, because daily deviations take up a larger fraction of the total residuals. For instance, a short-term variation in the system, e.g., a misfit during one day, contributes one fifth of the residuals in a time window of five days. Therefore, major signals might not yet be distinguishable from noise in the tBME diagnosis with small window sizes.
The BME indicator strength increases with the calibration time-window length, because the task of fitting all contained data points well with any single parameter combination becomes increasingly hard (especially since we assume uncorrelated errors in our likelihood, cf. Eq. 3). Hence, wider time windows smooth out the noise, provided the error duration is longer than the typical scale of the noise (or of unresolved, short-term processes of lesser importance to the modeller), and amplify true error signals. This is what we observe here: when the time window is larger than the actual residual period (e.g., τ = 20 days), the problematic periods are clearly detected.
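The mechanism can be illustrated with a minimal Monte Carlo sketch of a tBME estimator: an ensemble of prior model runs is scored with an iid Gaussian window likelihood (as in Eq. 3), and the log of the ensemble-mean likelihood is tracked over sliding windows. The function name, the sine-wave "model", and the injected error are all illustrative assumptions, not the authors' setup:

```python
import numpy as np

def sliding_log_bme(y_obs, y_sim, sigma, tau):
    """Monte Carlo estimate of time-windowed BME (tBME) on a log scale.

    y_obs : (T,) observed series; y_sim : (N, T) prior ensemble of model
    runs; sigma : measurement-error std of the iid Gaussian likelihood;
    tau : window length in time steps.  Sketch of the prior-based
    arithmetic-mean BME estimator only.
    """
    N, T = y_sim.shape
    log_bme = np.empty(T - tau + 1)
    for s in range(T - tau + 1):
        res = y_obs[s:s + tau] - y_sim[:, s:s + tau]        # (N, tau)
        # iid Gaussian log-likelihood, summed over the window
        ll = (-0.5 * np.sum((res / sigma) ** 2, axis=1)
              - tau * np.log(sigma * np.sqrt(2 * np.pi)))
        # log of the ensemble-mean likelihood (log-sum-exp for stability)
        log_bme[s] = np.logaddexp.reduce(ll) - np.log(N)
    return log_bme

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 6, 200))
ens = truth + 0.05 * rng.standard_normal((50, 200))    # toy prior ensemble
obs = truth + 0.05 * rng.standard_normal(200)
obs[100:115] += 0.5                                    # injected residual period
curve = sliding_log_bme(obs, ens, sigma=0.05, tau=10)
# the deepest dip occurs where the window overlaps the error period
print(int(np.argmin(curve)))
```

Longer windows sum more squared residuals per window, which is why true error signals grow with τ while uncorrelated noise averages out.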
Overall, we take away from this first test case that a variation of τ can be very insightful and strengthens our interpretation of results.

Dynamics of Posterior Parameter Distributions
Alongside the tBME curve, we obtain a series of posterior parameter samples for each time-window size τ. We can inspect how each parameter's posterior PDF changes with increasing or decreasing tBME. Figures 7 and 8 show the posterior PDFs of all five uncertain model parameters corresponding to the five illustrated time windows for window sizes τ = 10 and τ = 20 days, respectively. We observe the clearest trends in parameter value shifts for short window sizes τ, because parameters can then be fine-tuned to very short calibration datasets and generally do not have to satisfy varying conditions simultaneously.
For τ = 10 in Fig. 7, we conclude that the shape parameters α and l are most sensitive to the residual periods: during phases when dual-porosity behavior would be needed to adequately mimic the data, α and l tend to take on higher values than during error-free (single-porosity) periods. We nicely see the error compensation attempt in those two parameters, with their posterior PDFs shifting from state I to state III* (tBME returns to the high-density region due to parameter compensation) and back.
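For a precomputed prior ensemble, Bayesian updating on one calibration window reduces to importance weighting by the window likelihood; sliding the window then yields the time sequence of posterior parameter distributions discussed here. A minimal sketch with a hypothetical one-parameter "model" (names and setup are illustrative, not the paper's code):

```python
import numpy as np

def windowed_posterior_weights(y_obs, y_sim, sigma, start, tau):
    """Posterior weights of a prior ensemble for one calibration window.

    With an iid Gaussian likelihood, the posterior probability of each
    precomputed ensemble member is proportional to its window
    likelihood.  Illustrative sketch only.
    """
    res = y_obs[start:start + tau] - y_sim[:, start:start + tau]
    ll = -0.5 * np.sum((res / sigma) ** 2, axis=1)   # log-likelihoods
    w = np.exp(ll - ll.max())                        # stabilized weights
    return w / w.sum()                               # normalized posterior PMF

# Toy demo: a constant "model" y = p; the windowed posterior mean of the
# parameter p should move toward the value generating the observations.
rng = np.random.default_rng(1)
params = rng.uniform(0.5, 1.5, size=100)             # prior samples of p
y_sim = params[:, None] * np.ones((100, 50))         # trivial model runs
y_obs = 1.2 * np.ones(50) + 0.01 * rng.standard_normal(50)
w = windowed_posterior_weights(y_obs, y_sim, 0.01, start=0, tau=10)
print(float(np.sum(w * params)))                     # weighted mean, close to 1.2
```

Tracking such weighted summaries (means, quantiles) per window reproduces the state-by-state PDF shifts shown in Figs. 7 and 8.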
These two parameters have the effect of scaling the water retention function: higher values correspond to lower water contents at a specified pressure head, or to higher pressure at a specified water content, respectively. To confirm this conclusion, we look at the maximum-likelihood water retention curves obtained in the five states discussed above (Fig. 9a). We also show the true water retention curves for the error-free base case and test case 1 for comparison. As expected, we see a shift of the water retention curve from state I, being close to the base case, toward case 1, induced by higher α and l values that better represent the water retention function of the dual-porosity model parameterization. Hence, although the synthetically introduced model error here is related to the model concept, the model manages to compensate the error by adjusting those parameters that determine the shape of the water retention curve. Note that the information content of the calibration data for identification of the true water retention curve is limited, as can be seen from Fig. 9a: the pressure head observations in each state (data in the five different time windows) are marked as crosses on the corresponding simulated curve. These data mostly cover the dry range, where the water retention curve becomes a vertical line. None of the pressure head observations in that range is informative, because differences are very subtle, and hence the true water retention curve cannot be identified. This is a particular result of the highly porous volcanic soils present in our case study region. In these soils (similar to sand), the water content drops dramatically over just a narrow range in matric potential.
Hydraulic conductivity as a function of pressure head (shown in Fig. 9b) is dominated by the parameter compensation effect that causes the curves to shift from state I, closest to the base case, toward the curve of case 1 for states II and III*. The unsaturated conductivity of case 1 is much lower than that of the base case, leading to a slower movement of infiltration fronts and larger pressure heads.
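The soil hydraulic functions discussed here follow the Mualem-van Genuchten (MVG) parameterization. The following is a sketch of the standard textbook form (not necessarily the exact HYDRUS-1D variant used in the study), illustrating why higher α shifts the retention curve toward lower water contents at a given pressure head:

```python
import numpy as np

def vg_theta(h, theta_r, theta_s, alpha, n):
    """van Genuchten water retention curve theta(h); h is the pressure
    head (negative in the unsaturated zone), alpha > 0, n > 1."""
    m = 1.0 - 1.0 / n
    Se = (1.0 + (alpha * np.abs(h)) ** n) ** (-m)    # effective saturation
    return theta_r + (theta_s - theta_r) * Se

def mualem_K(h, K_sat, alpha, n, l=0.5):
    """Mualem-van Genuchten unsaturated conductivity K(h); the pore
    connectivity l is one of the five uncertain parameters."""
    m = 1.0 - 1.0 / n
    Se = (1.0 + (alpha * np.abs(h)) ** n) ** (-m)
    return K_sat * Se ** l * (1.0 - (1.0 - Se ** (1.0 / m)) ** m) ** 2

# Larger alpha gives a lower water content at the same pressure head,
# which is the compensation direction described in the text.
h = -100.0                                   # e.g., cm of pressure head
print(vg_theta(h, 0.05, 0.45, 0.02, 2.0) >
      vg_theta(h, 0.05, 0.45, 0.08, 2.0))   # -> True
```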
As expected, parameter compensation effects become less dominant if the window size is increased beyond the residual period length (illustrated for τ = 20 in Figs. 8 and A2 in Appendix A). Parameters now have to find an optimal compromise solution whenever (parts of) the error period are contained in the calibration time window, and this reduces the shifts in PDFs and water retention curves or unsaturated hydraulic conductivity functions between time window positions. Yet, differences between the individual parameters can still be found, which can help gain further understanding of how these parameters act in the model and how they possibly compensate for model error.

Case 2: tBME Curve Given Forcing Error
The second case illustrates the analysis of tBME curves in the presence of forcing errors. Fig. 10 shows the forcing (precipitation), the synthetic perturbed dataset, the error-free base case dataset for comparison, and the resulting tBME curves together with their sampling distributions for the four window sizes τ . Recall that, to construct the dataset for this test case, we have run HYDRUS-1D with the base case setup but deleted the precipitation during the pre-defined error periods, which are marked with red-shaded boxes in Fig. 10a.

Figure 10. tBME curves given forcing error (synthetic case 2: precipitation marked in red has not happened at the field site, but is used as input for our model).

Error Detection
It becomes apparent that only the first two error periods lead to a significant accumulation of residuals between the perturbed data and the model's base case realization. Both periods are clearly detected by the tBME indicator at window sizes larger than five days, with tBME leaving its high-density region and, for the error period with the largest residuals, even the complete range of the sampling distribution.
The theoretically expected behavior of the tBME curve can be seen beautifully in the example of τ = 10 < L_e in Fig. 11c. The fall and rise of tBME each take approximately τ/2 = 5 days, and the intermediate plateau of high tBME values persists for exactly L_e − τ = 3 days. The second dive is somewhat distorted, with a fast fall and a slow rise. This effect might be due to the precipitation events that hit the system right in this "recovery" period.
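This timing bookkeeping for τ < L_e can be written out explicitly. A toy sketch (plugging in τ = 10 and L_e = 13 as implied by the numbers above; the helper is purely illustrative):

```python
def expected_tbme_phases(tau, L_e):
    """Expected phase durations of the tBME valley for tau < L_e
    (cf. Section 2.4): fall ~ tau/2, a compensation plateau of
    L_e - tau days while the window lies fully inside the error
    period, a mirrored second dive, and total signal tau + L_e."""
    assert tau < L_e, "this bookkeeping assumes tau < L_e"
    return {"fall": tau / 2,
            "plateau": L_e - tau,
            "rise": tau / 2,
            "total_signal": tau + L_e}

phases = expected_tbme_phases(10, 13)
print(phases["fall"], phases["plateau"])   # 5.0 days fall, 3 days plateau
```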
Figure 11. Zoomed-in view of tBME curve and dynamic posterior parameter distributions for window size τ = 10 (synthetic case 2).

Sensitivity of tBME to Window Size
As discussed in the context of case 1 (model-structural error detection), moving toward longer time windows amplifies true error signals and flattens out noise and fluctuations of the tBME curve due to parameter variability within the model. Further results are provided in the Appendix.
Based on the two test cases investigated so far, we conclude that the selection of an appropriate window size τ depends on many factors, such as the magnitude of model residuals, the level of measurement noise, the duration of residual occurrence, and the internal variability of the model ensemble. Although many of these factors might be unknown in practice, it helps if the initial τ selection is based on (expert) knowledge of the expected error and its duration, about which hydrologists, or modellers in general, typically have some idea. If expert knowledge is not available, we recommend analyzing tBME curves for a prior selection of window sizes to assist model error detection and to provide insights about the temporal scales of model error occurrence.

Dynamics of Posterior Parameter Distributions
We find that, also in the case of forcing error, a compensation effect is achieved by shifting parameter distributions. This can be seen from the climb of the tBME curve for τ = 10 days around t = 100 days, and the second drop right after (Fig. 10d). Around this point in time, the time window almost exclusively captures the residual period, and parameter values are reweighted in Bayesian updating to achieve an optimal fit to the perturbed data. A rise of the tBME curve during a residual period can only be explained by shifting parameter values, and this explanation is confirmed by inspecting the dynamic posterior parameter distributions for selected windows of size τ = 10 days in Fig. 11.
The MVG shape parameters α and n seem to be responsible for error compensation: both take on distinctly low values only during the residual period. Inspecting the maximum-likelihood water retention curves obtained for the five states, we find a clear shift toward higher water content at a given pressure head during residual periods. This is the only possible way for the model to fit the observed pressure head data despite the surplus of precipitation in the soil. It is achieved by the interplay of low α, low n, and high θ_s. Similarly, hydraulic conductivity is shifted to higher values at a given pressure head. This leads to faster drainage of the soil profile, which in turn results in lower θ and higher h. This is clearly a compensation effect.
To find out whether these successfully compensating parameter sets are clearly identifiable from calibration within the short window of 10 days, we additionally show the 95% posterior credible intervals of the water retention curves and the unsaturated conductivity functions of states I and III* in Fig. 13. The intervals of both states are extremely large (much wider than in case 1, cf. Fig. A3 in Appendix A) and allow for a wide range of soil hydraulic functions. This warns us that there is no dominant pattern of error compensation, which is plausible since model error is caused by model forcing in this test case, not by modifications of the model parameterization.
Sliding out of the residual period, it seems that K_sat and θ_s now take care of the excess water in the model (since it had to cope with precipitation input that is missing in the true system). K_sat is clearly increased during this phase (and rather insensitive otherwise), and θ_s is reduced (cf. posterior PDFs in Fig. 11). The model seems to struggle when going back into the residual-free period: in the perturbed data, the current saturation is taken as initial value to continue calculations with the original parameter setup, but the model is in a different state because it tried to compensate for excess water and hence carries some "history" with it. On top, another heavy precipitation event hits the catchment during the recovery phase after t = 100 days. One might expect that this event should equalize the tensiometer traces; but since the model is not doing very well during infiltration events anyway, the original dip in the reference BME curve is enhanced. All parameters shift in a way distinct from all other phases (green time window and PDFs in Fig. 11). Both effects combined might explain why pre- and post-residual-period parameter values are not the same for θ_s and K_sat, and why the water retention curves look different between state I before the residual period and state I after. But, importantly, K(h) has gone back to the same state I as before.
We conclude that inspecting the dynamic posterior parameter PDFs, posterior water retention curves and posterior unsaturated hydraulic conductivity curves including their remaining uncertainty can help put the detected model error into context. The sensitivity of individual parameters to the sliding calibration time window can provide information about the type and source of error.

Case 3: tBME Curve Given Superimposed Errors in Model Structure and Forcing
In reality, model errors occur at unexpected frequencies and on various temporal scales, so multiple residual periods can be expected to overlap each other. The analysis of multiple superimposed errors is demonstrated with the setup of case 3 described in Section 3.4. Fig. 14a records the forcing (precipitation); Fig. 14b shows the synthetic perturbed dataset and the error-free base case dataset for comparison. The red-shaded boxes in Fig. 14b mark the periods when the model suffers from structural error (reality is in dual-porosity mode), and the red-shaded boxes in Fig. 14a show the periods when the model suffers from erroneous forcing. The model misfits caused by these two errors partially cancel out. The net residuals are presented by the gray bars in Fig. 14b. The resulting tBME curves are plotted together with their sampling distributions for the four window sizes τ in Fig. 14c to Fig. 14f.

Figure 14. Results of tBME analysis if model structural and forcing errors are superimposed (synthetic case 3).

Error Detection through Varied Window Size
Consistent with the previous two cases, we do not perceive clear signals from the shortest window size τ = 5 days. From τ = 10 days on, we see strong signals during all three residual periods; and these signals are massively intensified when increasing the window size toward τ = 20 days.
Recall that the dotted lines represent the tBME curves of the model realization that was perturbed to generate the synthetic dataset for this test case 3 (labelled "error-free tBME" in Fig. 14c to Fig. 14f).
Note that this dataset is deliberately a different one than the base-case dataset underlying cases 1 and 2, with the motivation of moving as close to realistic dynamics as possible. The realization used here does not fall within the high-density region of the prior predictive distribution (cf., e.g., days 160 to 170 in Fig. 14b), so short segments of the error-free tBME curves fall outside the shown intervals of the BME sampling distribution (e.g., days 160 to 170 in Figs. 14c and 14d). Note that, by construction, it must fall inside the sampling distribution, just at a very extreme tail. However, we observe that this signal is not enlarged when the time-window size is increased. This finding again supports our confidence in the method: true model errors will be detected across varied window sizes through drastically intensified signals; model behavior within the statistically plausible range will not produce a strong signal.
The superposition of errors can be seen when comparing the tBME curves for τ = 10 days with the tBME curves for larger window sizes during days 30 to 70: a time window of ten days exactly separates the two residual periods (the residual caused by structural error first, the residual caused by precipitation input error second); at larger time windows, these two signals triggered by the two residuals merge. This points us effectively to the time scales of both errors, which are about ten days of large residuals each.
When one residual period overlaps the other (days 80 to 90), we see a double peak for τ = 10 days, and a merged signal for larger window sizes. Inspecting the double peak in more detail in Fig. 15 reveals that the rise of the tBME curve after its first dive into the residual period comes too early: theoretically, we would expect it at τ/2 days after the onset of the signal because τ ≈ L_e (cf. Section 2.4), i.e., at t = 88 days. Here, instead, we see a rise already at t = 86 days. This is because the superimposed errors cancel out to some degree, and hence tBME has a chance to rise again even earlier. As expected, it arrives at its temporary peak L_e days after the onset of the signal (because then the residual period is completely captured by the window), here at t = 90 days. Sliding out of the residual period again leads to a decline of tBME, which should arrive at its intermediate minimum after τ/2 days, but again is too early here. This is why the rise toward the end of the error signal appears very long.
The steep decline and slow return of the tBME curve is indeed the defining characteristic of superimposed, counteracting errors: the partial cancelling of residuals alters the tBME behavior originally expected for individual errors in this way. The effect becomes even clearer if we carefully think about which timing to expect here for the ideal case of a single error signal: since τ is in fact slightly larger than L_e, the two falls of the tBME curve should last τ/2 days, and the two rises should last L_e − τ/2 days; i.e., the fall would theoretically take longer (five days) than the rise (two days) to a plateau of three days' length. What we see is the opposite behavior, and this is due to the counteracting effect of the error sources.
Also, the merged signals for larger τ in both residual periods (days 30 to 70 and 80 to 90) look distinct and can be distinguished from the case of parameter compensation as seen in cases 1 and 2. In the case of parameter compensation during a single-error period, we observed a stable plateau in the tBME curve when increasing the window size. Here, we see distinct steep drops from such a plateau. Similar to our interpretation above for τ = 10 days, we find that the fall to the minimum plateau, expected to occur for t = 91 to 103, seems to be dampened (τ = 20 days, Fig. 16). The plateau is interrupted by an additional dive, before tBME starts to climb back up as expected from t = 104 days on.
We found that determining the residual period length as total signal length L_s minus τ yielded seven days, which only identifies the period of large superimposed residuals. Hence, for directly superimposed residuals, inferring the individual time scales will be hard (at least for the less pronounced residual), but varying the time-window size together with low-noise data could help to identify this one as well.

Dynamics of Posterior Parameter Distributions
When inspecting the dynamic posterior parameter distributions for this test case 3 at a window size of 10 days (Fig. 15), we see attempts at parameter compensation that seem disrupted: judging from the tBME value during state III, a parameter shift can accomplish a return to the edge of the high-density region of the tBME sampling distribution; however, this shift is not consistent throughout the residual period. For all parameters but α, the first reaction of the parameter distributions during state II produces much more pronounced shifts away from state I (residual-free period) than during state III. This again confirms a "mitigating" effect of error superposition in this case. The role of α has indeed not changed: it takes on very small values during error compensation as in test case 2 (erroneous forcing), but the interplay with n seems different here.
We further inspect the maximum-likelihood water retention curves and unsaturated hydraulic conductivity functions for the five states in Fig. 17. Also here we see mixed compensation signals from the two previous cases.
The fact that α clearly shifts to counteract the forcing error, but not the structural error (α moved to higher values in test case 1), combined with the much larger shift of the water retention curve toward higher water contents as in case 2, suggests that the forcing error plays the dominant role here. Such interpretations thus help to identify the source and role of model errors. For interpreting real-world analysis results (as done in Section 4.2), we therefore recommend creating synthetic scenarios with hypothesized errors and investigating the corresponding outcomes of tBME, posterior parameter distributions, and posterior soil hydraulic functions for comparison.
For τ = 20 days, we see more consistent parameter compensation attempts in both the marginal parameter PDFs (Fig. 16) and the soil hydraulic functions (Fig. A10). "Averaging" the impact of both errors seems to produce clearer compromise solutions in parameter values: α is decreased and n is increased during the residual period. Again, this represents a mitigation of the forcing error, not so much of the structural error. θ_s and K_sat show trends of increased saturated water content and decreased conductivity during (and after) the residual period, which is similar to our findings from case 1 (structural error), not case 2 (forcing error).
Note that the error timing in case 3 is different and does not allow for a detailed one-to-one comparison: while the structural error period in this third test case is the same as in case 1, the forcing error has been chosen to overlap, so it occurs earlier in case 3 than in case 1. This leads to a severe precipitation event hitting the catchment during its error recovery phase in case 3; in case 2, this event was subject to forcing error inside the error period. Hence, both scenarios naturally call for different strategies to bring the model back into a state that matches the original error-free dataset well.

Summary of Findings from Synthetic Test Cases
We found that tBME curves can be quite wiggly due to parameter uncertainty or other factors that increase the dynamics and variability in ensemble predictions. Also, the tBME sampling distribution itself can show dynamics over time. In conclusion, no matter how wiggly the tBME curve obtained for real data is, if it stays within the tBME sampling range, there is no substantial (statistically significant) reason to believe in model error.
We do not give any general recommendations on a significance threshold for "low tBME" as an indicator for error occurrence; rather, we encourage the modeler to test increasing window sizes and identify those periods that consistently produce clearer signals. For more confidence, the modeler could perform synthetic tests as demonstrated here to learn about the model's tBME reaction to a certain level of error under idealized conditions.
Results have shown that different types of errors, and even superimposed errors, can be detected with our proposed method. The interpretation of the tBME behavior is guided by the theoretical considerations in Section 2.4: we can, e.g., easily identify the residual period length from visual inspection of "tBME valleys". We found that the tBME curves in our test cases mostly followed the theoretically expected behavior closely; however, changing conditions (such as in the forcing) make the tBME curve leave its theoretical trajectory. Also, superimposed errors that tend to cancel out have a specific impact on the tBME curve: they produce a steep decline and a slow return. Hence, by comparing the actually obtained tBME curve with the theoretical sketches, and by paying attention to other factors that could change the model ensemble's behavior before and after an error period, we can extract valuable information about the type and duration of model error.
Analyzing dynamic shifts in time-windowed posterior parameter distributions allows us to gain an understanding of how parameters possibly compensate for model error. These insights may, in turn, provide avenues for model refinement. In the specific case of soil hydraulic modeling, interpreting posterior water retention curves and unsaturated hydraulic conductivity functions proves highly valuable. In the considered case study setup, the range of "synthetically observed" data values covered only a very small (dry) portion of the full saturation range; hence, the data are only marginally informative for identifying the most plausible soil hydraulic parameter values. Although this is a well-known and important phenomenon, it is often forgotten during the analysis of hydrological model results.
Potential users of our proposed method might ask themselves whether just looking at residuals would be sufficient to obtain a model error signal, avoiding the hassle of going through Bayesian updating altogether. Unfortunately, inspection of residuals alone would not do the trick: remember that what we showed as gray bars in Figs. 6 to 16 are the residuals between the error-free dataset and its perturbed version, which is unknown in practice. We would have to look at average residuals of ensemble predictions versus data instead, or at best-fit residuals, and both can be misleading if parameter compensation occurs and/or superimposed errors tend to cancel out. Thus, it is beneficial to make use of the Bayesian framework: it allows error detection via tBME and deeper interpretation with respect to error compensation based on dynamic parameter posteriors.

Results of tBME Analysis for Real Observation Data
We now use the actually observed Spydia data to construct the tBME curves and wish to identify model error in our tested ensemble. The dataset has not been perturbed with extra errors. Note that the overall time period of the calibration dataset and the simulation is the same as in all synthetic cases; hence, we can use our findings from the base case and test cases 1 to 3 to help us interpret the results from this real case, since the source of error is now obviously unknown. This case is a demonstration of the potential usefulness of our method under realistic conditions, while the synthetic cases have served as a validation of our approach.

Error Detection through Varied Window Size
The drops in the tBME curve for τ = 5 indicate multiple potential error occurrences, which could, however, be clouded by measurement noise, as noted in the synthetic test cases. Comparing with the observations, we notice that drops in the tBME curve always occur when a recession phase abruptly changes into a wetting event, i.e., when the soil water pressure jumps toward higher moisture contents. This implies that the model cannot describe highly dynamic responses well, and consequently tBME goes down.
With the wiggly nature of the tBME curve under real-world conditions, the sampling distribution proves to be very helpful for interpretation. That is also why we designed synthetic case 3 such that it mimics real-world dynamics: to familiarize ourselves with the interpretation of such curves.

Figure 19. Results of tBME analysis for real field data.
To distinguish between statistically significant errors and noise, we search for enlarged signals along the vertical axis of Fig. 19 (tBME curves of increasing τ). From the tBME curve for τ = 20, we observe amplified signals that fall outside the tBME sampling distribution during days 34 to 53, 75 to 96, and 105 to 128, which we consider error signals. The interval between days 130 and 160 does not show noticeable model errors, since the tBME curve does not shift downward significantly with increasing window size.
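Operationally, this detection rule amounts to comparing each tBME value against a low quantile of the sampling distribution obtained from model-generated synthetic datasets. A hypothetical sketch (the helper name, the Gaussian stand-in for the sampling distribution, and the two-sided 95% level are illustrative assumptions):

```python
import numpy as np

def error_windows(tbme, tbme_samples, level=0.95):
    """Flag time windows whose tBME falls below the sampling
    distribution of BME values from model-generated synthetic
    datasets (tbme_samples: n_realisations x n_windows)."""
    lo = np.quantile(tbme_samples, (1 - level) / 2, axis=0)
    return tbme < lo      # True where statistically significant
                          # model error is indicated

# Toy demo: 500 synthetic tBME draws per window; the observed curve
# drops far below the sampling range in windows 20..29.
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=(500, 60))
tbme = np.zeros(60)
tbme[20:30] = -5.0
flags = error_windows(tbme, samples)
print(flags[20:30].all(), flags[:10].any())
```

Repeating the check for several window sizes τ and keeping only the windows flagged consistently implements the "enlarged signal along the vertical axis" criterion used here.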
The tBME curves of larger window sizes do not indicate well-separated single peaks, which suggests that either the time scale of model error is on the order of ten days or less, or model error events overlap, leading to a smeared-out signal. Both hypotheses are probably true if we assume that model error is related to the times of wetting (one to three days): with increasing τ, the next event is often already included in the window.
In Fig. 20, we can inspect the dive of the tBME curve between days 103 and 118 in more detail. The model clearly leaves the high-density tBME trajectory for a period of L_s = 14 days. Recalling the theoretical properties of the tBME indicator, the underlying error period (the time period with significant residuals) should be of length L_s − τ = 4 days. Further, the pattern of the tBME curve looks more similar to superimposed errors (cf. synthetic case 3) than to individual errors (cf. cases 1 and 2). Again, superimposed short-term errors seem plausible here if we assume that they are related to soil wetting, given that several days of precipitation with varied intensity occur during this period.
For simulation days 160 to 170, we perceive a signal with a very clear, familiar shape: a widening valley for increasing time-window size. This signal formally does not leave the reference interval obtained by sampling tBME from the model; however, the sampling range is very wide here due to the increased model-internal variability. We therefore consider this to be a sign of model error: during this period, the model structure itself may not be disqualified as "quasi-true", but most of the parameter ensemble is. Certainly, the model is struggling here, and the reasons should be investigated.

Dynamics of Posterior Parameter Distributions
Inspecting the dynamic posterior parameter distributions for τ = 10 days in Fig. 20, we see clear attempts at parameter compensation: the MVG shape parameters α and n are both increased during the low-tBME period, the pore connectivity l is decreased, and low values of the saturated water content θ_s are preferred, paired with a higher conductivity K_sat. This reaction is similar to the pattern of parameter compensation that we have seen in test case 1 (structural error, missing dual-porosity component in the model).
Yet, judging from the tBME curve, we conclude that these compensation attempts through shifting parameter values are not highly successful. They seem to dampen the fall of the tBME curve to some degree, but they cannot bring its value up to the higher-density region of the model. This underlines that the model most likely suffers from a structural error related to the wetting phases (e.g., missing macropore flow, surface runoff, hydrophobicity) that needs to be treated by modifying the model structure; time-dependent parameters could help here only to a limited extent.
When inspecting the dynamic water retention curves and unsaturated hydraulic conductivity curves in Fig. 21, we find that their shape in most states does not resemble the typical shape that our model ensemble favors. This supports the hypothesis that a structural error dominates which is not related to parameter misspecification. The remaining uncertainty after calibration on this short time window is large (see Fig. 22), but still the maximum-likelihood curve of state I falls outside of the 95% credible interval. This emphasizes the model's struggle to fit the data well, even during periods with a high-level tBME value.
The discussed findings are confirmed when inspecting the posterior parameter distributions for τ = 20 days in Fig. 23. For this window size, we see a total signal length of about 24 days, which yields the same suggested residual period length of four days. The steep return to the higher-density region of our model is also at the same relative position and suggests that the model copes much better with the period after t = 109 days. This exactly corresponds to the transition between heavy precipitation and drier conditions, and supports our hypothesis that the model is much better at representing soil water redistribution (soil drying) than at simulating infiltration events (soil wetting).
Within these longer time windows, α and K_sat seem to try to compensate heavily during the recession phase of the precipitation event at t = 108 days. The other parameter distributions look similar to those for τ = 10 days. This suggests that averaging over longer periods of time does not require opposite directions of parameter compensation, and hence indicates potentially superimposed errors of the same type. This again supports our hypothesis of structural errors in our model related to infiltration processes. To improve the model, we could allow the macropore flow fraction w2 to be another uncertain parameter and see whether it could effectively compensate most of the detected model-structural error. Given the current error patterns revealed by our analysis, we would expect w2 to be larger than zero during the residual period, and the misfit of this effective soil-system model to be reduced. However, this is beyond the scope of the current study.

Summary and Conclusions
Model structural errors are omnipresent in hydrological modelling. Identifying such errors is vital to deepen our system understanding, to adequately judge the skill of hydrological models, and to improve their forecast reliability in subsequent research. We propose a data-driven method to detect time-dependent structural errors. Our method builds on the statistically rigorous Bayesian framework and does not require any prior assumptions about the error type or pattern. The novelty of our method is to apply a sequence of Bayesian hypothesis tests by sliding a calibration time window through the given dataset. By testing along the time axis the assumption that the model has generated the data, we can identify the time points when the model fails to describe the system sufficiently well. The core idea of our approach is hence to perform a time-windowed Bayesian analysis that identifies short periods in time where the model statistically disqualifies itself as (pseudo-)true.
We track model performance, measured as Bayesian model evidence (BME), over time to detect the onset of structural error, as well as potential error compensation mechanisms during the error period. As a threshold for reliable error detection, we introduce reference intervals for time-windowed BME (tBME) by sampling the model's predictive distribution as synthetic datasets. If a tBME value falls into this reference range, the model's behavior in this time window is interpreted to be quasi-true; if the tBME value falls outside of that range, statistically significant model error is detected. If tBME drops from the high-density region of the sampling distribution to a very low quantile, this can also be suspected to be a sign of model error; statistically, this does not reject the hypothesis that the model structure has generated the data, but it rejects a large part of the model's prior parameter space for that time window.
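The tBME analysis described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian likelihood with known measurement-error standard deviation sigma, represents the model by an array of prior-sampled simulations, and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_bme(sim, obs, sigma):
    """Monte Carlo estimate of log-BME: average the Gaussian likelihood of the
    observations over prior-sampled model outputs (log-sum-exp for stability)."""
    # sim: (n_prior, n_t) prior-sampled simulations; obs: (n_t,) observations
    loglik = (-0.5 * np.sum(((obs - sim) / sigma) ** 2, axis=1)
              - obs.size * np.log(sigma * np.sqrt(2.0 * np.pi)))
    m = loglik.max()
    return m + np.log(np.mean(np.exp(loglik - m)))

def tbme(sim, obs, sigma, tau):
    """Slide a window of length tau along the series; one log-BME per window."""
    return np.array([log_bme(sim[:, t:t + tau], obs[t:t + tau], sigma)
                     for t in range(obs.size - tau + 1)])

def tbme_reference(sim, sigma, tau, n_rep=200, q=(0.025, 0.975)):
    """Reference interval: repeat the tBME analysis on synthetic datasets drawn
    from the model's predictive distribution (prior realization plus noise)."""
    curves = []
    for _ in range(n_rep):
        synth = sim[rng.integers(sim.shape[0])] + rng.normal(0.0, sigma, sim.shape[1])
        curves.append(tbme(sim, synth, sigma, tau))
    return np.quantile(np.array(curves), q, axis=0)  # lower/upper bound per window
```

A tBME value below the lower bound then flags statistically significant model error in that window; varying τ probes different error time scales.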
Our proposed approach is designed to detect error occurrence on various temporal scales, which is especially useful in hydrological modelling. As a side product, we obtain a time sequence of posterior parameter distributions (or of best-fit calibration parameters) that help investigate the reasons for model error and provide guidance for model improvement. Inspecting the dynamic parameter distributions (and, in the specific soil hydraulic context: posterior soil hydraulic functions) allows us to put the detected model error into context, and the sensitivity of individual parameters to the sliding calibration time window can provide information about the type and source of error.
We have first provided guidance on how to interpret the resulting tBME curves theoretically: their expected behavior is determined by the interplay of the chosen time-window length τ, the residual period length (duration of significant residuals between model and data) L_e, and the residual recurrence interval R. The proposed method works best under the following conditions:
1. The data set should cover the full residual period to be detected.
2. The signal-to-noise ratio in the residuals needs to be reasonably high (and ideally error and noise show on different time scales).
3. If potential parameter compensation effects shall be detected, the sliding time window needs to be shorter than the residual period.
4. If multiple errors shall be identified individually, the recurrence interval of residuals should be larger than the time-window size (although superimposed errors can also be detected, probably with varying degrees of clarity in practical applications).
Since the tBME curve purely reflects the residuals, its interpretation regarding potential error sources requires expert knowledge due to potential delays of the system response (e.g., time lags due to storage effects).
Then, we have "calibrated" the reader on how to interpret tBME curves under known error scenarios by designing three synthetic test cases that vary in the error source and error time scales, including a scenario of superimposed errors. Since superposition of two errors of the same sign (both overestimating or underestimating the target value) naturally increases the net residuals (and simplifies the task for our proposed model error indicator), we here deliberately picked error types that could potentially cancel out.
The test cases are inspired by a real case study on soil hydraulic modeling of the Spydia experiment site in the northern Lake Taupo catchment, New Zealand. We have used daily measurements of tensiometric pressure head of the top soil layer to detect (1) hypothetical structural error (single-porosity model vs. dual-porosity system behavior), (2) forcing error (model input not representative of actual precipitation on-site), and (3) a superposition of both error types. Finally, we have presented insights from applying our proposed method to the actual field observations.
Results of our synthetic test cases demonstrate that we can detect the start and end of residual periods for all investigated error sources and temporal scales. Even for superimposed errors that tend to cancel out, we are able to reveal the "fingerprint" of each error with our method. With the sampling distribution of tBME as a reference, we can clearly differentiate between true error signals and model-internal variations that lead to fluctuations in the tBME curve. Varying the time-window size τ has been found to be very insightful, as it intensifies true error signals and filters out noise.
The theoretically expected patterns and timing of the tBME curve could be recovered in all investigated test cases. Of course, our theoretical considerations hold for idealized, constant residuals during the error period; in reality, we have arbitrary patterns of residuals that affect the shape of the tBME curve. In hydrology specifically, we tend to have rising and falling residuals due to the memory in the system and the model. This means that the timing of, e.g., the lowest point of tBME might differ in real cases. However, we found that the expected patterns are surprisingly clear to identify even from real data. Hence, we can use the simple calculation L_e = L_s − τ to determine the temporal scale of the detected error, i.e., the length of the residual period (producing significant residuals between model predictions and observed data) equals the total tBME signal length minus the chosen length of the moving time window.
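As a brief numerical illustration of this relation, using the numbers from the synthetic test case discussed in Appendix A:

```python
# L_e = L_s - tau: inferring the residual period length from the tBME curve
tau = 20   # chosen sliding-window length (days)
L_s = 33   # total tBME signal length read off the curve (days)
L_e = L_s - tau
assert L_e == 13  # residual period length (days), matching the known error
```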
In the application to the real dataset, our analysis pointed to the model's weaknesses with respect to soil moisture recession and revealed very short temporal scales of error. Superimposed errors of longer scale were not identified; overall, the hypothesis that this model has generated the data could not be rejected during most periods. This, in turn, means that the model is doing respectably well most of the time under the assumed scenario of measurement error and parameter uncertainty. For real-world predictions, we might want to reduce both sources of uncertainty to obtain more precise predictions; that, however, could possibly lead to statistical rejection of the model during larger time periods, because then the hypothesis test through the tBME analysis becomes stricter. Such investigations are left for future studies.
In principle, our concept of tBME could be generalized to a time-windowed calibration analysis using any other model performance metric based on the residuals between model predictions and data. However, we prefer using BME as a model error indicator because it is consistent with the surrounding Bayesian framework, it takes into account measurement noise, and it is cheap to obtain upon calculation of likelihoods that are needed for Bayesian calibration (i.e., to obtain posterior parameter and predictive distributions).
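To illustrate this generalization, a minimal sketch of a time-windowed analysis with an arbitrary residual-based metric follows (RMSE is used as an assumed example; the function name is hypothetical and not part of the proposed method):

```python
import numpy as np

def sliding_metric(residuals, tau, metric=lambda r: np.sqrt(np.mean(r ** 2))):
    """Evaluate any residual-based performance metric (default: RMSE)
    over sliding time windows of length tau."""
    n = residuals.size
    return np.array([metric(residuals[t:t + tau]) for t in range(n - tau + 1)])
```

Unlike BME, such generic metrics ignore measurement noise and the prior parameter distribution, which is one reason BME is preferred here.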
Overall, we have shown the theoretical rigor of the proposed method, and its usefulness for detecting model errors in various scenarios. We are confident that the concept of the tBME analysis will be valuable in a wide range of disciplines that rely on predictions by dynamic models with potential model errors.
Appendix A: Further Results

A1 Synthetic Case 1, τ = 10 Days

Fig. A1 presents the 95% posterior credible intervals of the water retention curves and the unsaturated hydraulic conductivity functions at state I and state III* for a window size of τ = 10 days. Comparing the uncertainty intervals of these two states in Fig. A1a, we observe slightly decreasing uncertainty when the time window moves from the error-free state (state I) to a parameter compensation state (state III*). This is caused by the observation points leaving the high-density area of the predictive prior.

A3 Synthetic Case 2, τ = 20 Days

We also note the familiar pattern of two drops in the tBME curve (e.g., at t = 100 days for τ = 10 days) flattening out into one single valley for τ = 20 days. This pattern indicates that the time span of the model error lies between those window sizes. In other words, only states II (initial decline) and IV (final rise) are clearly observed now, but state III (intermediate rise and fall) is absent. This can be investigated in more detail by comparing the zoomed-in tBME curves for τ = 10 and τ = 20 days (Figs. 11 and A4, respectively) during days 90 to 110. The tBME curve for τ = 10 shows two drops (state III* identifiable), while only one wide valley appears in the tBME curve for τ = 20 (state III). This signal merge occurs when the time window is larger than the error period itself, because compensation of model error through parameter adjustment cannot happen, and so the tBME curve does not return to the high-density region of the tBME sampling distribution. Hence, it can be judged from Figs. 10f and A4c that the error persists for at least ten and less than 20 days. This conclusion is correct: we can verify it by estimating the length of significant residuals (gray bars in Fig. 10b) produced by the second error period, which is 13 days. The same number is obtained when determining L_e = L_s − τ based on our theoretical considerations, with L_s = 33.
A6 Real Data, τ = 10 Days

Figure A9 presents the 99% uncertainty range of the WRC at state I and state III in Fig. 20, which is the real-data case analyzed with τ = 10 days.