--- title: "Introduction to mhurdle" output: bookdown::pdf_document2: keep_tex: yes extra_dependencies: ["xcolor"] bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Introduction to mhurdle} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Data collected by means of households' expenditure survey may present a large proportion of zero expenditures due to many households recording, for one reason or another, no expenditure for some items. Analyzing these data requires to model any expenditure with a large proportion of nil observations as a dependent variable left censored at zero. Since the seminal paper of @TOBIN/58, a large econometric literature has been developed to deal correctly with this problem of zero observations. The problem of censored data has also been treated for a long time in the statistics literature dealing with survival models. In applied microeconometrics, different decision mechanisms have been put forward to explain the appearance of zero expenditure observations. The original Tobin's model takes only one of these mechanisms into account. With `mhurdle`, up to three mechanisms generating zero expenditure observations may be introduced in the model^[his package is a new version of a package first developed as part of a PhD dissertation carried out by St\'ephane @HOAR/09 at the University of Reunion under the supervision of Fabrizio Carlevaro and Yves Croissant.]. More specifically, we consider the following three zero expenditure generating mechanisms. - A good selection mechanism (hurdle 1). According to this mechanism, the consumer\footnote{The consumer we are referring to is that of the microeconomic theory, an abstract economic agent responsible of the decisions of a consumption unit that may be an individual, a family, a household. According to the economic literature, we term this concept "the consumer" by convenience.} first decides which goods to include in its choice set and, as a consequence, he can discard some marketed goods because he dislikes them (like meat for vegetarians or wine for non-drinkers) or considers them harmful (like alcohol, cigarettes, inorganic food, holidays in dangerous countries), among others. This censoring mechanism has been introduced in empirical demand analysis by @CRAGG/71. It allows to account for the non-consumption of a good as a consequence of a fundamentally non-economic decision motivated by ethical, psychological or social considerations altering the consumer's preferences. - A desired consumption mechanism (hurdle 2). According to this mechanism, once a good has been selected, the consumer decides which amount to consume and, as a consequence of his preferences, resources and selected good prices, its rational decision can turn out to be a negative desired consumption level leading to a nil consumption. The use of this mechanism, to explain the presence of zero observations in family expenditure surveys, was introduced by @TOBIN/58. Its theoretical relevance has been later rationalized by the existence of corner solutions to the microeconomic problem of rational choice of the neoclassical consumer.^[See section 10.2 of @AMEM/85, for an elementary presentation of this issue, and chapter 4 of @PUDNEY/89, for a more comprehensive one.] @CRAGG/71's combination of this desired consumption mechanism and the selection mechanism leads a double hurdle model, also called two-part model, which has been extended by @BLUNDELL/87 to the case where the two equations are correlated. 
- A purchasing mechanism (hurdle 3). According to this mechanism, once a consumption decision has been taken, the consumer sets up the schedule at which to buy the good and, as a consequence of his purchasing strategy, a zero expenditure may be observed if the survey by which these data are collected is carried out over too short a period with respect to the frequency at which the good is bought. This censoring mechanism was introduced in empirical demand analysis by @DEATO/IRISH/84. It accounts for the non-purchase of a good not because the good is not consumed but because it is an infrequently bought good or service. This is typically the case for durable or storable non-durable goods, but also for many infrequently bought services like health, leisure, education and training services. Remarkably, this mechanism allows deriving from observed expenditures the rate of use of a durable good, the rate of consumption of a stored non-durable good, and the rate of purchase of an infrequently consumed non-storable non-durable good or service.

For each of these censoring mechanisms, a continuous latent variable is defined, indicating that censoring is in effect when the latent variable is negative. These latent variables are modeled as functions of explanatory variables and of a random disturbance, with a possible correlation between the disturbances of different latent variables in order to account for a possible simultaneity of the decisions modeled by the censoring mechanisms. To model possible departures of the observed dependent variable from normality, as is common with economic variables, we use transformations of this variable allowing non-normal random variables to be rescaled to normality. By combining part or the whole set of these censoring mechanisms, we generate a set of non-nested parametric models that can be used to explain censored expenditure data, depending on the structural censoring mechanisms that a priori information suggests to be at work.

These formal models have been primarily developed to deal with censored household expenditure data, and numerous applications have been carried out in this field. @SMITH:02 presented an overview of these studies. Using an updated overview of these studies carried out by ourselves, we note a late popularity of Cragg's approach, as the first applications of hurdle models were published in the late 1980s, namely almost two decades after the publication of @CRAGG/71's paper. Since then, a large variety of demand models including one or two of the previous three censoring mechanisms have been estimated. However, none of these studies use the three censoring mechanisms we consider jointly. From the 1990s on, many studies account for deviations of the desired consumption relation from homoscedasticity and normality by modeling the standard error of this variable as a non-negative parametric function of some explanatory variables and by transforming its distribution to normality using either the logarithmic, the Box-Cox or the inverse hyperbolic sine transformations. The estimation of a correlation coefficient between disturbances is also performed in several of these studies, with increasing success over time in terms of the statistical significance of the estimates. The practical scope of multiple hurdle Tobit models is not restricted to empirical demand analysis; they have also been fruitfully used in other fields of economics.
This includes labor economics, contingent valuation studies, finance, sport activities, internet use, gambling and production.

Our hurdle models are specified as fully parametric models allowing estimation and inference within an efficient maximum likelihood framework. In order to identify a relevant model specification, goodness of fit measures for model evaluation as well as Vuong tests for discriminating between nested and non-nested models have been implemented in the `mhurdle` package. Vuong tests remarkably permit the comparison of two competing models when neither of them contains the true mechanism generating the sample of observations. More precisely, such tests allow assessing which of the two competing models is closest to the true unknown model according to the Kullback-Leibler information criterion. Therefore, such tests are not intended, as classical Neyman-Pearson tests are, to pinpoint the chimeric true model, but to identify a best parametric model specification (with respect to the available observations) among a set of competing specifications. As a consequence, they can provide inconclusive results, which prevents disentangling some competing models, and when they are conclusive, they do not guarantee an identification of the relevant model specification.

Survival models are implemented in `R` with the `survival` package of @SURVA/08. The problem of censored data also has close links with the problem of selection bias, for which some methods are implemented in the `sampleSelection` package of @TOOME/HENNI/08. It is also worth mentioning that a convenient interface to `survreg`, called `tobit`, particularly aimed at econometric applications, is available in the `AER` package of @KLEI/ZEIL/08. More enhanced censored regression models (left and right censoring, random effect models) are available in the `censReg` package [@HENN:13]. Some flavors of hurdle models have also been developed for count data and are implemented in the `hurdle` function of the `pscl` package [@ZEIL:KLEI:JACK:08].

The paper is organised as follows: Section 2 presents the rationale of our modelling strategy. Section 3 presents the theoretical framework for model estimation, evaluation and selection. Section 4 discusses the software rationale used in the package. Section 5 illustrates the use of `mhurdle` with a real-world example. Section 6 concludes.

# Modelling strategy

## Model specification

Our modeling strategy is intended to model the level $y$ of expenditures of a household for a given good or service during a given period of observation. To this purpose, we use up to three zero expenditure generating mechanisms, called hurdles, and a demand function. Each hurdle is represented by a qualitative binary response model resting on one of the following three latent dependent variable relations:

\begin{equation}
(\#eq:threeeqs)
\left\{
\begin{array}{l}
y_1^* = \beta_1^\top x_1 + z_1 \\[4pt]
T(y_2^*) = \beta_2^\top x_2 + \sigma z_2 \\[4pt]
y_3^* = \beta_3^\top x_3 + z_3 \\
\end{array}
\right.
\end{equation}

where $y_1^*$, $y_2^*$, $y_3^*$ stand for continuous latent dependent variables, $x_1$, $x_2$, $x_3$ for column-vectors of explanatory variables (called covariates in what follows), $\beta_1$, $\beta_2$, $\beta_3$ for column-vectors of the impact coefficients of the explanatory variables on the latent dependent variables, and $z_1$, $z_2$, $z_3$ for standard normal random disturbances. Since the variables $y_1^*$ and $y_3^*$ are never observed, contrary to $y_2^*$, the units of measurement of $y_1^*$ and $y_3^*$ are not identified.
Hence, the disturbances in the corresponding binary response models are normalized by setting their variances equal to 1. Relying on a proposal of @JOHN:49, $T(\cdot)$ denotes a differentiable one-to-one transformation intended to rescale a potentially non-normally distributed random variable $y_2^*$ into a normally distributed one, with standard deviation $\sigma$. As the domain of definition of the inverse transformation $T^{-1}(\cdot)$ may be restricted to a subset of real numbers, the standard normal random disturbance $z_2$ is potentially truncated at the bounds of an interval $[B_1,B_2]$ with $B_1$ and/or $B_2$ finite. To account for heteroscedasticity of $z_2$, in `mhurdle` the standard deviation $\sigma$ can be specified as a positive monotonic transformation of a linear function of a vector of covariates $x_4$, namely $\beta_4^\top x_4$. However, for notational simplicity, we will adopt in what follows the notation of a homoscedastic standard deviation $\sigma$ rather than the more general one of a heteroscedastic standard deviation $\sigma(\beta_4^\top x_4)$.^[For a practical application of this parametric model of heteroscedasticity, a functional form of $\sigma(\beta_4^\top x_4)$ written as $\exp\{\beta_{40}F(\beta_4^\top x_4)\}$, where $F(.)$ is the distribution function of the standard normal random variable, has been programmed in `mhurdle`. This functional form remains bounded with respect to any possible values of covariates $x_4$ and allows testing the assumption of homoscedasticity, $\sigma=\exp\{\beta_{40}/2\}$, by assessing the statistical non-significance of parameter vector $\beta_4$ without an intercept.]

These equations model the hurdles, namely the sequence of binary decisions a consumer faces when buying a good. Each hurdle decision is coded by a binary variable $I_j$ taking value 1 if $y_j^*>0$ and $0$ otherwise. Therefore, the outcome of each of these decisions is modeled as a probit model with outcome probabilities $\mbox{P}(I_j = 1) =\mbox{P}(y_j^*>0)$ and $\mbox{P}(I_j = 0) =1-\mbox{P}(y_j^*>0)$.

By assuming that consumption and purchases are uniformly distributed over time, but according to different timetables entailing a frequency of consumption higher than that of purchasing, we can also interpret the probability $\mbox{P}(I_3=1)=\mbox{P}(y_3^*>0)$ as measuring the ratio of the purchasing frequency to the consumption frequency during the observation period. This allows relating the observed level of expenditures $y$ to the unobserved level of consumption $y_2^*$ during the observation period, using the following identity:

\begin{equation}
(\#eq:expcons)
y = \frac{T^{-1} (\beta_2^\top x_2 + \sigma z_2)}{\mbox{P}(I_3=1)} I_1 I_2 I_3 .
\end{equation}

This modeling strategy provides an explanatory framework where censoring of the observed dependent variable results from the effect of up to three different censoring mechanisms, namely $I_1=0$ and/or $I_2=0$ and/or $I_3=0$. A priori information (theoretical or real-world knowledge) may suggest that one or more of these censoring mechanisms are not in effect. For instance, we know in advance that all households consume food regularly, implying that the first two censoring mechanisms are inoperative for food. In this case, the relevant model is defined by only two relations: one defining the desired consumption level of food and the other the decision to purchase food during the observation period.
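To fix ideas, the following minimal sketch simulates observations from this censoring structure under simplifying assumptions that are not part of the model itself: an identity transformation $T$, a single covariate common to the three equations and independent disturbances; all object names are purely illustrative.

```{r eval = FALSE}
set.seed(1)
n <- 1000
x <- runif(n)                                  # a single covariate used in all three equations
b1 <- c(0.5, 1); b2 <- c(0.2, 2); b3 <- c(-0.3, 1.5)
sigma <- 1
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n) # independent standard normal disturbances

y1s <- b1[1] + b1[2] * x + z1                  # hurdle 1: good selection
y2s <- b2[1] + b2[2] * x + sigma * z2          # desired consumption (identity transformation T)
y3s <- b3[1] + b3[2] * x + z3                  # hurdle 3: purchasing

P3 <- pnorm(b3[1] + b3[2] * x)                 # P(I3 = 1), inflating expenditures of infrequent buyers
y  <- ifelse(y1s > 0 & y2s > 0 & y3s > 0, y2s / P3, 0)
mean(y == 0)                                   # share of zero expenditures
```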
When the first and/or the third hurdles are supposed not to be in effect, the previous binary response models for $I_1$ and/or $I_3$ must be replaced by singular probability response models where $\mbox{P}(I_1)=1$ and/or $\mbox{P}(I_3)=1$. When the second hurdle is supposed to be inoperative, we must not only replace the binary response model explaining $I_2$ by a singular probability response model where $\mbox{P}(I_2 = 1) =1$ but also select a monotonic transformation $T(y_2^*)$ enforcing non-negative values of $y_2^*$. @CRAGG/71 suggested two types of functional forms of the distribution of $y_2^*$ enforcing such a constraint, namely the (inverse) log-normal distribution:

\begin{equation}
(\#eq:lognorm)
y_2^* =\exp\{\beta_2^\top x_2+\sigma z_2\}
\end{equation}

and the normal distribution left-truncated at $0$:

\begin{equation}
(\#eq:trunc)
y_2^* =\max \{0;\beta_2^\top x_2+\sigma z_2\}=\beta_2^\top x_2 + \sigma \max \{B_1;z_2\}
\end{equation}

where $\max \{B_1;z_2\}$ is a standard normal random variable left-truncated at $B_1=-\beta_2^\top x_2/\sigma$.

To account for these two functional forms within our general formal framework \@ref(eq:threeeqs) and illustrate in section 5 their empirical use with `mhurdle`, in this paper we consider two transformations $T(y_2^*)$.^[To model possible departures of the observed dependent variable $y$ from normality towards distributions of the kind encountered with real-world economic data, two families of flexible parametric transformations $T(.)$ were programmed in `mhurdle`, namely the two-parameter @BOX:COX:64 transformation, as suggested by @LANK/WYCK/91 and @CHAZ:05, and @JOHN:49's one-parameter inverse hyperbolic sine transformation. The first family, which includes transformations \@ref(eq:trunc) and \@ref(eq:lognorm2) as special cases, generates a broad range of skewed distributions, while the second family generates a broad range of symmetric leptokurtic (more sharply peaked than the normal) distributions.] The first transformation is the identity transformation, which generates an un-truncated normal distribution of $y_2^*$ when hurdle 2 is in effect, and the left-truncated normal distribution \@ref(eq:trunc) when hurdle 2 is not in effect. The second transformation is a slight generalization of the logarithmic transformation leading to the (inverse) log-normal distribution \@ref(eq:lognorm), namely:

\begin{equation}
(\#eq:lognorm2)
T(y_2^*) = \ln (y_2^* + \alpha)
\end{equation}

with $\alpha$ a location parameter which shifts the left-truncation bound of $y_2^*$ towards negative or positive values according to the sign of $\alpha$. As the inverse of this transformation is written as:

\begin{equation}
(\#eq:lognorm2i)
y_2^* = \exp\{\beta_2^\top x_2+\sigma z_2\} - \alpha
\end{equation}

it turns out that this transformation allows modeling skewed distributions of $y_2^*$ according to a translated (inverse) log-normal functional form where the support of $z_2$ is un-truncated. By the same token, testing the statistical significance of parameter $\alpha$, against the alternative $\alpha > 0$, amounts to testing the assumption that hurdle 2 is not in effect. Conversely, when $\alpha < 0$ the transformation \@ref(eq:lognorm2) holds for $y_2^* > -\alpha$, meaning that $-\alpha$ stands for a "committed" consumption of a basic necessity, while $\alpha > 0$ typifies a luxury good whose consumption occurs only above a given income threshold.
^[This economic interpretation of parameter $\alpha$ may justify modeling this parameter as a function of some covariates, like household demographic characteristics.]

To derive the form of the probability distribution of the observable dependent variable $y$, we must still specify the joint distribution of the random disturbances entering the hurdle relations assumed to be in effect and that of the demand function, which in turn depends on the domain of definition of transformation $T^{-1}(\cdot)$. For trivariate hurdle models, where hurdles 1 and 3 are in effect (but not necessarily hurdle 2), the joint density function of $z_1$, $z_2$ and $z_3$ is a possibly truncated standard trivariate normal density function written as:

\begin{equation}
(\#eq:trivdens)
\phi\left(z_1,z_2,z_3;\rho \right)/\Pi
\end{equation}

where $\Pi=\Phi(B_2)-\Phi(B_1)$, with $\Phi(z)$ the distribution function of a standard univariate normal distribution and $\phi(z_1,z_2,z_3;\rho)$ the density function of a standard trivariate normal distribution, with $\rho$ the vector of the three correlation coefficients $\rho_{12}$, $\rho_{13}$, $\rho_{23}$ between the couples of standard normal random variables $z_1$ and $z_2$, $z_1$ and $z_3$, $z_2$ and $z_3$, respectively. For bivariate hurdle models, where either hurdle 1 or hurdle 3 is not in effect, the joint density function of $z_i$, with $i=1$ or $3$, and $z_2$ is a possibly truncated standard bivariate normal density function written as:

\begin{equation}
(\#eq:bivdens)
\phi\left(z_i,z_2;\rho_{i2}\right)/\Pi
\end{equation}

where $\phi\left(z_i,z_2;\rho_{i2}\right)$ denotes the density function of a standard bivariate normal distribution. Finally, for univariate hurdle models, where both hurdles 1 and 3 are not in effect, the density function of $z_2$ is a possibly truncated standard univariate normal density function:

\begin{equation}
(\#eq:unidens)
\phi(z_2)/\Pi
\end{equation}

where $\phi(z_2)$ denotes the density function of a standard univariate normal distribution.

While the assumption of correlated disturbances is intended to account for the interdependence between latent variables $y_1^*$, $y_2^*$ and $y_3^*$ unexplained by covariates $x_1$, $x_2$ and $x_3$, a priori information (theoretical or real-world knowledge) may also suggest setting to zero some or all correlations between the random disturbances entering these models, entailing a partial or total independence between model relations. The use of this a priori information generates, for each trivariate or bivariate hurdle model, a subset of special models all nested within the general model from which they are derived. For a trivariate hurdle model the number of special models so derived is equal to seven, but for a bivariate hurdle model only one special model is generated, namely the model obtained by assuming independence between the two random disturbances of the model.

## Probability distribution of the censored dependent variable

\newcommand{\vsup}[1]{\overline{#1}}
\newcommand{\vinf}[1]{\underline{#1}}
\renewcommand{\vsup}[1]{B}
\renewcommand{\vinf}[1]{A}

As for the standard Tobit model, the probability distribution of the observed censored variable $y$ of our hurdle models is a discrete-continuous mixture, which assigns a probability mass $\mbox{P}(y=0)$ to $y=0$ and a density function $f_+(y)$ to any $y>0$, with:

\begin{equation}
(\#eq:sumone)
\mbox{P}(y=0)+\int_0^\infty f_+(y)dy=1.
\end{equation}

To compute the probability mass $\mbox{P}(y=0)=1-\mbox{P}(y>0)$ and the density function $f_+(y)$ we must first establish the joint density function of the latent variables entering the hurdle model. For trivariate hurdle models, the joint density function of latent variables $y_1^*$, $y_2^*$ and $y_3^*$ can be derived from the density function \@ref(eq:trivdens) of variables $z_1$, $z_2$ and $z_3$ by means of the change of variables:

\begin{equation}
(\#eq:threez)
\left\{
\begin{array}{l}
z_1 = y_1^* - \beta_1^\top x_1 \\
z_2 = (T(y_2^*) - \beta_2^\top x_2)/\sigma \\
z_3 = y_3^* - \beta_3^\top x_3 \\
\end{array}
\right.
\end{equation}

which leads to:

\begin{equation}
(\#eq:threedens)
f(y_1^*, y_2^*, y_3^*)=\frac{T^{'}(y_2^*)}{\sigma} \frac{\phi\left(y_1^* - \beta_1^\top x_1, (T(y_2^*)-\beta_2^\top x_2)/\sigma, y_3^* - \beta_3^\top x_3;\rho \right)}{\Pi},
\end{equation}

where $T^{'}(y_2^*)$ stands for the derivative of $T(y_2^*)$. To compute $\mbox{P}(y>0)$, we integrate this density function over the three-dimensional positive octant using the change of variables \@ref(eq:threez) and the symmetry property of density function \@ref(eq:trivdens). Performing this integration for a strictly increasing transformation $T(\cdot)$^[For a strictly decreasing transformation, as is the case for some transformations of the Box-Cox family, the integration with respect to $z_2$ must be performed by reversing the direction of integration, from $(T(0)-\beta_2^\top x_2)/\sigma$ to $B_1$. By applying the theorem on the inversion of the integration limits of a definite integral, we obtain a closed form of the integral similar to that of formula \@ref(eq:pzero) with truncation value $B_2$ replaced by $B_1$.], which is the case for the identity and logarithmic transformations we focus on in this paper, and $z_2$ truncated over the interval $[B_1,B_2]$, leads to:

\begin{equation}
(\#eq:pzero)
\begin{array}{l}
\mbox{P}(y>0) = {\displaystyle \int_{-\beta_1^\top x_1}^\infty \int_{\frac{T(0)-\beta_2^\top x_2}{\sigma}}^{B_2} \int_{-\beta_3^\top x_3}^\infty \frac{\phi(z_1,z_2,z_3;\rho)}{\Pi} dz_1 dz_2 dz_3}\\[16pt]
\phantom{\mbox{P}(y>0)} ={\displaystyle \int_{-\infty}^{\beta_1^\top x_1} \int_{-B_2}^{\frac{\beta_2^\top x_2 -T(0)}{\sigma}} \int_{-\infty}^{\beta_3^\top x_3} \frac{\phi(z_1,z_2,z_3;\rho)}{\Pi} dz_1 dz_2 dz_3}\\[16pt]
\phantom{\mbox{P}(y>0)} ={\displaystyle \frac{\Phi(\beta_1^\top x_1,(\beta_2^\top x_2-T(0))/\sigma, \beta_3^\top x_3;\rho)-\Phi(\beta_1^\top x_1,-B_2,\beta_3^\top x_3;\rho)}{\Pi}}.
\end{array}
\end{equation}

Therefore, for $z_2$ left-truncated at $B_1=-\beta_2^\top x_2/\sigma$, which is the case for the truncated normal distribution \@ref(eq:trunc), this closed form for $\mbox{P}(y>0)$ simplifies to:

\begin{equation}
(\#eq:pzero2)
\begin{array}{l}
\mbox{P}(y>0) = {\displaystyle \frac{\Phi(\beta_1^\top x_1,(\beta_2^\top x_2-T(0))/\sigma, \beta_3^\top x_3;\rho)}{\Phi(\beta_2^\top x_2/\sigma)}},
\end{array}
\end{equation}

while for the un-truncated normal and the translated (inverse) log-normal distribution \@ref(eq:lognorm2i) it is written as:

\begin{equation}
(\#eq:pzero3)
\begin{array}{l}
\mbox{P}(y>0) = \Phi(\beta_1^\top x_1,(\beta_2^\top x_2-T(0))/\sigma,\beta_3^\top x_3;\rho).
\end{array}
\end{equation}

To compute the density function $f_+(y)$ we first derive the joint density function of variables $y_1^*$, $y$ and $y_3^*$ by performing the change of variable $y_2^*=\Phi_3 y$ with $\Phi_3=\Phi(\beta_3^\top x_3)=\mbox{P}(I_3=1)$ on the joint density function \@ref(eq:threedens), which leads to:

\begin{equation}
(\#eq:denstriv)
f(y_1^*, y, y_3^*)=\frac{\Phi_3 T^{'}(\Phi_3 y)}{\sigma} \frac{\phi\left(y_1^* - \beta_1^\top x_1, (T(\Phi_3 y)-\beta_2^\top x_2)/\sigma, y_3^* - \beta_3^\top x_3;\rho\right)}{\Pi}.
\end{equation}

We must then integrate this transformed density function over the positive quadrant of the bi-dimensional support of variables $y_1^*$ and $y_3^*$. To this purpose we rewrite this trivariate normal density function as the product of the marginal density function of $y$, written as:

\begin{equation}
(\#eq:mdensy)
f(y)=\frac{\Phi_3 T^{'}(\Phi_3 y)}{\sigma} \frac{\phi\left((T(\Phi_3 y)-\beta_2^\top x_2)/\sigma\right)}{\Pi},
\end{equation}

and of the conditional joint density function $f(y_1^*,y_3^*|y)$, which is bivariate normal with expectations, standard deviations and correlation coefficient written as:

\begin{equation}
(\#eq:paramdist)
\begin{array}{l}
{\displaystyle \mu_{i|2}(y)=\beta_i^\top x_i+\rho_{i2}(T(\Phi_3 y)-\beta_2^\top x_2)/\sigma, \quad \sigma_{i|2}=\sqrt{1-\rho_{i2}^2}, \quad i=1,3} \\[6pt]
{\displaystyle\rho_{13|2}=\frac{\rho_{13}-\rho_{12}\rho_{23}}{\sqrt{1-\rho_{12}^2}\sqrt{1-\rho_{23}^2}}}.
\end{array}
\end{equation}

Integrating this factorization of density function \@ref(eq:denstriv) with respect to positive values of $y_1^*$ and $y_3^*$ leads to the following closed form for $f_+(y)$:

\begin{equation}
(\#eq:denspos)
\begin{array}{l}
f_+(y)=f(y)\int_0^\infty\int_0^\infty f(y_1^*,y_3^*|y) dy_1^*dy_3^* \\ [6pt]
\phantom{f_+(y)}=f(y){\displaystyle \int_{-\frac{\mu_{1|2}(y)}{\sigma_{1|2}}}^\infty \int_{-\frac{\mu_{3|2}(y)}{\sigma_{3|2}}}^\infty \phi\left(z_1,z_3;\rho_{13|2}\right) dz_1 dz_3}\\ [16pt]
\phantom{f_+(y)}=f(y){\displaystyle\Phi\left(\frac{\mu_{1|2}(y)}{\sigma_{1|2}},\frac{\mu_{3|2}(y)}{\sigma_{3|2}};\rho_{13|2}\right)}.
\end{array}
\end{equation}

The probability distribution of the observed censored variable $y$ for bivariate and univariate hurdle models can be derived from that of the trivariate model by eliminating hurdle 1, hurdle 3, or both hurdles 1 and 3, respectively. By eliminating hurdle 1 we obtain:

\begin{equation}
(\#eq:pzerosm1)
\mbox{P}(y>0) = {\displaystyle \frac{\Phi\left((\beta_2^\top x_2 - T(0))/\sigma, \beta_3^\top x_3;\rho_{23}\right)-\Phi\left(-B_2,\beta_3^\top x_3;\rho_{23}\right)}{\Pi}}
\end{equation}

and

\begin{equation}
(\#eq:densposm1)
f_+(y)=\frac{\Phi_3 T^{'}(\Phi_3 y)}{\sigma}\frac{\phi\left((T(\Phi_3 y)-\beta_2^\top x_2)/\sigma\right)}{\Pi} {\displaystyle\Phi\left(\frac{\mu_{3|2}(y)}{\sigma_{3|2}}\right)},
\end{equation}

while the elimination of hurdle 3 leads to:

\begin{equation}
(\#eq:pzerom3)
\mbox{P}(y>0) = {\displaystyle \frac{\Phi\left(\beta_1^\top x_1,(\beta_2^\top x_2 - T(0))/\sigma;\rho_{12}\right) -\Phi\left(\beta_1^\top x_1,-B_2;\rho_{12}\right)}{\Pi}}
\end{equation}

and

\begin{equation}
(\#eq:densposm3)
f_+(y)=\frac{ T^{'}(y)}{\sigma}\frac{\phi\left((T(y)-\beta_2^\top x_2)/\sigma\right)}{\Pi} {\displaystyle\Phi\left(\frac{\mu_{1|2}(y)}{\sigma_{1|2}}\right)}.
\end{equation}

Finally, removing both hurdles 1 and 3 leads to the univariate censored regression model with:

\begin{equation}
(\#eq:pzerom13)
\mbox{P}(y>0) = {\displaystyle \frac{\Phi\left((\beta_2^\top x_2 - T(0))/\sigma\right)-\Phi\left(-B_2\right)} {\Pi}}
\end{equation}

and

\begin{equation}
(\#eq:densposm13)
f_+(y)={\displaystyle \frac{ T^{'}(y)}{\sigma}\frac{\phi\left((T(y)-\beta_2^\top x_2)/\sigma\right)}{\Pi}}.
\end{equation}

# Model estimation, evaluation and selection

The econometric framework described in the previous section provides a theoretical background for tackling the problems of model estimation, evaluation, selection and prediction within the statistical theory of classical inference. In `mhurdle` these statistical issues are tackled by assuming that the data at hand are those observed on a sample of individuals selected from a large population using a sampling design generating a data base having the features of a random sample.^[Data provided by surveys conducted by official statistical offices usually fulfill such a requirement, at least as far as the survey is performed according to a random sampling design where none of the modeled dependent variables was used to design the sampling scheme. On this important issue see Chapter 24 of @CAME:TRIV:05.]

To appraise the results of a model estimation two fundamental principles should be used, namely its economic relevance and its statistical and predictive adequacy. The first principle deals with the issues of the accordance of a model estimate with the economic rationale underlying the model specification and of its relevance for answering the questions for which the model has been built. These issues are essentially context specific and, therefore, cannot be dealt with by means of generic criteria. The second principle refers to the issues of the empirical soundness of an estimated model and of its ability to predict sample or out-of-sample observations. These issues can be tackled by means of measures of goodness of fit, by formal tests of significance based on the distribution of model parameter estimators, and by the quality of predictions provided by an estimated model.

## Likelihood function

The full parametric specification of our multiple hurdle models allows efficient estimation of their parameters by means of the maximum likelihood principle. Indeed, it is well known from classical estimation theory that, under the assumption of a correct model specification and for a likelihood function sufficiently well behaved, the maximum likelihood estimator is asymptotically efficient within the class of consistent and asymptotically normal estimators.^[See @AMEM/85, chapter 4, for a rigorous statement of this property.]

For a random sample of $n$ observations of the censored dependent variable $y$ it is easy to derive the likelihood function from the results set out in section 2.2 above. As these observations are all independently drawn from the same conditional (on covariates $x_1$, $x_2$, $x_3$ and $x_4$) discrete-continuous distribution, which assigns a conditional probability mass $\mbox{P}(y=0)$ to the observed value $y=0$ and a conditional density function $f_+(y)$ to the observed values $y>0$, the log-likelihood function for an observation $y_i$ can be written as:

\begin{equation}
(\#eq:eq26)
\ln L_i =
\left\{
\begin{array}{ll}
\ln \mbox{P}(y_i=0) & \mbox{if} \quad y_i=0 \\
\ln f_+(y_i) & \mbox{if} \quad y_i>0
\end{array}
\right.
\end{equation}

and the log-likelihood for the entire sample:

\begin{equation}
(\#eq:eq27)
\ln L = \sum_{i=1}^n \ln L_i= \sum_{i \mid y_i = 0} \ln \mbox{P}(y_i=0) + \sum_{i \mid y_i > 0} \ln f_+(y_i).
\end{equation}

A maximum likelihood estimate of the model parameters is obtained by numerically maximizing this function with respect to all the unknown model parameters entering it. The algorithms used in `mhurdle` to perform this numerical optimization are described later in section 4.2. To the extent that such an estimate is obtained, classical statistical inference about model parameters $\theta$ under the assumption of a correctly specified model is performed in `mhurdle` using the asymptotic approximation of the distribution of maximum likelihood estimators, which is normal, with expectation $\theta$ and covariance matrix $(nI_A(\theta))^{-1}$, where $I_A(\theta)$ stands for the asymptotic Fisher information matrix of a random sample of $n$ observations.^[A more robust approach to classical statistical inference with respect to model specification errors will be presented later, in subsection 3.3.] For empirical applications this theoretical matrix can be consistently estimated using one of the following two matrices:

$$ -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2\ln L_i(\hat{\theta})}{\partial\theta\partial\theta^\top} \quad \mbox{or} \quad \frac{1}{n} \sum_{i=1}^n \frac{\partial\ln L _i(\hat{\theta})}{\partial\theta}\frac{\partial\ln L_i(\hat{\theta})}{\partial\theta^\top}$$

whose limits (as $n\rightarrow\infty$) are called the Hessian and the gradient versions of $I_A(\theta)$. These matrices are directly provided by two standard iterative methods used to compute $\hat{\theta}$, namely the Newton-Raphson and BHHH methods mentioned in section 4.2.

## Model evaluation and selection using goodness of fit measures

To assess the goodness of fit of a `mhurdle` model estimate we cannot rely on a single measure, as a sample of censored observations of the dependent variable $y$ consists of two sub-samples of incomparable observations, namely the sub-sample of zero observations and that of positive observations. The former is a sample of observations of a qualitative binary variable coding the outcome of the censoring mechanisms at work, while the latter is a sample of observations of a quantitative variable measuring the outcome of the demand function when this demand is uncensored. As a consequence, a measure of goodness of fit for each of these two samples of observations must be used, as goodness of fit is not measured the same way in qualitative binary response models as in quantitative response models.

To this purpose, we reformulate a `mhurdle` model as a two-part model. The first part specifies a model for the censoring mechanism explaining the outcome of the qualitative binary variable:

\begin{equation}
(\#eq:eq28)
d =
\left\{
\begin{array}{ll}
0 & \mbox{if} \quad y=0 \\
1 & \mbox{if} \quad y>0
\end{array}
\right.
\end{equation}

according to a probit model assigning the probability $\mbox{P}(y=0)$ to the censored observation ($d=0$) and the probability $\mbox{P}(y>0)=1-\mbox{P}(y=0)$ to the uncensored observation ($d=1$). Therefore, the log-likelihood for this first-part model is written as:

\begin{equation}
(\#eq:29)
\ln L_0 = \sum_{i=1}^n [(1-d_i) \ln \mbox{P}(y_i=0) + d_i \ln \mbox{P}(y_i>0)].
\end{equation}

The second part specifies, conditionally on an uncensored response being observed, the demand model explaining the outcome of the quantitative variable $y$ according to the following conditional density function:

\begin{equation}
(\#eq:eq30)
f(y|y>0)=f_+(y)/\mbox{P}(y>0).
\end{equation}

For this second-part model, the log-likelihood is written as:

\begin{equation}
(\#eq:eq31)
\ln L_+ = \sum_{i=1}^n d_i [\ln f_+(y_i) - \ln \mbox{P}(y_i>0)].
\end{equation}

Adding up the log-likelihoods of these two parts leads to:

\begin{equation}
(\#eq:eq32)
\begin{array}{l}
\ln L_0 + \ln L_+ = \sum_{i=1}^n [(1-d_i) \ln \mbox{P}(y_i=0) + d_i \ln f_+(y_i)] \\[4pt]
\phantom{\ln L_0 + \ln L_+} = \sum_{i \mid y_i = 0} \ln \mbox{P}(y_i=0) + \sum_{i \mid y_i > 0} \ln f_+(y_i)=\ln L,
\end{array}
\end{equation}

meaning that the two-part model is entirely consistent with an integrated specification of the `mhurdle` model.

To develop a measure of fit for binary response models which may be intuitively interpreted in a similar way to the coefficient of determination $R^2$ in the linear regression model, @ESTR:98 suggests relying on the relationship between $R^2$ and the classical likelihood-based test statistics, namely the likelihood ratio (LR), Wald (W) and Lagrange multiplier/score (LM) statistics. In the normal linear regression model, where such a relationship holds exactly for the likelihood ratio statistic^[For the Wald and Lagrange multiplier statistics such relationships hold only locally, in a neighborhood of the maintained hypothesis of an intercept-only model, and take another functional form.], it is written as:

\begin{equation}
(\#eq:eq33)
R^2 = 1-\exp \{-A\}
\end{equation}

with $A=-(2/n)\ln(L_c/L_u)$ the average value of the likelihood ratio statistic, $L_u$ the maximum value of the unconstrained likelihood, and $L_c$ the maximum value of the likelihood constrained by the hypothesis of a model without covariates (intercept-only model). From this relationship, it follows that $R^2=0$ for $A=0$, which corresponds to the case where the fit of the unconstrained model is identical to that of the constrained model (worst fit), and $R^2=1$ when $A=\infty$, the upper bound of the likelihood ratio statistic achieved when the fit of the unconstrained model is perfect (best fit). Between these two limits $R^2$ increases with $A$ according to a rescaling of $A$ determined by a differential equation written as:

\begin{equation}
(\#eq:eq34)
dR^2/(1-R^2) = dA.
\end{equation}

These properties of relationship \@ref(eq:eq33) were adopted by @ESTR:98 as a set of requirements for developing an $R^2$ analog of the measure of fit for a binary response model. Taking into account that, contrary to the case of a normal linear regression model, for a binary response model the average likelihood ratio statistic $A$ varies between 0 (when $L_u=L_c$) and an upper finite bound $B=-(2/n)\ln L_c$ (when $L_u=1$), an $R^2$ analog for a binary response model is obtained by solving the differential equation:

\begin{equation}
(\#eq:eq35)
dR^2/(1-R^2) = dA/(1-A/B)
\end{equation}

with the initial condition $R^2(0)=0$. This solution:

\begin{equation}
(\#eq:eq36)
R^2 = 1-(1-A/B)^B
\end{equation}

provides a rescaling of the likelihood ratio statistic for the binary response model that satisfies the condition $R^2(B)=1$ and converges towards the rescaling \@ref(eq:eq33) when $B\rightarrow\infty$.
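As a numerical illustration of rescaling \@ref(eq:eq36), the following minimal sketch (with made-up log-likelihood values, for illustration only) computes this $R^2$ analog from the unconstrained and constrained log-likelihoods of a binary response model:

```{r eval = FALSE}
# Estrella's R2 analog: A is the average likelihood ratio statistic and
# B = -(2/n) ln Lc its upper finite bound; the rescaling is 1 - (1 - A/B)^B.
estrella_r2 <- function(lnLu, lnLc, n) {
  A <- - (2 / n) * (lnLc - lnLu)
  B <- - (2 / n) * lnLc
  1 - (1 - A / B)^B
}
estrella_r2(lnLu = -480, lnLc = -620, n = 1000)
```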
To conclude, for assessing the goodness of fit of the two parts of a `mhurdle` model we shall use the following $R^2$-rescaling of likelihoods $L_0$ and $L_+$:

\begin{equation}
(\#eq:eq37)
R_0^2 = 1-(\ln L_{0c}/\ln L_{0u})^{(2/n)\ln L_{0c}} \quad \mbox{and} \quad R_+^2 = 1-(L_{+c}/L_{+u})^{2/n_+}
\end{equation}

with $n_+=\sum_{i=1}^n d_i$ the sample size of uncensored observations.

Note that the estimation of an intercept-only `mhurdle` model is not generally possible in its original parametrization because, in the absence of covariates, many parameters are under-identified and thus cannot be consistently estimated. To compute the maximum values of the constrained likelihoods $L_{0c}$ and $L_{+c}$, we must respecify the probability distribution of the dependent variable $y$ by means of identifiable parameters. To this end we observe from formula \@ref(eq:pzero) that $\mbox{P}(y>0)$ is a constant $P_+$, and from formula \@ref(eq:denspos) that $f_+(y)$ is proportional to the density function of $T(y)$, which is normal with a constant expectation $\mu$ and a constant variance $s^2$. Hence, the log-likelihood of the constrained first-part model is written as:

\begin{equation}
(\#eq:eq29a)
\ln L_{0c}=(n-n_+) \ln (1-P_+) + n_+ \ln P_+,
\end{equation}

and provides the following maximum likelihood estimator for $P_+$: $\hat{P}_+=n_+/n$. Similarly, the log-likelihood of the constrained second-part model is written as:

\begin{equation}
(\#eq:eq31a)
\ln L_{+c} = \sum_{i=1}^n d_i \ln \phi\left(\frac{T(y_i)-\mu}{s}\right) - n_+ \ln (sP_+),
\end{equation}

leading to the following estimators for parameters $\mu$ and $s^2$: $\hat{\mu} = \sum_{i=1}^n d_i T(y_i)/n_+$ and $\hat{s}^2=\sum_{i=1}^n d_i (T(y_i)-\hat{\mu})^2/n_+$. The maximized values of constrained likelihoods $L_{0c}$ and $L_{+c}$ are obtained by inserting these estimators for parameters $P_+$, $\mu$ and $s^2$ into formulas \@ref(eq:eq29a) and \@ref(eq:eq31a). Note that for the logarithmic transformation \@ref(eq:lognorm2) with a non-zero location parameter $\alpha$, the constrained log-likelihood \@ref(eq:eq31a) must also be maximized with respect to this parameter. This maximization may be performed effectively by considering $\hat{\mu}$ and $\hat{s}^2$ as functions of $\alpha$, and maximizing the concentrated log-likelihood obtained by inserting these conditional estimators into the constrained log-likelihood \@ref(eq:eq31a), which becomes a function of the single parameter $\alpha$ written as:

\begin{equation}
(\#eq:eq31b)
\ln \max_{\mu,s^2}L_{+c} = -\frac{n_+}{2}\left[1+\ln\left(2\pi \frac{n_+}{n}\right)+\ln\hat{s}^2(\alpha)\right].
\end{equation}

Goodness of fit measures \@ref(eq:eq37) can also be used for model selection, a decision problem dealing with the identification, among several model specifications used to explain the same dependent variable, of the one that is best suited to explain the sample of available observations. From this point of view, goodness of fit measures \@ref(eq:eq37) allow identifying the model specification achieving the best in-sample fit. To be fair, the goodness of fit measure used for model selection must account for differences in the degree of model parametrization. Indeed, the value of the above goodness of fit measures can be improved by increasing the model parametrization, because model estimation is performed by maximizing the model likelihood function, on which the goodness of fit measures we consider functionally depend.
Consequently, a penalty that increases with the number of model parameters should be added to formulas \@ref(eq:eq37). To this purpose, we shall rely on Theil's (1971) correction of the coefficient of determination in the linear regression model. This leads to the following adjusted $R^2$-rescaling of likelihoods $L_0$ and $L_+$:

\begin{equation}
(\#eq:eq37a)
\bar{R}_0^2 = 1-\frac{n-1}{n-K}(\ln L_{0c}/\ln L_{0u})^{(2/n)\ln L_{0c}} \quad \mbox{and} \quad \bar{R}_+^2 = 1-\frac{n-K_0}{n-K}(L_{+c}/L_{+u})^{2/n_+}
\end{equation}

with $K$ and $K_0$ the number of parameters in the unconstrained model and the number of parameters in the constrained second-part model, respectively. Such adjustments may not be relevant in large samples but may matter in small samples.

## Model selection using Vuong tests

Model evaluation can also be tackled from the point of view of the model specification that is favored in a formal test comparing two model alternatives. This second model selection criterion relies on the use of a test proposed by @VUONG/89. According to the rationale of this test, the "best" parametric model specification among a collection of competing specifications is the one that minimizes the Kullback-Leibler Information Criterion (KLIC), namely a measure of the distance between the conditional probability mass/density function^[For `mhurdle` models, this conditional probability function is written as: $$f(y|x)=\mbox{P}(y=0|x)^{1-d}f_+(y|x)^d,$$ with $d$ defined by formula \@ref(eq:eq28).] $f(y|x;\theta)$ of a possibly misspecified parametric model and that of the true unknown model denoted by $h(y|x)$, defined by the following formula:

\begin{equation}
(\#eq:eq38)
KLIC=E\left[\ln \left(\frac{h(y|x)}{f(y|x;\theta_\ast)}\right)\right]=\int \ln \left(\frac{h(y|x)}{f(y|x;\theta_\ast)}\right)dH(y,x),
\end{equation}

where $H(y,x)$ denotes the distribution function of the true joint distribution of $(y,x)$ and $\theta_\ast$ the probability limit, with respect to $H(y,x)$, of the so-called quasi-maximum likelihood estimator $\hat{\theta}$ obtained by applying the maximum likelihood method when $f(y|x;\theta)$ is misspecified. Note that the KLIC takes a minimum value of 0 when there is a value $\theta_0$ of parameter vector $\theta$ such that $f(y|x;\theta_0)=h(y|x)$ almost everywhere^[By "almost everywhere" we mean: for all points of the support of the true distribution of $(y,x)$ except on a set of points of probability zero, i.e. such that $\mbox{P}_H[(y,x): f(y|x;\theta_0)\neq h(y|x)]=0$.], meaning that the conditional probability function $f(y|x;\theta)$ is correctly specified. Conversely, the KLIC takes large values when $f(y|x;\theta)$ specifies the true conditional probability function $h(y|x)$ very poorly.
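The comparison developed below rests on a likelihood ratio statistic denoted $T_{LR}$. As a point of reference, a minimal sketch of the classical @VUONG/89 statistic for two strictly non-nested models, computed from per-observation log-likelihoods `lf` and `lg` (illustrative names, not objects returned by `mhurdle`), could read:

```{r eval = FALSE}
# Classical Vuong statistic: the sum of the individual log-likelihood ratios
# divided by sqrt(n) times their standard deviation; under the null of
# equivalent, strictly non-nested models it is asymptotically N(0, 1).
vuong_stat <- function(lf, lg) {
  m <- lf - lg
  n <- length(m)
  omega <- sqrt(mean((m - mean(m))^2))
  sum(m) / (sqrt(n) * omega)
}
# two-sided p-value: 2 * pnorm(-abs(vuong_stat(lf, lg)))
```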
As the application of this model selection criterion amounts to selecting the model specification for which the quantity:

\begin{equation}
(\#eq:eq39)
E[\ln f(y|x;\theta_\ast)]=\int \ln f(y|x;\theta_\ast)dH(y,x)
\end{equation}

is the largest, given two competing models with conditional probability functions $f(y|x;\theta)$ and $g(y|x;\pi)$ and parameter vectors $\theta$ and $\pi$ of size $K$ and $L$, respectively, Vuong suggests discriminating between these models by testing the null hypothesis:

$$ H_0 : E[\ln f(y|x;\theta_\ast)]=E[\ln g(y|x;\pi_\ast)] \Longleftrightarrow E\left[\ln\frac{f(y|x;\theta_\ast)}{g(y|x;\pi_\ast)}\right]=0, $$

meaning that the two models are equivalent, against either:

$$ H_f : E[\ln f(y|x;\theta_\ast)]>E[\ln g(y|x;\pi_\ast)] \Longleftrightarrow E\left[\ln\frac{f(y|x;\theta_\ast)}{g(y|x;\pi_\ast)}\right]>0, $$

meaning that specification $f(y|x;\theta)$ is better than $g(y|x;\pi)$, or:

$$ H_g : E[\ln f(y|x;\theta_\ast)]<E[\ln g(y|x;\pi_\ast)] \Longleftrightarrow E\left[\ln\frac{f(y|x;\theta_\ast)}{g(y|x;\pi_\ast)}\right]<0, $$

meaning that specification $g(y|x;\pi)$ is better than $f(y|x;\theta)$. The test relies on the statistic $T_{LR}$, which is asymptotically standard normal under $H_0$ for non-nested models, and on a modified statistic $T_{LR}^{mod}$, whose critical value $\mbox{cv}(a)$ for an asymptotic size $a$ is obtained by simulation and satisfies, for a two sided test, $P[|T_{LR}^{mod}|>\mbox{cv}(a)|H_0]=a$ and, for a one sided test, $P[T_{LR}^{mod}>\mbox{cv}(a)|H_0]=a$. By systematically changing the value of $a$ in order to equalize the simulated critical value $\mbox{cv}(a)$ to the observed value of statistic $T_{LR}^{mod}$, it is possible to derive a simulated p-value and hence to perform a non-degenerate Vuong test at any controlled asymptotic size level.

For non-nested models the test is two sided. It is carried out by rejecting $H_0$ if the simulated p-value is less than the selected size level $a$, in favor of either $H_f$ if $T_{LR}^{mod}>0$ or $H_g$ if $T_{LR}^{mod}<0$. For nested models the test is one sided, because the nested model cannot be closer to the true unknown model than the nesting model according to the KLIC criterion. Supposing without loss of generality that model $G_\pi$ is nested into model $F_\theta$, the test is carried out by rejecting $H_0$ in favor of $H_f$ if the simulated p-value is less than the selected size level $a$. Note that this test is a robust alternative to the usual LR-test for testing the statistical significance of a priori restrictions on the parameters of model $F_\theta$, in the sense that it allows the unrestricted model to be misspecified.

A last comment deserves to be devoted to the choice of the size $a$ of the test, which measures the probability of rejecting $H_0$ when it is true, and thus the risk of favoring, on grounds of statistical evidence, the use of one of two competing models when they are in fact equivalent. This choice is routinely made by assuming a low value of $a$ (5\% or 1\%) without referring to the operational meaning of the test result. But in a decision-making framework of selection between two models, the size of the test can be chosen as large as desired, because if the two models are equivalent it is irrelevant to use one or the other of these two models. On the other hand, the risk of the second kind, namely the probability of rejecting either $H_f$ or $H_g$ when one of these two hypotheses is true, must be minimized, because concluding that the two models are equivalent does not allow deciding which one of them should be used. Not rejecting $H_0$ simply points out that further sample information is needed to determine, with the desired degree of certainty, which of the two model alternatives is the more suitable for the intended uses. But how should one proceed if further data-gathering is impossible, which is the typical situation faced in econometric studies?
In this situation the discrimination test between two models must be reformulated as a @SIMO:43 symmetric test, namely a binary choice between models $F_{\theta}$ and $G_{\pi}$ where neither of these models is privileged as the null hypothesis of a classical asymmetric Neyman-Pearson test. This reformulation removes the irrelevant hypothesis of doubt $H_0$ from the decision-making framework by forcing the test to always conclude in favor of one of the two confronted models, which seems appropriate when there is no a priori evidence to consider one of the two models more plausible than the other.^[The assumption of a Bayesian probability of one half for the a priori plausibility of both confronted models may not be appropriate in the case of two nested models, where one is a special case of the other, and thus the former is less plausible than the latter. For these models, an asymmetric Neyman-Pearson test where the restricted model represents the null hypothesis and the unrestricted model the alternative hypothesis seems more appropriate, in particular when the model parameters on which restrictions are tested have an economic meaning.]

The test is carried out by first computing the value of statistic $T_{LR}$ or $T_{LR}^{mod}$, then by rejecting model $G_{\pi}$ in favor of model $F_{\theta}$ when this value is positive, and conversely by rejecting $F_{\theta}$ in favor of $G_{\pi}$ when this value is negative.^[For nested models, Simon's approach always leads to favoring the nesting (unrestricted) model over the nested (restricted) one.] The probability of committing an error with this symmetric test (accepting model $F_{\theta}$ when it is worse than $G_{\pi}$)^[There is no need to distinguish between errors of the first and second kind, since an error of the first kind for testing $H_f$ against $H_g$ (rejecting model $F_{\theta}$ when it is better than model $G_{\pi}$) is an error of the second kind for testing $H_g$ against $H_f$ (accepting model $G_{\pi}$ when it is worse than $F_{\theta}$), and vice versa.] is a function of $D=E\left[\ln(f(y|x;\theta_\ast)/g(y|x;\pi_\ast))\right]$, since the tested hypothesis is a composite one leaving unspecified the absolute value of the distance $D$ between the distributions of models $F_{\theta}$ and $G_{\pi}$. This risk is maximum for $D=0$ and decreases with $|D|$. The exact value of this risk can be found by integrating the density function of statistic $T_{LR}$ or $T_{LR}^{mod}$ over its negative values.^[For the Vuong statistic $T_{LR}$ this distribution for testing $H_f$ against $H_g$ is $N(D,1)$ with $D\geq0$. Hence, the risk of rejecting model $F_{\theta}$ when it is better than model $G_{\pi}$ is equal to $\mbox{P}(N(0,1)<-D)=\Phi(-D)$. This probability decreases from 1/2 to 0 when $D$ increases from 0 to $\infty$.]

## Prediction and marginal effects

Prediction of an observable dependent random variable $z$ with a causal model like `mhurdle` refers to defining a function of the model covariates $x$ providing, for any admissible value of $x$, a point estimate $\hat{z}$ of $z$ whose prediction error $e=z-\hat{z}$ is minimized according to some criterion. Within the classical or frequentist statistical inference paradigm, an optimal conditional predictor of $z$, for a given value of covariates $x$, is defined as the one that minimizes the expectation of the squared conditional prediction error, namely $E[e^2|x]$, called the mean-square error of prediction.
Using this criterion leads to selecting as the optimal predictor for $z$ its conditional expectation $E[z|x]$, called the best mean-square error predictor of $z$. As defined by identity \@ref(eq:expcons), the observable dependent variable $y$ of `mhurdle` is a mixture of a qualitative and a quantitative variable whose expectation has no actual meaning, the zero value of $y$ denoting the numerical value taken by the binary dummy variable $d$ when the observation of latent variable $y_2^*$ is censored. Therefore, for a predictive purpose, one must disentangle the qualitative component of $y$ from its purely quantitative component. This can be done by solving the prediction problem of $y$ in the framework of the two-part respecification of `mhurdle` models presented in section 3.2.

The first part of this model explains the outcome of the qualitative component of $y$ using the binary variable $d$, whose best mean-square error predictor is given by:

\begin{equation}
(\#eq:eq47)
E[d|x]=0\times \mbox{P}(y=0)+ 1\times\mbox{P}(y>0)=\mbox{P}(y>0),
\end{equation}

with $\mbox{P}(y>0)$ specified by formulas \@ref(eq:pzero), \@ref(eq:pzero2), \@ref(eq:pzero3), \@ref(eq:pzerosm1), \@ref(eq:pzerom3) and \@ref(eq:pzerom13). The second part of the model explains the outcome of the quantitative component of $y$ taking positive real values according to the density function \@ref(eq:eq30). Accordingly, the best mean-square error predictor of $y>0$ is given by:

\begin{equation}
(\#eq:eq48)
E[y|y>0,x]=\int_0^\infty y \frac{f_+(y)}{\mbox{P}(y>0)}dy=\frac{E[y|x]}{\mbox{P}(y>0)},
\end{equation}

with $f_+(y)$ specified by formulas \@ref(eq:denspos), \@ref(eq:densposm1), \@ref(eq:densposm3) and \@ref(eq:densposm13). For the special cases of the identity and logarithmic transformations $T(\cdot)$ on which we focus in this paper, closed analytical forms for $E[y|x]$ can be established^[Proofs for the analytical forms \@ref(eq:eq49) and \@ref(eq:eq53) are available from the authors.].
For the identity transformation $T(y_2^*)=y_2^*$, this analytical closed form for trivariate hurdle models with density function $f_+(y)$ specified by formula \@ref(eq:denspos) is written as:

\begin{equation}
(\#eq:eq49)
E[y|x]=\frac{\Phi_{123}}{\Phi_3}\beta_2^\top x_2 + \frac{\sigma}{\Phi_3}\left( \phi_2\Phi_{13\mid 2} +\rho_{12}\phi_1\Phi_{23\mid 1} +\rho_{23}\phi_3\Phi_{12\mid 3}\right),
\end{equation}

where

$$
\begin{array}{l}
{\displaystyle \Phi_{123}=\Phi\left(\beta_1^\top x_1, \frac{\beta_2^\top x_2}{\sigma}, \beta_3^\top x_3;\rho\right), \quad \phi_i=\phi(\beta_i^\top x_i), \quad i=1,3 , \quad \phi_2=\phi\left(\frac{\beta_2^\top x_2}{\sigma}\right), } \\[6pt]
{\displaystyle \Phi_{ij\mid k} = \Phi\left(\frac{\mu_{i|k}}{\sigma_{i|k}},\frac{\mu_{j|k}}{\sigma_{j|k}};\rho_{ij|k}\right), \quad i\neq j\neq k,\quad \mu_{i|k}=\beta_i^\top x_i-\rho_{ik}\beta_k^\top x_k, \quad k=1,3 ,} \\[6pt]
{\displaystyle \quad \mu_{i|2}=\beta_i^\top x_i-\rho_{i2}\frac{\beta_2^\top x_2}{\sigma}, \quad \sigma_{i|k}=\sqrt{1-\rho_{ik}^2}, \quad \rho_{ij|k}=\frac{\rho_{ij}-\rho_{ik}\rho_{jk}}{\sigma_{i|k}\sigma_{j|k}}.}
\end{array}
$$

The closed forms for bivariate and univariate hurdle models with density function $f_+(y)$ specified by formulas \@ref(eq:densposm1), \@ref(eq:densposm3) and \@ref(eq:densposm13) are written as:

\begin{equation}
(\#eq:eq50)
E[y|x]=\frac{\Phi_{23}}{\Phi_3}\beta_2^\top x_2 + \frac{\sigma}{\Phi_3}\left( \phi_2\Phi_{3\mid 2} +\rho_{23}\phi_3\Phi_{2\mid 3}\right),
\end{equation}

\begin{equation}
(\#eq:eq51)
E[y|x]=\Phi_{12}\beta_2^\top x_2 + \sigma\left( \phi_2\Phi_{1\mid 2}+\rho_{12}\phi_1\Phi_{2\mid 1}\right),
\end{equation}

\begin{equation}
(\#eq:eq52)
E[y|x]=\Phi_2\beta_2^\top x_2 + \sigma \phi_2,
\end{equation}

respectively. Similarly, closed analytical forms for $E[y|x]$ can be established in the case of the logarithmic transformation $T(y_2^*)=\ln(y_2^*+\alpha)$. For trivariate, bivariate and univariate hurdle models with density function $f_+(y)$ specified by formulas \@ref(eq:denspos), \@ref(eq:densposm1), \@ref(eq:densposm3) and \@ref(eq:densposm13) these closed forms are written, respectively, as:

\begin{equation}
(\#eq:eq53)
\mbox{E}[y|x] = \frac{e^{\beta_2^\top x_2 + \frac{1}{2}\sigma^2}}{\Phi_3} \Phi\left(\beta_1^\top x_1 + \rho_{12}\sigma, \frac{\beta_2^\top x_2-\ln \alpha}{\sigma} + \sigma,\beta_3^\top x_3 + \rho_{23}\sigma ; \rho\right) -\frac{\alpha}{\Phi_3}\Phi_{123},
\end{equation}

with $\Phi_{123}=\Phi(\beta_1^\top x_1,(\beta_2^\top x_2-\ln \alpha)/\sigma,\beta_3^\top x_3;\rho)$,

\begin{equation}
(\#eq:eq54)
\mbox{E}[y|x] = \frac{e^{\beta_2^\top x_2 + \frac{1}{2}\sigma^2}}{\Phi_3} \Phi\left(\frac{\beta_2^\top x_2-\ln \alpha}{\sigma} + \sigma,\beta_3^\top x_3 + \rho_{23}\sigma; \rho_{23}\right)-\frac{\alpha}{\Phi_3}\Phi_{23},
\end{equation}

\begin{equation}
(\#eq:eq55)
\mbox{E}[y|x] = e^{\beta_2^\top x_2 + \frac{1}{2}\sigma^2} \Phi\left(\beta_1^\top x_1 + \rho_{12}\sigma, \frac{\beta_2^\top x_2-\ln \alpha}{\sigma}+\sigma;\rho_{12}\right) -\alpha\Phi_{12},
\end{equation}

\begin{equation}
(\#eq:eq56)
\mbox{E}[y|x] = e^{\beta_2^\top x_2 + \frac{1}{2}\sigma^2} \Phi\left(\frac{\beta_2^\top x_2-\ln \alpha}{\sigma}+\sigma\right) -\alpha\Phi_2.
\end{equation}

The marginal effects we consider refer to the change in the value of predictors \@ref(eq:eq47) and \@ref(eq:eq48) generated by a ceteris paribus increase of a model covariate, the values of the other model covariates being kept unchanged, as in a laboratory experiment.
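Before turning to the definition of marginal effects, note that the simplest of these closed forms, \@ref(eq:eq52), can be checked numerically against a brute-force simulation; the parameter values in the following sketch are purely illustrative.

```{r eval = FALSE}
# Best mean-square error predictor E[y | x] for the univariate model with
# identity transformation: Phi(xb2 / sigma) * xb2 + sigma * phi(xb2 / sigma)
ey_univ <- function(xb2, sigma) pnorm(xb2 / sigma) * xb2 + sigma * dnorm(xb2 / sigma)
xb2 <- 0.5 ; sigma <- 2
ey_univ(xb2, sigma)
mean(pmax(xb2 + sigma * rnorm(1e6), 0))   # simulated counterpart
```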
The definition of the marginal effect of a `mhurdle` covariate $x_0$ on a dependent variable of the model depends on the scale of measurement of $x_0$. When $x_0$ is a continuous quantitative variable measured either on an interval or on a ratio scale,^[A variable is said to be quantitative when the distance between two states of the variable can be measured, as for a physical magnitude, using the usual notion of distance between real numbers. This condition is fulfilled when a unit of measure, namely the distance between two particular states of the variable, can be used as a standard to express the distance between any pair of states as multiples of the unit of measure. This scale of measurement is said to be a cardinal or interval scale if the origin of the scale, numerically coded by zero, is arbitrary; it is said to be a measure or ratio scale if the origin of the scale is unique and not arbitrary; finally, it is said to be a count scale if its origin and unit of measure are both unique and not arbitrary.] the marginal effect of $x_0$ on predictor \@ref(eq:eq47) or \@ref(eq:eq48) is the first derivative of the predictor with respect to $x_0$, namely the rate of change in the value of the predictor generated by a ceteris paribus infinitesimal increase of $x_0$. Similarly, if $x_0$ is a count variable measuring the size of a population of individuals or objects which can only vary by integers, the marginal effect will be defined as the finite change in the predictor induced by a ceteris paribus unit increase of $x_0$.

When $x_0$ is a qualitative explanatory variable that can be in one and only one of $K\geq2$ different mutually exclusive and exhaustive situations called attributes, it is usual to code it as a column vector of $K$ dummy variables $x_0=[d_k]_{k=1,...,K}$, taking values $d_h=1$ and $d_k=0,\forall k\neq h$ when the observed attribute of $x_0$ is the one indexed by $h$. In this case, considering $x_0$ as the first (vector) element of one of the `mhurdle` covariate vectors $x_i, i=1,...,4$, the first term of the corresponding linear combination $\beta_i^\top x_i$ will take the value $\beta_{ih}$. Yet, this dummy coding of a nominal explanatory variable^[A qualitative variable is said to be nominal if its "scale of measure" allows one to determine whether two observed states are equal or different; it is said to be partially ordered if some of its attributes may also be ordered; finally, it is said to be ordered if all its attributes may be ordered.] has an annoying drawback: because the $K$ dummy variables are linearly related by the constraint $\sum_{k=1}^K d_k = 1$, the parameters $\beta_{ik},k=1,...,K$ are not identified, and therefore not estimable, when vector $\beta_i$ contains an intercept term. To avoid this inconvenience, it is customary to suppress one of the $K$ dummies from a $K$-attribute dummy coding, the discarded attribute playing the role of a "reference attribute" with respect to which the impact on $\beta_i^\top x_i$ of each of the $K-1$ remaining attributes is measured.
Indeed, substituting in $\sum_{k=1}^K \beta_{ik} d_k$ the dummy variable $d_K$ of the reference attribute $K$ by $d_K=1-\sum_{k=1}^{K-1} d_k$, we obtain $\sum_{k=1}^{K-1} (\beta_{ik}-\beta_{iK})d_k+\beta_{iK}$, showing that the impact on $\beta_i^\top x_i$ of attribute $k$ is measured by the parameter $b_k=\beta_{ik}-\beta_{iK}$, called the contrast with respect to attribute $K$, while the impact coefficient $\beta_{iK}$ enters vector $\beta_i$ as an intercept, or is incorporated into an already existing intercept of this vector. Consequently, the marginal effect of a change of $x_0$ from attribute $h$ to attribute $k$ is defined as the finite change in the predictor induced by a ceteris paribus change of the first term of $\beta_i^\top x_i$ from contrast $b_h$ to contrast $b_k$, with obviously $b_K=0$.

When $x_0$ is a qualitative variable whose $K$ attributes may be partially or totally ordered according to some meaningful criterion (for example the educational level, measured and ranked according to the highest diploma awarded), it is possible to use a dummy coding allowing one to assess the degree of monotonicity between a ceteris paribus marginal increment of $x_0$, from one attribute level to the next higher one, and its impact on the dependent variable predictor. To this purpose, the impact generated by the marginal increment of $x_0$ from its attribute of rank $k-1$ to its attribute of immediately higher rank $k$ is measured by the following marginal contrasts: $\gamma_k=\beta_{ik}-\beta_{i,k-1}$, $k=2,...,K$. Hence, substituting in $\sum_{k=1}^K \beta_{ik} d_k$ the parameters $\beta_{ik}$ by the solution of these recurrence equations, namely $\beta_{ik}=\beta_{i1}+\sum_{h=2}^k \gamma_h$, leads to the following recoding of the $K$-attribute dummy coding in terms of marginal contrasts: $\beta_{i1}+\sum_{k=2}^K \gamma_k \delta_k$, with $\delta_k=\sum_{h=k}^K d_h$. Consequently, the marginal effect of a change of $x_0$ from the ordered attribute $k-1$ to attribute $k$ is defined as the finite change in the predictor induced by a ceteris paribus increase of $\beta_i^\top x_i$ by the marginal contrast $\gamma_k$.

So far, point predictions and marginal effects for a `mhurdle` model have been derived by assuming that the values of the model parameters are known. However, to be operational, these formulas must be quantified by means of an estimate of these theoretical parameters. Plugging the maximum likelihood estimate $\hat{\theta}$ of the model parameters $\theta$, presented in section 3.1, into a column vector $h(\theta)$ of best mean-square error predictors \@ref(eq:eq47), \@ref(eq:eq48), and/or of marginal effects of these predictors, provides an implied maximum likelihood estimator $h(\hat{\theta})$. This estimator is, like $\hat{\theta}$, a consistent and asymptotically normal estimator of $h(\theta)$, whose asymptotic distribution can be derived from a linearization of $h(\hat{\theta})$ around $\theta$ known as the "delta method", which leads to the following asymptotic variance-covariance matrix of $h(\hat{\theta})$: $(1/n)(\partial h/\partial\theta^\top) I_A(\theta)^{-1} (\partial h/\partial\theta^\top)^\top$.
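The sketch below illustrates how such a delta method computation can be carried out in practice for a scalar function $h$ of the estimated parameters, here, purely as an illustration, the ratio of the second to the third coefficient of a hypothetical fitted model `fit`. It only assumes that the fitted object provides the standard `coef` and `vcov` extractors, and it approximates the gradient by central finite differences rather than deriving it analytically.

```{r label = deltamethod, eval = FALSE}
## delta method for h(theta) = theta[2] / theta[3], as an illustration,
## for a hypothetical fitted model object `fit`
theta <- coef(fit)                      # ML estimate of the parameters
V <- vcov(fit)                          # estimated variance-covariance matrix
h <- function(th) unname(th[2] / th[3])
eps <- 1e-6                             # step for central finite differences
grad <- sapply(seq_along(theta), function(i) {
    up <- dn <- theta
    up[i] <- up[i] + eps
    dn[i] <- dn[i] - eps
    (h(up) - h(dn)) / (2 * eps)
})
se_h <- sqrt(drop(t(grad) %*% V %*% grad))
c(estimate = h(theta), std.error = se_h)
```

The same recipe applies to a vector-valued $h$, the gradient being replaced by the Jacobian matrix.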
A final comment concerns how to present estimates of marginal effects that are representative of the available sample. To this purpose, it is customary to report either the marginal effects computed at the sample average of the model covariates, or the sample average of the marginal effects computed at the covariate values of each individual in the sample. The first solution must be avoided when some model covariates are qualitative, because the sample average of a dummy-coded qualitative variable is no longer a qualitative variable but a vector of proportions, without a meaningful interpretation for a representative individual. The second solution provides an informative measure of location of the distribution of a marginal effect over the sample, but it is not enough to characterize, by itself, the a priori unknown profile of this distribution. Accordingly, it should be supplemented by other indicators characterizing the typical shape of the distribution, in particular the bounds of the distribution range, the standard deviation and the three quartiles.

# Software rationale

```{r label = setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE, message = FALSE,
                      warning = FALSE)
options(width = 65)
```

To illustrate the use of `mhurdle`, we use one of the surveys conducted by the Bureau of Labor Statistics of the U.S. Department of Labor, called the "Interview Survey". Data on all expenditures of 25813 households are collected on a quarterly basis. They are reported on an annual basis, in thousands of USD, and divided by the number of consumption units^[Obtained by counting the first adult of the household as one, every subsequent adult as 0.7 and every other person aged under 18 as 0.5.]. The micro-data files are publicly available on the website of the Bureau of Labor Statistics, and may be downloaded and used without permission. We use a small subset of 1000 randomly selected households.

```{r eval = TRUE, label = Data}
library("mhurdle")
data("Interview", package = "mhurdle")
```

The covariates are:

- `income`: the annual net income by consumption unit,
- `smsa`: does the household live in a SMSA (`yes` or `no`),
- `age`: the age of the reference person of the household,
- `educ`: the number of years of education of the reference person of the household,
- `sex`: the sex of the reference person of the household (`male` or `female`),
- `size`: the number of persons in the household,
- `month`: the month of the interview (between 1 and 12).

Among the numerous goods available in this data set, we choose the "fees and admissions" good (denoted `shows` in the data set), which is of particular interest because the three competing hurdles are a priori relevant to explain censored observations.

```{r }
round(c(mean = mean(Interview$shows),
        "% of 0" = mean(Interview$shows == 0),
        "mean pos" = mean(Interview$shows) / mean(Interview$shows > 0)), 2)
```

The consumption of this good is highly censored, as less than a third of the households in the sample actually purchased it during the survey. The average consumption is \$120 a year, or \$380 for those who consume.

## Estimation

The estimation is performed using the `mhurdle` function, which has the following arguments:

- `formula`: a formula describing the model to estimate. It should have between two and four parts on the right-hand side specifying, in the first part, the good selection equation covariates, in the second part, the desired consumption equation covariates, in the third part, the purchasing equation covariates and, in the fourth part, the covariates of the variance equation,
- `data`: a data frame containing the observations of the variables present in the formula,
- `subset`, `weights`, `na.action`: arguments passed on to the `model.frame` function in order to extract the data suitable for the model; these arguments are present in the `lm` function and in most estimation functions,
- `start`: the starting values, which can be provided by the user; this may be necessary for complicated and highly parametrized models,
- `dist`: this argument indicates the functional form of the desired consumption equation, which may be either log-normal `"ln"` (the default), normal `"n"`, Box-Cox `"bc"` or inverse hyperbolic sine `"ihs"`,
- `corr`: this boolean argument indicates whether the disturbances of the different equations are correlated; the default value is `FALSE`,
- `h2`: this boolean argument indicates whether the second hurdle is effective or not,
- `robust`: if `TRUE`, transformations of some parameters are used, so that they lie in the required range (positive values for the standard deviation and for the position parameter, between -1 and +1 for the coefficients of correlation),
- `...`: further arguments that are passed to the optimisation function `maxLik`.

For the sake of clarity, we denote the estimated models using a five-character name. The first character is a capital letter indicating the distribution used (either `N` for a normal distribution or `L` for a log-normal distribution), the second, third and fourth characters are 1 if the first, second and third hurdles are present and 0 if they are not, and the fifth character is either `D` or `I`, depending on whether a dependent or an independent model is estimated. To illustrate the use of `mhurdle`, we start with single-equation tobit models, using a normal and a log-normal specification.

```{r label = tobit}
N010I <- mhurdle(shows ~ 0 | linc + smsa + age + educ + size,
                 data = Interview, h2 = TRUE, dist = "n", method = "bhhh")
L010I <- update(N010I, dist = "ln")
```

We then consider the two other potential hurdles, namely the selection process and the infrequency of purchase. We only keep `linc` as a covariate for the desired consumption and the other covariates (`smsa`, `age`, `educ` and `size`) are used either in the first part (selection model) or in the third part (infrequency of purchase model) of the formula. Note that the `dist` argument is not provided (so that the default log-normal distribution is chosen) and that the lack of resources hurdle is either maintained or removed by setting the `h2` argument to `TRUE` or `FALSE`, respectively.

```{r label = selection}
L100D <- mhurdle(shows ~ smsa + age + educ + size | linc, data = Interview,
                 h2 = FALSE, dist = "ln", corr = TRUE, method = "bhhh",
                 finalHessian = TRUE)
L100D2 <- update(L100D, start = coef(L100D), robust = FALSE)
L110D <- update(L100D, h2 = TRUE)
L110D2 <- update(L110D, start = coef(L110D), robust = FALSE)
L001D <- mhurdle(shows ~ 0 | linc | smsa + age + educ + size, data = Interview,
                 h2 = FALSE, corr = TRUE, method = "bhhh", finalHessian = TRUE)
L001D2 <- update(L001D, start = coef(L001D), robust = FALSE)
L011D <- update(L001D, h2 = TRUE)
L011D2 <- update(L011D, start = coef(L011D), robust = FALSE)
```

Finally, we estimate three-equation models. The four socio-economic covariates that were previously included either in the first or in the third part of the formula are now split between these two equations. More precisely, we assume that `educ` and `size` are the determinants of the selection process and that `smsa` and `age` explain the frequency of purchase.
The first model is a double-hurdle model; the selection and purchasing mechanisms are effective (note that, in this case, the `h2` argument is set to `FALSE`). The last two are triple-hurdle models, with correlated and uncorrelated errors:

```{r label = threeeq}
L101D <- mhurdle(shows ~ educ + size | linc | smsa + age, data = Interview,
                 h2 = FALSE, method = "bhhh", corr = TRUE, finalHessian = TRUE)
L101D2 <- update(L101D, start = coef(L101D), robust = FALSE)
L111D <- update(L101D, h2 = TRUE)
L111D2 <- update(L111D, start = coef(L111D), robust = FALSE)
L111I <- update(L111D, corr = FALSE)
L111I2 <- update(L111I, start = coef(L111I), robust = FALSE)
```

The results are presented in table \@ref(tab:estimations).

```{r label = estimations, echo = FALSE, eval = TRUE, results = 'asis'}
models <- list(L110D = L110D, L011D = L011D, L101D = L101D,
               L111D = L111D, L111I = L111I)
coefs <- unique(names(Reduce("c", lapply(models, coef))))
o_h1 <- grep("h1", coefs, value = TRUE)
o_h2 <- grep("h2", coefs, value = TRUE)
o_h3 <- grep("h3", coefs, value = TRUE)
n_h1 <- substr(o_h1, 4, 100)
n_h2 <- paste(substr(o_h2, 4, 100), " ", sep = "")
n_h3 <- paste(substr(o_h3, 4, 100), " ", sep = "")
mps <- c(n_h1, n_h2, n_h3, "$\\sigma$", "$\\alpha$",
         "$\\rho_{12}$", "$\\rho_{13}$", "$\\rho_{23}$")
names(mps) <- c(o_h1, o_h2, o_h3, "sd.sd", "pos", "corr12", "corr13", "corr23")
gm <- tibble::tribble(
    ~raw, ~clean, ~fmt,
    "nobs", "$N$", 0,
    "dpar", "degree of par.", 1,
    "nobs.zero", "$N_o$", 0,
    "nobs.pos", "$N_+$", 0,
    "R2.zero", "$R^2_o$", 3,
    "R2.pos", "$R^2_+$", 3,
    "logLik", "log likelihood", 1)
v <- modelsummary::msummary(models, statistic = NULL,
                            estimate = "{estimate} ({std.error})",
                            fmt = 2, title = "Multiple hurdle regressions",
                            coef_map = mps, gof_map = gm, escape = FALSE)
## v <- kableExtra::pack_rows(v, index = c(selection = 5, consumption = 2, expense = 5,
##                                         "Miscellanous parameters" = 5,
##                                         "Goodness of fit measures" = 5))
v
```

## Tests

The position parameter ($\alpha$ in table \@ref(tab:estimations)) allows one to test directly the hypothesis that the second hurdle is operative, using a Wald-like test. For all the models, the hypothesis that this coefficient is equal to 0 is strongly rejected. Note also that the only model which imposes that the second hurdle is not operative (denoted `L101D` in the table) has the lowest log-likelihood. The coefficient of correlation is highly significant for the double-hurdle selection model, but not for the double-hurdle p-tobit model. For the most general three-equation model, only the coefficient of correlation between the first two equations is significant. Formal tests for these two nested models can be performed using a traditional likelihood ratio test:

```{r label = loglik, results = 'hide'}
library("lmtest")
lrtest(L111D, L111I)
```

or a Vuong test, which does not assume that one of the competing models is the "true" model:

```{r label = vuongnested, results = 'hide'}
vuongtest(L111D, L111I, type = "nested")
```

```{r label = hvuongnested, include = FALSE}
pvlr <- round(lrtest(L111D, L111I)[["Pr(>Chisq)"]][2], 3)
pvvuong <- vuongtest(L111D, L111I, type = "nested")$p.value
```

Neither test rejects the independence hypothesis at the 5\% level, especially the Vuong test, the p-values being respectively equal to `r pvlr` and `r pvvuong`. For non-nested models, one can use the Vuong test.
For example, for the double-hurdle selection and p-tobit models, the test is performed using:

```{r label = vuongdh, eval = FALSE}
vuongtest(L110D, L011D)
```

```{r label = hvuongunnested, include = FALSE}
vt <- vuongtest(L110D, L011D)
stat <- as.numeric(round(vt$statistic, 3))
pvaltrad <- round(vt$p.value, 3)
```

The statistic is `r stat`. Its negative sign indicates that the p-tobit model fits the data better than the selection model. The difference is significant at the 5\% level, as the (two-sided) p-value equals `r pvaltrad`, which leads to the conclusion that the p-tobit model is significantly better than the selection model. The parsimony principle leads to the selection of the p-tobit double-hurdle model, as it is not significantly worse than the more general three-hurdle model.

# Bibliography