Title: | Fitting Linear and Generalized Linear Models to Large Data Sets |
---|---|
Description: | Fitting linear models and generalized linear models to large data sets by updating algorithms. |
Authors: | Marco Enea [aut, cre], Ronen Meiri [ctb] (on behalf of DMWay Analytics LTD), Tomer Kalimi [ctb] (on behalf of DMWay Analytics LTD) |
Maintainer: | Marco Enea <[email protected]> |
License: | GPL |
Version: | 0.3-4 |
Built: | 2024-11-16 03:33:08 UTC |
Source: | https://github.com/MarcoEnea/speedglm |
Fits linear and generalized linear models to large data sets. For data loaded in R memory the fitting is usually fast, especially if R is linked against an optimized BLAS. For data sets larger than the available R memory, the fitting is performed by an updating algorithm.
Package: | speedglm |
Type: | Package |
Version: | 0.3-4 |
Date: | 2022-02-19 |
Depends: | Matrix, stats, MASS |
License: | GPL |
LazyLoad: | yes |
Marco Enea <[email protected]>, with contributions from Ronen Meiri and Tomer Kalimi (on behalf of DMWay Analytics LTD).
Maintainer: Marco Enea <[email protected]>
These are advised to be used for speedlm and speedglm models fitted on moderately large data sets. It is also possible to use the stepAIC function from package MASS.
## S3 method for class 'speedlm'
add1(object, scope, scale = 0, test = c("none", "Chisq", "F"),
     x = NULL, k = 2, data, ...)
## S3 method for class 'speedlm'
drop1(object, scope, scale = 0, all.cols = TRUE,
      test = c("none", "Chisq", "F"), k = 2, data, ...)
## S3 method for class 'speedlm'
extractAIC(fit, scale = 0, k = 2, ...)
## S3 method for class 'speedlm'
nobs(object, use.fallback = FALSE, ...)
## S3 method for class 'speedglm'
add1(object, scope, scale = 0, test = c("none", "Rao", "LRT", "Chisq", "F"),
     x = NULL, k = 2, weights = rep(1, object$n), ...)
## S3 method for class 'speedglm'
drop1(object, scope, scale = 0, test = c("none", "Rao", "LRT", "Chisq", "F"),
      k = 2, weights = rep(1, object$n), ...)
## S3 method for class 'speedglm'
extractAIC(fit, scale = 0, k = 2, ...)
## S3 method for class 'speedglm'
nobs(object, use.fallback = FALSE, ...)
object |
a speedlm or speedglm object. |
fit |
a speedlm or speedglm object. |
scope |
see add1 from package stats. |
scale |
see add1 from package stats. |
all.cols |
see drop1 from package stats. |
test |
see add1 from package stats. |
x |
see add1 from package stats. |
k |
see add1 from package stats. |
data |
the data the model was previously fitted to. If not provided, they will be searched for in the parent environment. |
weights |
the model weights, if provided in the speedglm object. |
use.fallback |
logical. Should fallback methods be used to try to guess the value? |
... |
further optional arguments. |
It is possible to use the functions step() and stepAIC() on both speedlm and speedglm objects, but not on objects fitted using updateWithMoreData(). Note that these functions have been only lightly tested and should be used with care.
Ronen Meiri and Marco Enea
## Not run: 
set.seed(10)
n <- 1000
k <- 3
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
beta <- c(0.05, 0.5, 0.8, 1.3, 1.8)
y <- drop(tcrossprod(cbind(1, x, x[, 2] * x[, 3]), t(beta))) + rnorm(n, , 0.2)
colnames(x) <- c("s1", "s2", "s3")
da <- data.frame(y, x)
m0 <- speedlm(y ~ 1, data = da, model = TRUE, y = TRUE)
m0.1 <- add1(m0, scope = ~ (s1 + s2 + s3)^2, data = da)
m1 <- step(m0, scope = ~ (s1 + s2 + s3)^3)
m1
m2 <- speedglm(y ~ 1, data = da, model = TRUE, y = TRUE)
m2.1 <- add1(m2, scope = ~ (s1 + s2 + s3)^2, data = da)
m3 <- step(m2, scope = ~ (s1 + s2 + s3)^3)
m3
## End(Not run)
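As noted above, stepAIC from package MASS can also be used; a minimal sketch (not run), reusing m0 from the example and assuming MASS is installed:

## Not run: 
library(MASS)
# forward/backward selection on the speedlm fit m0; k = 2 gives the classical AIC
m4 <- stepAIC(m0, scope = ~ (s1 + s2 + s3)^2, k = 2)
summary(m4)
## End(Not run)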
Utility functions for least squares estimation in large data sets.
control(B, symmetric = TRUE, tol.values = 1e-7, tol.vectors = 1e-7,
        out.B = TRUE, method = c("eigen", "Cholesky"))
cp(X, w = NULL, row.chunk = NULL, sparse = FALSE)
is.sparse(X, sparselim = .9, camp = .05)
B |
a square matrix. |
symmetric |
logical. Is B symmetric? |
tol.values |
the tolerance below which eigenvalues are considered equal to zero. |
tol.vectors |
the tolerance below which eigenvector components are considered equal to zero. |
out.B |
logical. Should the matrix B be returned? |
method |
the method used to check for singularity. The default is "eigen", which makes an eigendecomposition of X'X. The "Cholesky" method is faster than "eigen" and does not use tolerances, but "eigen" appears more stable for suitable tolerance values. |
X |
the model matrix. |
w |
a vector of weights. |
sparse |
logical. Is X sparse? |
sparselim |
a real number in the interval [0, 1], indicating the minimal proportion of zeroes in the data matrix X for X to be considered sparse. |
eigendec |
logical. Do you want to investigate the rank of X? You may set it to FALSE if you are sure that X is full rank. |
row.chunk |
an integer indicating the number of rows in each of the first g-1 blocks into which X is partitioned. |
camp |
the proportion of elements of X to sample for the sparsity survey. |
Function control makes an eigendecomposition of B according to the established tolerance values. Function cp computes the cross-product X'X by partitioning X into row blocks. When an optimized BLAS, such as ATLAS, is not installed, the function is an attempt to speed up the calculation and avoid overflows with medium-large data sets loaded in R memory. Results depend on the processor type. Good results were obtained, for example, on an AMD Athlon dual core with 1.5 Gb of RAM by setting row.chunk to some value less than 1000. Try the example below, changing the matrix size and the value of row.chunk. If the matrix X is sparse, it will have class "dgCMatrix" (the package Matrix is required) and the cross-product will be computed without partitioning. However, good performance is usually obtained only with a very high proportion of zeroes. Function is.sparse makes a quick survey of a sample of the elements of X to estimate the proportion of zeroes.
For the function control, a list with the following elements:
XTX |
the matrix product B without singularities (if any). |
rank |
the rank of B. |
pivot |
an ordered set of column indices of B; if B is singular, the last indices mark the columns found to be linear combinations of the preceding ones. |
For the function cp:
new.B |
the matrix product X'X (weighted, if w is given). |
For the function is.sparse:
sparse |
a logical value indicating whether the sample proportion of zeroes is greater than sparselim. |
Marco Enea
#### example 1.
n <- 100000
k <- 100
x <- round(matrix(rnorm(n * k), n, k), digits = 4)
y <- rnorm(n)
# if an optimized BLAS is not installed, depending on processor type, cp() may be
# faster than crossprod() for large matrices.
system.time(a1 <- crossprod(x))
system.time(a2 <- cp(x, , row.chunk = 500))
all.equal(a1, a2)

#### example 2.1.
n <- 100000
k <- 10
x <- matrix(rnorm(n * k), n, k)
x[, 2] <- x[, 1] + 2 * x[, 3]   # x has rank 9
y <- rnorm(n)
# estimation by least squares
A <- function() {
  A1 <- control(crossprod(x))
  ok <- A1$pivot[1:A1$rank]
  as.vector(solve(A1$XTX, crossprod(x[, ok], y)))
}
# estimation by QR decomposition
B <- function() {
  B1 <- qr(x)
  qr.solve(x[, B1$pivot[1:B1$rank]], y)
}
system.time(a <- A())
system.time(b <- B())
all.equal(a, b)

### example 2.2
x <- matrix(c(1:5, (1:5)^2), 5, 2)
x <- cbind(x, x[, 1] + 3 * x[, 2])
m <- crossprod(x)
qr(m)$rank                            # is 2, as it should be
control(m, method = "eigen")$rank     # is 2, as it should be
control(m, method = "Cholesky")$rank  # is wrong

### example 3.
n <- 10000
fat1 <- gl(20, 500)
y <- rnorm(n)
da <- data.frame(y, fat1)
m <- model.matrix(y ~ factor(fat1), data = da)
is.sparse(m)
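The w argument of cp() is not exercised in the examples above. A minimal sketch of the weighted cross-product (not from the original help page); the block-wise result should match the standard computation up to numerical error:

#### example 4 (weighted cross-product)
n <- 10000
k <- 5
x <- matrix(rnorm(n * k), n, k)
w <- runif(n)                     # positive weights
a1 <- crossprod(x * sqrt(w))      # X'WX computed the standard way
a2 <- cp(x, w, row.chunk = 1000)  # X'WX computed block-wise by cp()
all.equal(a1, a2)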
The data1 dataset has 100 rows and 4 columns.
data(data1)
A data frame with 100 observations on the following 4 variables.
y
a gamma-distributed response variable
fat1
a four-level factor
x1
a numeric covariate
x2
a numeric covariate
This is a toy dataset used to show how the function shglm works.
data(data1)
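To take a quick look at the data, a minimal sketch:

data(data1)
str(data1)        # 100 obs. of 4 variables: y, fat1, x1, x2
summary(data1$y)  # the gamma-distributed response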
Predict method for the class 'speedglm'. The method is currently under construction, but some functionalities are available.
## S3 method for class 'speedglm'
predict(object, newdata, type = c("link", "response"),
        na.action = na.pass, ...)
object |
an object of class 'speedglm'. |
newdata |
an optional data frame with new data, or the original data. |
type |
the type of prediction: on the scale of the linear predictor ("link") or of the response ("response"). |
na.action |
function determining what should be done with missing values in newdata. |
... |
further optional arguments. |
If newdata is omitted, predictions are based on the data used for the fit, and only if the argument fitted was previously set to TRUE in the speedglm object. Currently the method does not work for the function shglm.
pred |
a vector of predictions. |
Tomer Kalimi and Marco Enea
set.seed(10)
y <- rgamma(20, 1.5, 1)
x <- round(matrix(rnorm(20 * 2), 20, 2), digits = 3)
colnames(x) <- c("s1", "s2")
da <- data.frame(y, x)
mod <- speedglm(y ~ s1 + s2, data = da, family = Gamma(log), fitted = TRUE)
predict(mod)
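Predictions on new data can be obtained on either scale allowed by the type argument; a minimal sketch continuing the example above, with made-up covariate values:

nd <- data.frame(s1 = c(-0.5, 0, 0.5), s2 = c(0.2, 0, -0.2))
predict(mod, newdata = nd, type = "link")      # scale of the linear predictor
predict(mod, newdata = nd, type = "response")  # scale of the response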
Predict method for the class 'speedlm'. The method is currently under construction, but some functionalities are available.
## S3 method for class 'speedlm'
predict(object, newdata, na.action = na.pass, ...)
object |
an object of class 'speedlm'. |
newdata |
An optional data frame with new data or the original data. |
na.action |
function determining what should be done with missing values in newdata. |
... |
further optional arguments |
If newdata is omitted, predictions are based on the data used for the fit, and only if the argument fitted was previously set to TRUE in the speedlm object.
predictor |
a vector of predictions. |
Tomer Kalimi and Marco Enea
set.seed(10)
x <- round(matrix(rnorm(20 * 3), 20, 3), digits = 3)
colnames(x) <- c("y", "s1", "s2")
da <- as.data.frame(x)
mod <- speedlm(y ~ s1 + s2, data = da, fitted = TRUE)
predict(mod)
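As for the GLM method, newdata may be supplied; a minimal sketch continuing the example above, with made-up covariate values:

nd <- data.frame(s1 = c(-1, 0, 1), s2 = c(0.5, 0, -0.5))
predict(mod, newdata = nd)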
speedglm and speedglm.wfit fit GLMs to medium-large data sets, that is, those storable in R memory. The highest performance, in terms of computation time, is obtained when R is linked against an optimized BLAS, such as ATLAS. The function shglm is for data sets stored in a file of size greater than the available memory, and takes as an argument a function to manipulate connections.
## S3 method for class 'data.frame':
speedglm(formula, data, family = gaussian(), weights = NULL, start = NULL,
         etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2,
         sparse = NULL, set.default = list(), trace = FALSE,
         method = c('eigen', 'Cholesky', 'qr'),
         model = FALSE, y = FALSE, fitted = FALSE, ...)

## S3 method for class 'matrix':
speedglm.wfit(y, X, intercept = TRUE, weights = NULL, row.chunk = NULL,
              family = gaussian(), start = NULL, etastart = NULL,
              mustart = NULL, offset = NULL, acc = 1e-08, maxit = 25, k = 2,
              sparselim = .9, camp = .01, eigendec = TRUE, tol.values = 1e-7,
              tol.vectors = 1e-7, tol.solve = .Machine$double.eps,
              sparse = NULL, method = c('eigen', 'Cholesky', 'qr'),
              trace = FALSE, ...)

## S3 method for class 'function':
shglm(formula, datafun, family = gaussian(), weights.fo = NULL, start = NULL,
      etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2,
      chunksize = 5000, sparse = NULL, trace = FALSE, all.levels = FALSE,
      set.default = list(), ...)
Most arguments are the same as those of glm or bigglm, with some differences.
formula |
the same as in glm. |
data |
a data frame. |
datafun |
a function which uses connections. See the example below. |
family |
the same as in glm. |
start |
the same as in glm. |
weights |
the same as in glm. |
weights.fo |
weights for the response. They must be specified as a formula (see the example below). |
etastart |
the same as in glm. |
mustart |
the same as in glm. |
offset |
the same as in glm. |
intercept |
a logical value which indicates if an intercept is used. |
X |
the model matrix. |
y |
the response variable. |
maxit |
the same as in glm. |
k |
numeric, the penalty per parameter to be used; the default k = 2 is the classical AIC. |
trace |
logical. Do you want to be informed about the model estimation progress? |
sparse |
logical. Is the model matrix sparse? By default it is NULL, in which case a quick sample survey is made. |
chunksize |
an integer indicating the number of rows of the data file to read at a time. |
all.levels |
logical. Are all the factor levels present in each data chunk? |
set.default |
a list in which to specify the parameters below. |
sparselim |
a real in the interval [0, 1]. It indicates the minimal proportion of zeroes in the data matrix X in order to consider X as sparse. |
camp |
see the function is.sparse. |
eigendec |
logical. Do you want to check the rank of X? You may set it to FALSE if you are sure that X is full rank. |
row.chunk |
an integer, see the function cp for details. |
acc |
tolerance to be used for the estimation. |
tol.solve |
see the function solve. |
tol.values |
see the function control. |
tol.vectors |
see the function control. |
method |
the method chosen to check for singularity. |
model |
logical. If TRUE the model frame will be returned. |
fitted |
logical. If TRUE the fitted values will be returned. |
... |
further optional arguments. |
The function shglm works like biglm, but it checks for singularity and does not impose restrictions on factors. Since during the IWLS estimation shglm repeatedly accesses the data file stored, for example, on the hard disk, the estimation time could be very long. Unlike glm or biglm, the functions of class 'speedglm' do not use the QR decomposition; instead, they directly solve the normal equations within the Iteratively Reweighted Least Squares (IWLS) algorithm. The memory size of an object of class 'speedglm' is O(p^2), where p is the number of covariates, unless one or more of the arguments model, y and fitted are set to TRUE. If an optimized BLAS is not installed, an attempt to speed up calculations may be made by setting row.chunk to some value, usually less than 1000, in set.default. See the function cp for details. If the model matrix is (very) sparse, the package Matrix could be used. Note that if method 'qr' is chosen, the QR decomposition is not applied to the matrix X, as in lm, but to X'WX.
coefficients |
the estimated coefficients. |
logLik |
the log likelihood of the fitted model. |
iter |
the number of iterations of IWLS used. |
tol |
the maximal value of tolerance reached. |
convergence |
a logical value which indicates if convergence was reached. |
family |
the family object used. |
link |
the link function used. |
df |
the degrees of freedom of the model. |
XTX |
the product X'X (weighted, if the case). |
dispersion |
the estimated dispersion parameter of the model. |
ok |
the set of column indices of the model matrix where the model has been fitted. |
rank |
the rank of the model matrix. |
RSS |
the estimated residual sum of squares of the fitted model. |
aic |
the estimated Akaike Information Criterion. |
sparse |
a logical value which indicates if the model matrix is sparse. |
deviance |
the estimated deviance of the fitted model. |
nulldf |
the degrees of freedom of the null model. |
nulldev |
the estimated deviance of the null model. |
ngoodobs |
the number of non-zero weighted observations. |
n |
the number of observations. |
intercept |
a logical value which indicates if an intercept has been used. |
terms |
the terms object used. |
call |
the matched call. |
model |
Either NULL or, if model was set to TRUE, the model frame. |
y |
Either NULL or, if y was set to TRUE, the response variable. |
linear.predictors |
Either NULL or, if fitted was set to TRUE, the linear predictors. |
offset |
the model offset. |
All the above functions return an object of class 'speedglm'. In the current package version, the arguments start, mustart and etastart of function shglm have been disabled; they will be restored in a future release.
Marco Enea; Ronen Meiri contributed the 'qr' method.
Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with Large Data Sets in R. In the book of short papers of the conference "Statistical Methods for the analysis of large data-sets", Italian Statistical Society, Chieti-Pescara, 23-25 September 2009, 411-414.
Bates, D. (2009) Comparing Least Square Calculations. Technical report.
Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. R package version 0.7. https://CRAN.R-project.org/package=biglm.
## Not run: 
# The following comparison among glm(), bigglm() and speedglm() cannot be
# considered rigorous or exhaustive; it only gives an idea of the computation
# time. It may take a long time.
require(biglm)
n <- 50000
k <- 80
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
system.time(m1 <- glm(fo, data = da, family = Gamma(log)))
system.time(m2 <- bigglm(fo, data = da, family = Gamma(log)))
system.time(m3 <- speedglm(fo, data = da, family = Gamma(log)))
# You may also try speedglm when R is linked against an optimized BLAS,
# otherwise try to run the following function. On some computers, it is
# faster for large data sets.
system.time(m4 <- speedglm(fo, data = da, family = Gamma(log),
                           set.default = list(row.chunk = 1000)))
## End(Not run)

##################

## Not run: 
## An example of a function using a connection to an out-of-memory file.
## This is a slightly modified version of the function from bigglm's help page.
make.data <- function(filename, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

# data1 is a small toy dataset
data(data1)
write.table(data1, "data1.txt", row.names = FALSE, col.names = FALSE)
rm(data1)

da <- make.data("data1.txt", chunksize = 50,
                col.names = c("y", "fat1", "x1", "x2"))
# Caution! make sure to close the connection once you have run command #1
da(reset = TRUE)   #1: opens the connection to "data1.txt"
da(reset = FALSE)  #2: reads the first 50 rows (out of 100) of the dataset
da(reset = FALSE)  #3: reads the second 50 rows (out of 100) of the dataset
da(reset = FALSE)  #4: is NULL: this latter command closes the connection
require(biglm)
# fat1 is a factor with four levels
b1 <- shglm(y ~ factor(fat1) + x1, weights = ~ I(x2^2), datafun = da,
            family = Gamma(log))
b2 <- bigglm(y ~ factor(fat1) + x1, weights = ~ I(x2^2), data = da,
             family = Gamma(log))
summary(b1)
summary(b2)
file.remove("data1.txt")
## End(Not run)
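All the examples above use a Gamma family; the call is analogous for other families. A minimal logistic-regression sketch on simulated data (made up for illustration):

set.seed(10)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))  # Bernoulli response
db <- data.frame(y, x1, x2)
mb <- speedglm(y ~ x1 + x2, data = db, family = binomial(logit))
summary(mb)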
The functions of class 'speedlm' may speed up the fitting of linear models to large data sets. High performance can be obtained especially if R is linked against an optimized BLAS, such as ATLAS.
# S3 method of class 'data.frame'
speedlm(formula, data, weights = NULL, offset = NULL, sparse = NULL,
        set.default = list(), method = c('eigen', 'Cholesky', 'qr'),
        model = FALSE, y = FALSE, fitted = FALSE, subset = NULL, ...)

# S3 method of class 'matrix'
speedlm.fit(y, X, intercept = FALSE, offset = NULL, row.chunk = NULL,
            sparselim = 0.9, camp = 0.01, eigendec = TRUE,
            tol.solve = .Machine$double.eps, sparse = NULL,
            tol.values = 1e-07, tol.vectors = 1e-07,
            method = c('eigen', 'Cholesky', 'qr'), ...)

speedlm.wfit(y, X, w, intercept = FALSE, offset = NULL, row.chunk = NULL,
             sparselim = 0.9, camp = 0.01, eigendec = TRUE,
             tol.solve = .Machine$double.eps, sparse = NULL,
             tol.values = 1e-07, tol.vectors = 1e-07,
             method = c('eigen', 'Cholesky', 'qr'), ...)

# S3 method of class 'speedlm' (object) and 'data.frame' (data)
## S3 method for class 'speedlm'
update(object, formula, data, add = TRUE, evaluate = TRUE,
       subset = NULL, offset = NULL, weights = NULL, ...)

# S3 method of class 'speedlm' (object) and 'data.frame' (data)
updateWithMoreData(object, data, weights = NULL, offset = NULL, sparse = NULL,
                   all.levels = FALSE, set.default = list(), subset = NULL, ...)
Most arguments are the same as those of lm, with some differences.
formula |
the same as in lm. |
data |
the same as in lm. |
weights |
the same as in lm. |
w |
a vector of weights. |
intercept |
a logical value which indicates if an intercept is used. |
offset |
the same as in lm. |
X |
the model matrix. |
y |
the response variable. |
sparse |
logical. Is the model matrix sparse? By default it is NULL, in which case a quick sample survey is made. |
set.default |
a list in which to specify the parameters to pass to the functions cp, control and is.sparse. |
sparselim |
a value in the interval [0, 1]. It indicates the minimal proportion of zeroes, in the model matrix X, in order to consider X as sparse. |
camp |
see the function is.sparse. |
eigendec |
logical. Do you want to investigate the rank of X? You may set it to FALSE if you are sure that X is full rank. |
row.chunk |
an integer, see the function cp for details. |
tol.solve |
see function solve. |
tol.values |
see function control. |
tol.vectors |
see function control. |
method |
see function control. |
object |
an object of class 'speedlm'. |
all.levels |
logical. Are all the levels of the factors (if any) present in each data chunk? If so, set it to TRUE. |
model |
logical. Should the model frame be returned? |
fitted |
logical. Should the fitted values be returned? |
subset |
the same as in lm. |
add |
logical. Are additional data coming from a new chunk provided? |
evaluate |
logical. If TRUE, evaluate the new call; otherwise return the call. |
... |
further optional arguments. |
Unlike lm or biglm, the functions of class 'speedlm' do not use the QR decomposition; they directly solve the normal equations. The most recent versions of the package also include the method 'qr'; however, this QR decomposition is not applied directly to the matrix X, but to X'WX. In some extreme cases this might give problems of numerical stability, but it may take advantage of an optimized BLAS. The memory size of an object of class 'speedlm' is O(p^2), where p is the number of covariates. If an optimized BLAS library is not installed, an attempt to speed up calculations may be made by setting row.chunk to some value, usually less than 1000, in set.default. See the function cp for details. Factors are permitted without limitations. In the most recent versions, the function update.speedlm is a wrapper which calls either updateWithMoreData (the new name of the old update.speedlm, for additional data chunks) or update from package stats.
coefficients |
the estimated coefficients. |
df.residual |
the residual degrees of freedom. |
XTX |
the product X'X (weighted, if the case). |
A |
the product X'X (weighted, if the case) not checked for singularity. |
Xy |
the product X'y (weighted, if the case). |
ok |
the set of column indices of the model matrix where the model has been fitted. |
rank |
the numeric rank of the fitted linear model. |
pivot |
see the function control. |
RSS |
the estimated residual sums of squares of the fitted model. |
sparse |
a logical value indicating if the model matrix is sparse. |
deviance |
the estimated deviance of the fitted model. |
weights |
the weights used in the last updating. |
zero.w |
the number of observations with zero weight. |
nobs |
the number of observations. |
nvar |
the number of independent variables. |
terms |
the terms object used. |
intercept |
a logical value which indicates if an intercept has been used. |
call |
the matched call. |
model |
Either NULL or the model frame, if model was set to TRUE. |
y |
Either NULL or the response variable, if y was set to TRUE. |
fitted.values |
Either NULL or the fitted values, if fitted was set to TRUE. |
offset |
the model offset. |
... |
other values necessary to update the estimation. |
All the above functions return an object of class 'speedlm'.
Marco Enea, with a contribution from Ronen Meiri.
Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with Large Data Sets in R. In the book of short papers of the conference "Statistical Methods for the analysis of large data-sets", Italian Statistical Society, Chieti-Pescara, 23-25 September 2009, 411-414.
Klotz, J.H. (1995) Updating Simple Linear Regression. Statistica Sinica, 5, 399-403.
Bates, D. (2009) Comparing Least Square Calculations. Technical report.
Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. R package version 0.7. https://CRAN.R-project.org/package=biglm.
summary.speedlm, speedglm, lm, and biglm
## Not run: 
n <- 1000
k <- 3
y <- rnorm(n)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- c("s1", "s2", "s3")
da <- data.frame(y, x)
do1 <- da[1:300, ]
do2 <- da[301:700, ]
do3 <- da[701:1000, ]
m1 <- speedlm(y ~ s1 + s2 + s3, data = do1)
m1 <- update(m1, data = do2)
m1 <- update(m1, data = do3)
m2 <- lm(y ~ s1 + s2 + s3, data = da)
summary(m1)
summary(m2)
## End(Not run)

## Not run: 
# as before, but recursively
make.data <- function(filename, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE, header = TRUE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, header = header, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

write.table(da, "da.txt", col.names = TRUE, row.names = FALSE, quote = FALSE)
x.names <- c("s1", "s2", "s3")
dat <- make.data("da.txt", chunksize = 300, col.names = c("y", x.names))
dat(reset = TRUE)
da2 <- dat(reset = FALSE)
# the first model runs on the first 300 rows
m3 <- speedlm(y ~ s1 + s2 + s3, data = da2)
# the last three models run on the subsequent 300, 300 and 100 rows, respectively
for (i in 1:3) {
  da2 <- dat(reset = FALSE, header = FALSE)
  m3 <- update(m3, data = da2, add = TRUE)
}
all.equal(coef(m1), coef(m3))
file.remove("da.txt")
## End(Not run)
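speedlm.fit can also be called directly on a numeric response and a model matrix. A minimal sketch (not from the original help page): the intercept column is built by hand and the intercept argument is set accordingly.

set.seed(10)
n <- 500
X <- cbind(1, matrix(rnorm(n * 2), n, 2))  # model matrix with an intercept column
y <- drop(X %*% c(1, 0.5, -0.3) + rnorm(n))
fit <- speedlm.fit(y = y, X = X, intercept = TRUE)
fit$coefficients                           # 'coefficients' is a documented component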
summary method for the class 'speedglm'.
## S3 method for class 'speedglm'
summary(object, correlation = FALSE, ...)
## S3 method for class 'speedglm'
coef(object, ...)
## S3 method for class 'speedglm'
vcov(object, ...)
## S3 method for class 'speedglm'
logLik(object, ...)
## S3 method for class 'speedglm'
AIC(object, ...)
object |
an object of class 'speedglm'. |
correlation |
logical. Do you want to print the correlation matrix? By default it is FALSE. |
... |
further optional arguments |
coefficients |
the matrix of coefficients, standard errors, z-statistics and two-sided p-values. |
df.residual |
the component from object. |
df.null |
the component from object. |
null.deviance |
the component from object. |
deviance |
the component from object. |
family |
the component from object. |
call |
the component from object. |
AIC |
the Akaike Information Criterion. |
RSS |
the residual sum of squares. |
correlation |
(only if correlation = TRUE) the correlation matrix of the estimated coefficients. |
logLik |
the log-likelihood value. |
rank |
the component from object. |
dispersion |
the estimated dispersion parameter of the fitted model. |
convergence |
the component from object. |
iter |
the component from object. |
tol |
the component from object. |
Marco Enea
n <- 1000
k <- 5
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
m4 <- speedglm(fo, data = da, family = Gamma(log))
summary(m4)
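The remaining accessors listed in the usage section work on the same fit; a short sketch continuing the example above:

coef(m4)    # estimated coefficients
vcov(m4)    # variance-covariance matrix of the estimates
logLik(m4)  # log-likelihood of the fitted model
AIC(m4)     # Akaike Information Criterion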
summary method for the class 'speedlm'.
## S3 method for class 'speedlm'
summary(object, correlation = FALSE, ...)
## S3 method for class 'speedlm'
coef(object, ...)
## S3 method for class 'speedlm'
vcov(object, ...)
## S3 method for class 'speedlm'
logLik(object, ...)
## S3 method for class 'speedlm'
AIC(object, ..., k = 2)
object |
an object of class 'speedlm'. |
correlation |
logical. Do you want to print the correlation matrix? By default it is FALSE. |
k |
numeric, the penalty per parameter to be used; the default k = 2 is the classical AIC. |
... |
further optional arguments |
coefficients |
the matrix of coefficients, standard errors, t-statistics and two-sided p-values. |
rdf |
the residual degrees of freedom of the fitted model, a component from object. |
call |
the component from object. |
r.squared |
R^2, the fraction of variance explained by the model. |
adj.r.squared |
the "adjusted" R^2 statistic, penalizing for higher p. |
fstatistic |
(for models including non-intercept terms) a 3-vector with the value of the F-statistic with its numerator and denominator degrees of freedom. |
f.pvalue |
p-value of the F-statistic. |
RSS |
Residual sum of squares. |
var.res |
estimated variance of residuals. |
rank |
the component from object. |
correlation |
(only if correlation = TRUE) the correlation matrix of the estimated coefficients. |
... |
the results from the other functions listed in the usage section. |
Marco Enea
y <- rnorm(100, 1.5, 1)
x <- round(matrix(rnorm(200), 100, 2), digits = 3)
colnames(x) <- c("s1", "s2")
da <- data.frame(y, x)
m <- speedlm(y ~ s1 + s2, da)
summary(m)
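The other accessors, including the k argument of the AIC method shown in the usage above, can be applied to the same fit; a short sketch continuing the example:

coef(m)               # estimated coefficients
vcov(m)               # variance-covariance matrix of the estimates
AIC(m)                # classical AIC (k = 2)
AIC(m, k = log(100))  # BIC-type penalty, using log(n) with n = 100 observations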