Title: | Supervised Component Generalized Linear Regression |
---|---|
Description: | An extension of the Fisher Scoring Algorithm to combine PLS regression with GLM estimation in the multivariate context. Covariates can also be grouped in themes. |
Authors: | Guillaume Cornu [aut, cre] |
Maintainer: | Guillaume Cornu <[email protected]> |
License: | CeCILL-2 | GPL-2 |
Version: | 3.0.9000 |
Built: | 2025-02-24 16:43:47 UTC |
Source: | https://github.com/scnext/scglr |
Auxiliary function for scglr
fitting used to
construct a convergence control argument.
critConvergence(tol = 1e-06, maxit = 50)
critConvergence(tol = 1e-06, maxit = 50)
tol |
positive convergence threshold. |
maxit |
integer, maximum number of iterations. |
a list containing elements named as the arguments.
Parameters used to choose what to plot and how. These parameters are given to plot.SCGLR
and pairs.SCGLR
.
Parameter name can be abbreviated (e.g. pred.col
will be understood as predictors.color).
Options can be set globally using options("plot.SCGLR")
. It will then provide default values that can be further overriden by giving explicit parameter value.
parameter name | type (default value). Description. |
title |
string (NULL). Main title of plot (override built-in). |
labels.auto |
logical (TRUE). Should covariate or predictor labels be aligned with arrows. |
labels.offset |
numeric (0.01). Offset by which labels should be moved from tip of arrows. |
labels.size |
numeric (1). Relative size for labels. Use it to globally alter label size. |
expand |
numeric (1). Expand factor for windows size. Use it for example to make room for clipped labels. |
threshold |
numeric. All covariates and/or predictors whose sum of square correlations with the two components of the plane lower than this threshold will be ignored. |
observations |
logical (FALSE). Should we draw observations. |
observations.size |
numeric (1). Point size. |
observations.color |
character ("black"). Point color. |
observations.alpha |
numeric (1). Point transparency. |
observations.factor |
logical (FALSE). Paint observations according to factor (specify factor). |
predictors |
logical or array of characters or comma separated string (FALSE). Should we draw predictors and optionally which one (TRUE means all). |
predictors.color |
string ("red"). Base color used to draw predictors. |
predictors.alpha |
numeric (1). Overall transparency for predictors (0 is transparent, 1 is opaque). |
predictors.arrows |
logical (TRUE). Should we draw arrows for predictors. |
predictors.arrows.color |
string (predictors.color). Specific color for predictor arrows. |
predictors.arrows.alpha |
numeric (predictors.alpha). Transparency for predictor arrows. |
predictors.labels |
logical (TRUE). Should we draw labels for predictors. |
predictors.labels.color |
string (predictors.color). Specific color for predictor labels. |
predictors.labels.alpha |
numeric (predictors.alpha). Transparency for predictor labels. |
predictors.labels.size |
numeric (labels.size). Specific size for predictor labels. |
predictors.labels.auto |
logical (labels.auto). Should predictor labels be aligned with arrows. |
covariates |
logical or array of characters or comma separated string (TRUE). Should we draw covariates and optionally which one (TRUE means all). |
covariates.color |
string ("black"). Base color used to draw covariates. |
covariates.alpha |
numeric (1). Overall transparency for covariates (0 is transparent, 1 is opaque). |
covariates.arrows |
logical (TRUE). Should we draw arrows for covariates. |
covariates.arrows.color |
string (covariates.color). Specific color for covariate arrows. |
covariates.arrows.alpha |
numeric (covariates.alpha). Transparency for covariate arrows. |
covariates.labels |
logical (TRUE). Should we draw labels for predictors. |
covariates.labels.color |
string (covariates.color). Specific color for predictor labels. |
covariates.labels.alpha |
numeric (covariates.alpha). Transparency for covariate labels. |
covariates.labels.size |
numeric (labels.size). Specific size for covariate labels. |
covariates.labels.auto |
logical (labels.auto). Should covariate labels be aligned with arrows. |
factor |
logical or character (FALSE). Should we draw a factor chosen among additionnal variables (TRUE mean first one). |
factor.points |
logical (TRUE). Should symbol be drawn for factors. |
factor.points.size |
numeric (4). Symbol size. |
factor.points.shape |
numeric (13). Point shape. |
factor.labels |
logical (TRUE). Should factor labels be drawn. |
factor.labels.color |
string ("black"). Color used to draw labels. |
factor.labels.size |
numeric (labels.size). Specific size for factor labels. |
## Not run: # setting parameters plot(genus.scglr) plot(genus.scglr, covariates=c("evi_1","pluvio_11")) plot(genus.scglr, covariates="evi_1,pluvio_11") plot(genus.scglr, predictors=TRUE) plot(genus.scglr, predictors=TRUE, pred.arrows=FALSE) # setting global style options(plot.SCGLR=list(predictors=TRUE, pred.arrows=FALSE)) plot(genus.scglr) # setting custom style myStyle <- list(predictors=TRUE, pred.arrows=FALSE) plot(genus.scglr, style=myStyle) ## End(Not run)
## Not run: # setting parameters plot(genus.scglr) plot(genus.scglr, covariates=c("evi_1","pluvio_11")) plot(genus.scglr, covariates="evi_1,pluvio_11") plot(genus.scglr, predictors=TRUE) plot(genus.scglr, predictors=TRUE, pred.arrows=FALSE) # setting global style options(plot.SCGLR=list(predictors=TRUE, pred.arrows=FALSE)) plot(genus.scglr) # setting custom style myStyle <- list(predictors=TRUE, pred.arrows=FALSE) plot(genus.scglr, style=myStyle) ## End(Not run)
dataGen gives the abundance of 8 common tree genera in the tropical moist forest of the Congo-Basin and 58 geo-referenced variables on 2615 8-by-8 km plots (observations). Each plot's data was obtained by aggregating the data measured on a variable number of previously sampled 0.5 ha sub-plots. Geo-referenced environmental variables were used to describe the physical factors as well as vegetation characteristics. On each plot, 34 physical factors were used pertaining the description of topography, geology, rainfall... Vegetation is characterized through 16-days enhanced vegetation index (EVI) data.
Y |
matrix giving the abundance of 8 common genera (matrix size = 2615*8). |
X |
matrix of 56 geo-referenced environmental variables (matrix size = 2615*56). |
AX |
matrix of 2 additionnal explanatory variables (geology and anthropic interference). |
offset |
sampled area. |
random |
forest concession id number. |
The use of this dataset for publication must make reference to the CoForChange project.
CoForChange project
S. Gourlet-Fleury et al. (2009–2014) CoForChange project: http://coforchange.cirad.fr/
C. Garcia et al. (2013–2015) CoForTips project: https://www.cofortips.org/
Genus gives the abundance of 27 common tree genera in the tropical moist forest of the Congo-Basin and 40 geo-referenced environmental variables on one thousand 8 by 8 km plots (observations). Each plot's data was obtained by aggregating the data measured on a variable number of previously sampled 0.5 ha sub-plots. Geo-referenced environmental variables were used to describe the physical factors as well as vegetation characteristics. 14 physical factors were used pertaining the description of topography, geology and rainfall of each plot. Vegetation is characterized through 16-days enhanced vegetation index (EVI) data.
gen1 to gen27 |
abundance of the 27 common genera. |
altitude |
above-sea level in meters. |
pluvio_yr |
mean annual rainfall. |
forest |
classified into seven classes. |
pluvio_1 to pluvio_12 |
monthly rainfalls. |
geology |
5-level geological substrate. |
evi_1 to evi_23 |
16-days enhanced vegetation indexes. |
lon and lat |
position of the plot centers. |
surface |
sampled area. |
The use of this dataset for publication must make reference to the CoForChange project.
CoForChange project
S. Gourlet-Fleury et al. (2009–2014) CoForChange project: http://coforchange.cirad.fr/
C. Garcia et al. (2013–2015) CoForTips project: https://www.cofortips.org/
genus2 gives the abundance of 15 common tree genera in the tropical moist forest of the Congo-Basin and 46 geo-referenced environmental variables on one thousand 8 by 8 km plots (observations). Each plot's data was obtained by aggregating the data measured on a variable number of previously sampled 0.5 ha sub-plots. Geo-referenced environmental variables were used to describe the physical factors as well as vegetation characteristics. 23 physical factors were used pertaining the description of topography, geology and rainfall of each plot. Vegetation is characterized through 16-days enhanced vegetation index (EVI) data.
gen1 to gen15 |
abundance of 15 common genera. |
evi_1 to evi_23 |
16-days enhanced vegetation indexes. |
MIR and NIR |
Middle-Infrared and Near-Infrared channels. |
pluvio_an |
mean annual rainfall. |
pluvio_1 to pluvio_12 |
monthly rainfalls. |
altitude |
above-sea level in meters. |
mois_sec_50 and mois_sec_50 |
??? |
CWD, awd and mcwd |
??? |
wetness |
??? |
center_x and center_y |
longitude and latitude of the plot centers. |
geology |
5-level geological substrate. |
inventory |
forest concession id number. |
surface |
sampled area. |
The use of this dataset for publication must make reference to the CoForChange project.
CoForChange project
S. Gourlet-Fleury et al. (2009–2014) CoForChange project: http://coforchange.cirad.fr/
C. Garcia et al. (2013–2015) CoForTips project: https://www.cofortips.org/
Calculates the components to predict all the response variables.
kCompRand( Y, family, size = NULL, X, AX = NULL, random, loffset = NULL, k, init.sigma = rep(1, ncol(Y)), init.comp = c("pca", "pls"), method = methodSR("vpi", l = 4, s = 1/2, maxiter = 1000, epsilon = 10^-6, bailout = 1000) )
kCompRand( Y, family, size = NULL, X, AX = NULL, random, loffset = NULL, k, init.sigma = rep(1, ncol(Y)), init.comp = c("pca", "pls"), method = methodSR("vpi", l = 4, s = 1/2, maxiter = 1000, epsilon = 10^-6, bailout = 1000) )
Y |
the matrix of random responses |
family |
a vector of character of the same length as the number of response variables: "bernoulli", "binomial", "poisson" or "gaussian" is allowed. |
size |
describes the number of trials for the binomial dependent variables: a (number of observations * number of binomial response variables) matrix is expected. |
X |
the matrix of the standardized explanatory variables |
AX |
the matrix of the additional explanatory variables |
random |
the vector giving the group of each unit (factor) |
loffset |
a matrix of size (number of observations * number of Poisson response variables) giving the log of the offset associated with each observation |
k |
number of components, default is one |
init.sigma |
a vector giving the initial values of the variance components, default is rep(1, ncol(Y)) |
init.comp |
a character describing how the components (loadings-vectors) are initialized in the PING algorithm: "pca" or "pls" is allowed. |
method |
Regularization criterion type: object of class "method.SCGLR"
built by function |
an object of the SCGLR class.
## Not run: library(SCGLR) # load sample data data(dataGen) k.opt=4 s.opt=0.1 l.opt=10 withRandom.opt=kCompRand(Y=dataGen$Y, family=rep("poisson", ncol(dataGen$Y)), X=dataGen$X, AX=dataGen$AX, random=dataGen$random, loffset=log(dataGen$offset), k=k.opt, init.sigma = rep(1, ncol(dataGen$Y)), init.comp = "pca", method=methodSR("vpi", l=l.opt, s=s.opt, maxiter=1000, epsilon=10^-6, bailout=1000)) plot(withRandom.opt, pred=TRUE, plane=c(1,2), title="Component plane (1,2)", threshold=0.7, covariates.alpha=0.4, predictors.labels.size=6) ## End(Not run)
## Not run: library(SCGLR) # load sample data data(dataGen) k.opt=4 s.opt=0.1 l.opt=10 withRandom.opt=kCompRand(Y=dataGen$Y, family=rep("poisson", ncol(dataGen$Y)), X=dataGen$X, AX=dataGen$AX, random=dataGen$random, loffset=log(dataGen$offset), k=k.opt, init.sigma = rep(1, ncol(dataGen$Y)), init.comp = "pca", method=methodSR("vpi", l=l.opt, s=s.opt, maxiter=1000, epsilon=10^-6, bailout=1000)) plot(withRandom.opt, pred=TRUE, plane=c(1,2), title="Component plane (1,2)", threshold=0.7, covariates.alpha=0.4, predictors.labels.size=6) ## End(Not run)
Regularization criterion types
methodSR( phi = "vpi", l = 1, s = 1/2, maxiter = 1000, epsilon = 1e-06, bailout = 10 )
methodSR( phi = "vpi", l = 1, s = 1/2, maxiter = 1000, epsilon = 1e-06, bailout = 10 )
phi |
character string describing structural relevance used in the regularization process. Allowed values are "vpi" for Variable Powered Inertia and "cv" for Component Variance. Default to "vpi". |
l |
is an integer argument (>1) tuning the importance of variable bundle locality. |
s |
is a numeric argument (in [0,1]) tuning the strength of structural relevance with respect to goodness of fit. |
maxiter |
integer for maximum number of iterations of |
epsilon |
positive convergence threshold |
bailout |
integer argument |
Helper function for building multivariate scglr formula.
NOTE: Interactions involving factors are not allowed for now.
For interactions between two quantitative variables, use I(x*y)
as usual.
multivariateFormula(Y, X = NULL, ..., A = NULL, additional = NULL, data = NULL)
multivariateFormula(Y, X = NULL, ..., A = NULL, additional = NULL, data = NULL)
Y |
a formula or a vector of character containing the names of the dependent variables. |
X |
a vector of character containing the names of the covariates (X) involved in the components or a list of it. |
... |
additional groups of covariates (theme) |
A |
a vector of character containing the names of the additional covariates. |
additional |
logical (if A is not provided, should we consider last X to be additional covariates) |
data |
a data frame against which formula's variable will be checked |
If Y is given as a formula, groups of covariates must be separated by |
(pipes). To declare last
group as additional covariates, one can use ||
(double pipes) as last group separator or set
additional
parameter as TRUE
.
an object of class MultivariateFormula, Formula, formula
with additional attributes: Y, X, A, X_vars, Y_vars,A_vars,XA_vars, YXA_vars, additional
## Not run: # build multivariate formula ny <- c("y1","y2") nx1 <- c("x11","x12") nx2 <- c("x21","x22") nadd <- c("add1","add2") form <- multivariateFormula(ny,nx1,nx2,nadd,additional=T) form2 <- multivariateFormula(ny,list(nx1,nx2,nadd),additional=T) form3 <- multivariateFormula(ny,list(nx1,nx2),A=nadd) form4 <- multivariateFormula(y1+y2~x11+x12|x21+x22||add1+add2) # Print formulas form form2 form3 ## End(Not run)
## Not run: # build multivariate formula ny <- c("y1","y2") nx1 <- c("x11","x12") nx2 <- c("x21","x22") nadd <- c("add1","add2") form <- multivariateFormula(ny,nx1,nx2,nadd,additional=T) form2 <- multivariateFormula(ny,list(nx1,nx2,nadd),additional=T) form3 <- multivariateFormula(ny,list(nx1,nx2),A=nadd) form4 <- multivariateFormula(y1+y2~x11+x12|x21+x22||add1+add2) # Print formulas form form2 form3 ## End(Not run)
Pairwise scglr plot on components
## S3 method for class 'SCGLR' pairs(x, ..., nrow = NULL, ncol = NULL, components = NULL)
## S3 method for class 'SCGLR' pairs(x, ..., nrow = NULL, ncol = NULL, components = NULL)
x |
object of class 'SCGLR', usually a result of running |
... |
optionally, further arguments forwarded to |
nrow |
number of rows of the grid layout. |
ncol |
number of columns of the grid layout. |
components |
vector of integers selecting components to plot (default is all components). |
an object of class ggplot
.
For pairs application see examples in plot.SCGLR
SCGLR generic plot
## S3 method for class 'SCGLR' plot(x, ..., style = getOption("plot.SCGLR"), plane = c(1, 2))
## S3 method for class 'SCGLR' plot(x, ..., style = getOption("plot.SCGLR"), plane = c(1, 2))
x |
an object from SCGLR class. |
... |
optional arguments (see customize). |
style |
named list of values used to customize the plot (see customize) |
plane |
a size-2 vector (or string with separator) indicating which components are plotted (eg: c(1,2) or "1,2" or "1/2"). |
an object of class ggplot
.
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),c("geology")) # define family fam <- rep("poisson",length(ny)) genus.scglr <- scglr(formula=form,data = genus,family=fam, K=4, offset=genus$surface) summary(genus.scglr) barplot(genus.scglr) plot(genus.scglr) plot(genus.scglr, predictors=TRUE, factor=TRUE) pairs(genus.scglr) ## End(Not run)
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),c("geology")) # define family fam <- rep("poisson",length(ny)) genus.scglr <- scglr(formula=form,data = genus,family=fam, K=4, offset=genus$surface) summary(genus.scglr) barplot(genus.scglr) plot(genus.scglr) plot(genus.scglr, predictors=TRUE, factor=TRUE) pairs(genus.scglr) ## End(Not run)
SCGLR generic plot for themes
## S3 method for class 'SCGLRTHM' plot(x, ...)
## S3 method for class 'SCGLRTHM' plot(x, ...)
x |
object of class 'SCGLRTHM', usually a result of running |
... |
see SCGLR plot method |
an object of class ggplot
.
Calculates the components to predict all the dependent variables.
scglr( formula, data, family, K = 1, size = NULL, weights = NULL, offset = NULL, subset = NULL, na.action = na.omit, crit = list(), method = methodSR() )
scglr( formula, data, family, K = 1, size = NULL, weights = NULL, offset = NULL, subset = NULL, na.action = na.omit, crit = list(), method = methodSR() )
formula |
an object of class |
data |
a data frame to be modeled. |
family |
a vector of character of the same length as the number of dependent variables: "bernoulli", "binomial", "poisson" or "gaussian" is allowed. |
K |
number of components, default is one. |
size |
describes the number of trials for the binomial dependent variables. A (number of statistical units * number of binomial dependent variables) matrix is expected. |
weights |
weights on individuals (not available for now) |
offset |
used for the poisson dependent variables. A vector or a matrix of size: number of observations * number of Poisson dependent variables is expected. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain NAs. The default is set to |
crit |
a list of two elements : maxit and tol, describing respectively the maximum number of iterations and the tolerance convergence criterion for the Fisher scoring algorithm. Default is set to 50 and 10e-6 respectively. |
method |
structural relevance criterion. Object of class "method.SCGLR"
built by |
an object of the SCGLR class.
The function summary
(i.e., summary.SCGLR
) can be used to obtain or print a summary of the results.
An object of class "SCGLR
" is a list containing following components:
u |
matrix of size (number of regressors * number of components), contains the component-loadings, i.e. the coefficients of the regressors in the linear combination giving each component. |
comp |
matrix of size (number of statistical units * number of components) having the components as column vectors. |
compr |
matrix of size (number of statistical units * number of components) having the standardized components as column vectors. |
gamma |
list of length number of dependant variables. Each element is a matrix of coefficients, standard errors, z-values and p-values. |
beta |
matrix of size (number of regressors + 1 (intercept) * number of dependent variables), contains the coefficients of the regression on the original regressors X. |
lin.pred |
data.frame of size (number of statistical units * number of dependent variables), the fitted linear predictor. |
xFactors |
data.frame containing the nominal regressors. |
xNumeric |
data.frame containing the quantitative regressors. |
inertia |
matrix of size (number of components * 2), contains the percentage and cumulative percentage of the overall regressors' variance, captured by each component. |
logLik |
vector of length (number of dependent variables), gives the likelihood of the model of each |
deviance.null |
vector of length (number of dependent variables), gives the deviance of the null model of each |
deviance.residual |
vector of length (number of dependent variables), gives the deviance of the model of each |
Bry X., Trottier C., Verron T. and Mortier F. (2013) Supervised Component Generalized Linear Regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47-60.
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology")) # define family fam <- rep("poisson",length(ny)) genus.scglr <- scglr(formula=form,data = genus,family=fam, K=4, offset=genus$surface) summary(genus.scglr) ## End(Not run)
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology")) # define family fam <- rep("poisson",length(ny)) genus.scglr <- scglr(formula=form,data = genus,family=fam, K=4, offset=genus$surface) summary(genus.scglr) ## End(Not run)
Function that fits and selects the number of component by cross-validation.
scglrCrossVal( formula, data, family, K = 1, folds = 10, type = "mspe", size = NULL, offset = NULL, na.action = na.omit, crit = list(), method = methodSR(), nfolds, mc.cores )
scglrCrossVal( formula, data, family, K = 1, folds = 10, type = "mspe", size = NULL, offset = NULL, na.action = na.omit, crit = list(), method = methodSR(), nfolds, mc.cores )
formula |
an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
the data frame to be modeled. |
family |
a vector of character of length q specifying the distributions of the responses. Bernoulli, binomial, poisson and gaussian are allowed. |
K |
number of components, default is one. |
folds |
number of folds, default is 10. Although folds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. folds can also be provided as a vector (same length as data) of fold identifiers. |
type |
loss function to use for cross-validation. Currently six options are available depending on whether the responses are of the same distribution family. If the responses are all bernoulli distributed, then the prediction performance may be measured through the area under the ROC curve: type = "auc" In any other case one can choose among the following five options ("likelihood","aic","aicc","bic","mspe"). |
size |
specifies the number of trials of the binomial variables included in the model. A (n*qb) matrix is expected for qb binomial variables. |
offset |
used for the poisson dependent variables. A vector or a matrix of size: number of observations * number of Poisson dependent variables is expected. |
na.action |
a function which indicates what should happen when the data contain NAs. The default is set to the |
crit |
a list of two elements : maxit and tol, describing respectively the maximum number of iterations and the tolerance convergence criterion for the Fisher scoring algorithm. Default is set to 50 and 10e-6 respectively. |
method |
Regularization criterion type. Object of class "method.SCGLR"
built by |
nfolds |
deprecated. Use |
mc.cores |
deprecated |
a matrix containing the criterion values for each response (rows) and each number of components (columns).
Bry X., Trottier C., Verron T. and Mortier F. (2013) Supervised Component Generalized Linear Regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47-60.
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology")) # define family fam <- rep("poisson",length(ny)) # cross validation genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12, offset=genus$surface) # find best K mean.crit <- colMeans(log(genus.cv)) #plot(mean.crit, type="l") ## End(Not run)
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx <- n[-grep("^gen",n)] # X <- remaining names # remove "geology" and "surface" from nx # as surface is offset and we want to use geology as additional covariate nx <-nx[!nx%in%c("geology","surface")] # build multivariate formula # we also add "lat*lon" as computed covariate form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology")) # define family fam <- rep("poisson",length(ny)) # cross validation genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12, offset=genus$surface) # find best K mean.crit <- colMeans(log(genus.cv)) #plot(mean.crit, type="l") ## End(Not run)
Calculates the components to predict all the dependent variables.
scglrTheme( formula, data, H, family, size = NULL, weights = NULL, offset = NULL, subset = NULL, na.action = na.omit, crit = list(), method = methodSR(), st = FALSE )
scglrTheme( formula, data, H, family, size = NULL, weights = NULL, offset = NULL, subset = NULL, na.action = na.omit, crit = list(), method = methodSR(), st = FALSE )
formula |
an object of class " |
data |
data frame. |
H |
vector of R integer. Number of components to keep for each theme |
family |
a vector of character of the same length as the number of dependent variables: "bernoulli", "binomial", "poisson" or "gaussian" is allowed. |
size |
describes the number of trials for the binomial dependent variables. A (number of statistical units * number of binomial dependent variables) matrix is expected. |
weights |
weights on individuals (not available for now) |
offset |
used for the poisson dependent variables. A vector or a matrix of size: number of observations * number of Poisson dependent variables is expected. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain NAs. The default is set to |
crit |
a list of two elements : maxit and tol, describing respectively the maximum number of iterations and the tolerance convergence criterion for the Fisher scoring algorithm. Default is set to 50 and 10e-6 respectively. |
method |
structural relevance criterion. Object of class "method.SCGLR"
built by |
st |
logical (FALSE) theme build and fit order. TRUE means random, FALSE means sequential (T1, ..., Tr) |
Models for theme are specified symbolically.
A model as the form response ~ terms
where response
is the numeric response vector and terms is a series of R themes composed of
predictors.
Themes are separated by "|" (pipe) and are composed. ... Y1+Y2+... ~ X11+X12+...+X1_ | X21+X22+... | ...+X1_+... | A1+A2+...
See multivariateFormula
.
a list of SCGLRTHM class. Each element is a SCGLR object
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) n <-n[!n%in%c("geology","surface","lon","lat","forest","altitude")] ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx1 <- n[grep("^evi",n)] # X <- remaining names nx2 <- n[-c(grep("^evi",n),grep("^gen",n))] form <- multivariateFormula(ny,nx1,nx2,A=c("geology")) fam <- rep("poisson",length(ny)) testthm <-scglrTheme(form,data=genus,H=c(2,2),family=fam,offset = genus$surface) plot(testthm) ## End(Not run)
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) n <-n[!n%in%c("geology","surface","lon","lat","forest","altitude")] ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx1 <- n[grep("^evi",n)] # X <- remaining names nx2 <- n[-c(grep("^evi",n),grep("^gen",n))] form <- multivariateFormula(ny,nx1,nx2,A=c("geology")) fam <- rep("poisson",length(ny)) testthm <-scglrTheme(form,data=genus,H=c(2,2),family=fam,offset = genus$surface) plot(testthm) ## End(Not run)
Perform component selection by cross-validation backward approach
scglrThemeBackward( formula, data, H, family, size = NULL, weights = NULL, offset = NULL, na.action = na.omit, crit = list(), method = methodSR(), folds = 10, type = "mspe", st = FALSE )
scglrThemeBackward( formula, data, H, family, size = NULL, weights = NULL, offset = NULL, na.action = na.omit, crit = list(), method = methodSR(), folds = 10, type = "mspe", st = FALSE )
formula |
an object of class " |
data |
data frame. |
H |
vector of R integer. Number of components to keep for each theme |
family |
a vector of character of the same length as the number of dependent variables: "bernoulli", "binomial", "poisson" or "gaussian" is allowed. |
size |
describes the number of trials for the binomial dependent variables. A (number of statistical units * number of binomial dependent variables) matrix is expected. |
weights |
weights on individuals (not available for now) |
offset |
used for the poisson dependent variables. A vector or a matrix of size: number of observations * number of Poisson dependent variables is expected. |
na.action |
a function which indicates what should happen when the data contain NAs. The default is set to |
crit |
a list of two elements : maxit and tol, describing respectively the maximum number of iterations and the tolerance convergence criterion for the Fisher scoring algorithm. Default is set to 50 and 10e-6 respectively. |
method |
structural relevance criterion. Object of class "method.SCGLR"
built by |
folds |
number of folds - default is 10. Although folds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is folds=2. folds can also be provided as a vector (same length as data) of fold identifiers. |
type |
loss function to use for cross-validation. Currently six options are available depending on whether the responses are of the same distribution family. If the responses are all bernoulli distributed, then the prediction performance may be measured through the area under the ROC curve: type = "auc" In any other case one can choose among the following five options ("likelihood","aic","aicc","bic","mspe"). |
st |
logical (FALSE) theme build and fit order. TRUE means random, FALSE means sequential (T1, ..., Tr) |
Models for theme are specified symbolically.
A model as the form response ~ terms
where response
is the numeric response vector and terms is a series of R themes composed of predictors.
Themes are separated by "|" (pipe) and are composed.
y1 + y2 + ... ~ x11 + x12 + ... + x1_ | x21 + x22 + ... | ... + x1_ + ... | a1 + a2 + ...
See multivariateFormula
.
a list containing the path followed along the selection process, the associated mean square predictor error and the best configuration.
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) n <- n[!n %in% c("geology","surface","lon","lat","forest","altitude")] ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx1 <- n[grep("^evi",n)] # X <- remaining names nx2 <- n[-c(grep("^evi",n),grep("^gen",n))] form <- multivariateFormula(ny,nx1,nx2,A=c("geology")) fam <- rep("poisson",length(ny)) testcv <- scglrThemeBackward(form,data=genus,H=c(2,2),family=fam,offset = genus$surface,folds=3) # Cross-validation pathway testcv$H_path # Plot criterion plot(testcv$cv_path) # Best combination testcv$H_best ## End(Not run)
## Not run: library(SCGLR) # load sample data data(genus) # get variable names from dataset n <- names(genus) n <- n[!n %in% c("geology","surface","lon","lat","forest","altitude")] ny <- n[grep("^gen",n)] # Y <- names that begins with "gen" nx1 <- n[grep("^evi",n)] # X <- remaining names nx2 <- n[-c(grep("^evi",n),grep("^gen",n))] form <- multivariateFormula(ny,nx1,nx2,A=c("geology")) fam <- rep("poisson",length(ny)) testcv <- scglrThemeBackward(form,data=genus,H=c(2,2),family=fam,offset = genus$surface,folds=3) # Cross-validation pathway testcv$H_path # Plot criterion plot(testcv$cv_path) # Best combination testcv$H_best ## End(Not run)
Screeplot of percent of overall X variance captured by component
## S3 method for class 'SCGLR' screeplot(x, ...)
## S3 method for class 'SCGLR' screeplot(x, ...)
x |
object of class 'SCGLR', usually a result of running |
... |
optional arguments. |
an object of class ggplot
.
For screeplot application see examples in plot.SCGLR
.
Screeplot of percent of overall X variance captured by component by theme
## S3 method for class 'SCGLRTHM' screeplot(x, ...)
## S3 method for class 'SCGLRTHM' screeplot(x, ...)
x |
object of class 'SCGLRTHM', usually a result of running |
... |
optional arguments. |
an object of class ggplot
.