Goal of this markdown: Walk through an example of how to use AIC to rank models from my field work where I look at the impact of envirnmental factors. I use the same approach for my experiment so I won’t run that code here but I’ll talk about it a little bit incase that’s useful for you. At the end I’ll show you how I graph predicted values from the model from my phenology model list. For my data we’re using mixed linear models (explained below) but I believe this approach works for linear models as well, you would just remove the grouping term.
Mixed linear models vs Linear Models
- Mixed model is the best approach because for my field data since my data is “clustered/grouped” by site but the sites are random representatives of Central SFE (grouped = random). Code for random term: (1 | term) (Don’t include if you’re not doign mixed model approach). I also use it for my experimental data and group it by “trial run” to account for any variation between the dates. - NOTE: for transformed models, denote term with “I.” This proved to be critical when graphing predicted models later. For example, lmer(sqrt(no.fuc.q) ~ week_nr + I(week_nr^2) + (1|site), data=all).
Phenology in field work:
- For “time” variable I use number of weeks since the first survey (with the first survey week set at 0) which will allow me to convert it to a numneric variable and use quadratic terms as needed. This also allows the model to interpolate back to a defined “0”.
Steps:
- What are your research questions and hypotheses? This is key in being strategic in the models you chose to run. Don’t list every posible combination of your variables, think about ecological mechanisms and the literature you’ve read to determine what you’re really interested in. This also includes thinking about what interactions terms you’ll include or not. I’ll list mine models below.
- Set up data. What variables are you interested in? How should you best set up your data? For me, I have my environemtal data summarized with daily ranges and “extreme” events. For each field survey I look at the effects of 30 days of field data 60 days before (not looking at the 30 days leading up to the field survey). When running models, your variables need to be summarized on one line in your data frame because when you run the model it’ll be looking at column names. Take time to make sure your data is set up properly for your research questions.
- Read in and set up data. For me this included creating a new column with “week number.”
- Run AIC with your list of models - Read and interpret output - Examine top ranked model(s) closer
My list of models I am addressing research question #3:
Is variation in salinity, air temperature, or pH associated with changes in abundances or reproductive effort of F. distichus populations in SFE?
After looking at my phenology models, we are more interested in seeing how juvenile abundance changes due to environmental factors. Reviewing the literature, we decided we are interested in extreme events and daily ranges. “Juv” below refers to juvenile density; if I had multiple factors measuring juvenile abundance I would use the same approach multiple times with a different variable (juvenile density with this list of explanatory factors, juvenile x in a seperate list and ranking). Here is the list was came up with:
Helpful links:
- lme4 package: https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
- mixed movels: https://m-clark.github.io/mixed-models-with-R/introduction.html -https://ourcodingclub.github.io/tutorials/mixed-models/ -http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-definition
-https://ademos.people.uic.edu/Chapter18.html -https://www.statology.org/aic-in-r/ Understanding the output: https://campus.datacamp.com/courses/hierarchical-and-mixed-effects-models-in-r/linear-mixed-effect-models?ex=7 -https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/subject.html#12692
Set up.
rm(list=ls())
#AIC
library(AICcmodavg)
library(plm) #for random factors
library(lme4)
library(car)
#Table
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(insight)
library(tibble)
#extra- I'm bad at keeping track of what packages I use
library(tidyverse)
library(ggpubr)
library(scales)
library(chron)
library(plotly)
library(taRifx)
library(aweek)
library(easypackages)
## Error in get(genname, envir = envir) : object 'testthat_print' not found
library(renv)
library(here)
library(ggthemes)
library(gridExtra)
library(patchwork)
library(tidyquant)
library(recipes)
library(cranlogs)
library(knitr)
library(openair)
library(xts)
library(dplyr)
Read in data-
Envirnomental data 30 days leading up to survey day (excluding day of survey) and adult thalli density from the field survey one month before. Juvenile density is the mean per survey per field site.
#read in data
one.month<-read.csv(
"https://raw.githubusercontent.com/Cmwegener/thesis/master/data/environment_field/envi.field.all.one.month.lag.csv",
header = TRUE
)
one.month$date<-as.Date(one.month$date, format = c("%Y-%m-%d"))
Convert date to number of weeks of surveys. Starting at week 0 being the first survey. This will make it a contiuous variable and then I can add a quadratic term to it. This is different that “week number of the year” so I have to use a customized approach. I was able to find a similar question online: https://stackoverflow.com/questions/54406803/assigning-custom-week-number-in-r/54407091, https://stackoverflow.com/questions/19780415/lubridate-week-to-find-consecutive-week-number-for-multi-year-periods
# subtracts minimun week number, +0 sets what week you want to start it on
one.month$week_nr <- (interval(min(one.month$date), one.month$date) %/% weeks(1)) + 0
AIC
#list of models
m1 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
m2 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + max.daily.sal.lt10 + max.daily.air.gt26 + max.daily.ph.gt8 + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
m3 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26 + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
m4 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + max.daily.sal.lt10 + max.daily.air.gt26 + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
m5 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + max.daily.air.gt26 + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
m6 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26, data=one.month)
m7 <- lm(no.small.fuc.q ~ max.daily.sal.lt10 + max.daily.ph.gt8 + max.daily.air.gt26, data=one.month)
m8 <- lm(no.small.fuc.q ~ daily.sal.range + daily.ph.range + max.daily.air.gt26, data=one.month)
m9 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + max.daily.air.gt26, data=one.month)
m10 <- lm(no.small.fuc.q ~ daily.sal.range + daily.ph.range, data=one.month)
m11 <- lm(no.small.fuc.q ~ max.daily.sal.lt10 + max.daily.ph.gt8, data=one.month)
m12 <- lm(no.small.fuc.q ~ max.daily.sal.lt10 + max.daily.air.gt26, data=one.month)
#define list of models
models <- list(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12)
#specify model names- just keeping them the same since I have too many. Here you can change the names, in the same order as defined above, to something like "salinity", "temp", ect
mod.names <- c('m1','m2','m3','m4', 'm5','m6','m7','m8', 'm9','m10', 'm11','m12')
#calculate AIC of each model
aictab(cand.set = models, modnames = mod.names)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## m1 7 395.48 0.00 0.96 0.96 -188.93
## m2 7 401.78 6.30 0.04 1.00 -192.09
## m8 5 424.88 29.40 0.00 1.00 -206.61
## m10 4 425.32 29.84 0.00 1.00 -208.12
## m11 4 430.60 35.12 0.00 1.00 -210.76
## m7 5 433.06 37.58 0.00 1.00 -210.70
## m6 5 515.41 119.92 0.00 1.00 -252.06
## m3 6 517.94 122.46 0.00 1.00 -252.06
## m9 4 521.93 126.45 0.00 1.00 -256.55
## m5 5 524.24 128.75 0.00 1.00 -256.48
## m4 6 526.78 131.30 0.00 1.00 -256.48
## m12 4 569.63 174.15 0.00 1.00 -280.43
From the output I see model 1 is ranked the heightest (lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)). Going to take a closer look at that model.
m1 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
Fitting and analyzing the top ranked model.
m1 <- lm(no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range + max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26, data=one.month)
plot(m1)
m1
##
## Call:
## lm(formula = no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range +
## max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26,
## data = one.month)
##
## Coefficients:
## (Intercept) lag.no.large.fuc.q
## 58.753199 1.000281
## daily.sal.range max.daily.air.gt26
## -4.504252 -0.790638
## daily.ph.range lag.no.large.fuc.q:max.daily.air.gt26
## -1.891923 0.002399
summary(m1)
##
## Call:
## lm(formula = no.small.fuc.q ~ lag.no.large.fuc.q + daily.sal.range +
## max.daily.air.gt26 + daily.ph.range + lag.no.large.fuc.q:max.daily.air.gt26,
## data = one.month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.061 -24.344 -7.122 12.649 104.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.753199 20.286896 2.896 0.00666 **
## lag.no.large.fuc.q 1.000281 1.342969 0.745 0.46165
## daily.sal.range -4.504252 1.774704 -2.538 0.01605 *
## max.daily.air.gt26 -0.790638 0.882614 -0.896 0.37685
## daily.ph.range -1.891923 3.429397 -0.552 0.58489
## lag.no.large.fuc.q:max.daily.air.gt26 0.002399 0.094643 0.025 0.97993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.42 on 33 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.2047, Adjusted R-squared: 0.0842
## F-statistic: 1.699 on 5 and 33 DF, p-value: 0.1624
anova(m1)
## Analysis of Variance Table
##
## Response: no.small.fuc.q
## Df Sum Sq Mean Sq F value Pr(>F)
## lag.no.large.fuc.q 1 1580 1580.1 1.4150 0.24271
## daily.sal.range 1 5706 5706.1 5.1099 0.03051 *
## max.daily.air.gt26 1 1857 1856.8 1.6628 0.20620
## daily.ph.range 1 341 340.9 0.3053 0.58429
## lag.no.large.fuc.q:max.daily.air.gt26 1 1 0.7 0.0006 0.97993
## Residuals 33 36850 1116.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(m1)
## 2.5 % 97.5 %
## (Intercept) 17.4791983 100.0271992
## lag.no.large.fuc.q -1.7320093 3.7325716
## daily.sal.range -8.1149154 -0.8935885
## max.daily.air.gt26 -2.5863298 1.0050534
## daily.ph.range -8.8690826 5.0852374
## lag.no.large.fuc.q:max.daily.air.gt26 -0.1901534 0.1949505
Table
#simple table
tab_model(m1)
| no.small.fuc.q | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 58.75 | 17.48 – 100.03 | 0.007 |
| lag.no.large.fuc.q | 1.00 | -1.73 – 3.73 | 0.462 |
| daily.sal.range | -4.50 | -8.11 – -0.89 | 0.016 |
| max.daily.air.gt26 | -0.79 | -2.59 – 1.01 | 0.377 |
| daily.ph.range | -1.89 | -8.87 – 5.09 | 0.585 |
|
lag.no.large.fuc.q * max.daily.air.gt26 |
0.00 | -0.19 – 0.19 | 0.980 |
| Observations | 39 | ||
| R2 / R2 adjusted | 0.205 / 0.084 | ||
#modified table
tab_model(m1, #models to table
show.ci = 0.95, #show 95% CI
show.se = TRUE, #show standar error
show.stat = FALSE,
emph.p = TRUE, #bold "significant" p values
df.method = "kr",
p.style = "numeric_stars",
title = "Linear model for small thalli density",
linebreak = TRUE,
dv.labels = c("small thalli density" #change name of column
),
pred.labels = c( #change names of rows
"(Intercept)",
"large thalli density",
"daily salinity range",
"number of days with a maximum daily air temperature greater than 26C",
"daily pH range",
"large thalli density : number of days with a maximum daily air temperature greater than 26C"
),
CSS = css_theme("cells")
)
| small thalli density | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | p |
| (Intercept) | 58.75 ** | 20.29 | 17.48 – 100.03 | 0.007 |
| large thalli density | 1.00 | 1.34 | -1.73 – 3.73 | 0.462 |
| daily salinity range | -4.50 * | 1.77 | -8.11 – -0.89 | 0.016 |
| number of days with a maximum daily air temperature greater than 26C | -0.79 | 0.88 | -2.59 – 1.01 | 0.377 |
| daily pH range | -1.89 | 3.43 | -8.87 – 5.09 | 0.585 |
| large thalli density : number of days with a maximum daily air temperature greater than 26C | 0.00 | 0.09 | -0.19 – 0.19 | 0.980 |
| Observations | 39 | |||
| R2 / R2 adjusted | 0.205 / 0.084 | |||
|
||||
Graphing the predicted values of the models and the predicted values of the model per site
I don’t do this with my model list above/my environmental data but I thought this might be helpful for you. I’ll walk though an example with my phenology model.
2.4 Does small thalli density change over time? sqrt(no.small.fuc.q ~ week_nr + week_nr^2 +((1|site)
Added position dodge to the site trend lines because they were stacked (https://stackoverflow.com/questions/10866047/jitter-geom-line)
#packages, to use ggpredict
library(ggeffects)
#read in data
all<-read.csv(
"https://raw.githubusercontent.com/Cmwegener/thesis/master/data/environment_field/envi.field.all.csv",
header = TRUE
)
all$date<-as.Date(all$date, format = c("%Y-%m-%d"))
#add week number
all$week_nr <- (interval(min(all$date), all$date) %/% weeks(1)) + 0
#remove rows with NA
no.na<-all[!is.na(all$no.small.fuc.q), ]
#model-The I is important here!
mlm2.4<-lmer(sqrt(no.small.fuc.q) ~ week_nr + I(week_nr^2) + (1|site), data=no.na)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
# Extract the prediction data frame
pred.mm <- ggpredict(mlm2.4, terms = "week_nr [all]")
#add predicted values from the model to the dataset
no.na$predlm = predict(mlm2.4)
#assign colors to models (needs work)
COLORS <- c(group="black", pc.cc = "#66C2A5", nd.eos ="#FC8D62",
by.rb = "#8DA0CB" , hs.fp = "#E78AC3")
#plot
ggplot() +
geom_ribbon(data=pred.mm,aes(x = x,
ymin = predicted - std.error,
ymax = predicted + std.error
),
fill = "lightgrey",
alpha = 0.5
) +
geom_line(data= pred.mm, aes(x = x, y = predicted), size=1.5) +
geom_line(data= no.na, aes(x= week_nr, y = predlm, colour = site), size = 1) +
geom_point(data = no.na, aes(x = week_nr, y = sqrt(no.small.fuc.q), colour = site)) +
labs(x = "Weeks since first survey",
y = "sqrt(Small Thalli Density/0.25m^2)",
title= "sqrt(Small Thalli Density/0.25m^2) = 10.25 - 0.33(week) + 0.0038(week)^2") +
theme_minimal()+
scale_color_manual(
name= "Predictive Models",
values = COLORS,
labels = c("Overall Model",
"Paradise Cay",
"Point Chauncey",
"Brickyard Park",
"Horseshoe Bay"
)
)+
theme_bw(base_size=12)