\SweaveOpts{eval=true, echo=true,prefix=FALSE, width=4.5, height=4.5, eps=FALSE, include=FALSE}
\chapter{A Brief Introduction to \R}
<<echo=false, results=hide>>=
setwd("~/Documents/Projects/Book/draft/Appendix")
rm(list=ls())
badfonts = FALSE
.detach.stuff = function() {
  s1 = grep("package|Autoloads",search())
  nattach = length(search())
  xattach = search()[c(-1,-s1)]
  for (i in xattach)
    eval(substitute(detach(i),list(i=i)))
}
.detach.stuff()
## ps.options(family="NimbusSan")
ps.options(colormodel="cmyk")
options(width=65)
## palette(gray((0:8)/8))
#library(lattice)
trellis.par.set(canonical.theme(color=FALSE))
library(primer)
@

\emph{R is a language. Use it every day, and you will learn it quickly.}

\S, the precursor to \R, is a quantitative programming environment developed at AT\&T Bell Labs in the 1970s. S-Plus is a commercial, ``value-added'' version begun in the 1980s, and \R~was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians form the core development group of the \R~language, including the primary developer of the original \S~language, John Chambers, of Bell Labs. \R~is an official part of the Free Software Foundation's GNU project\footnote{Pronounced ``g-noo'' --- it is a recursive acronym standing for ``GNU's Not Unix.''} (http://www.fsf.org/). It is free to all, and licensed to stay that way.

\R~is a language and environment for dynamical and statistical computing and graphics. \R~is similar to the S language, but is a different implementation of \S; technically speaking, \R~is a ``dialect'' of \S. \R~provides a very wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. \R~has become the lingua franca of academic statistics, and is very useful for a wide variety of computational fields, such as theoretical ecology.

\section{Strengths of \textsf{R/S}}
\begin{itemize}
\item Simple and compact syntax of the language. You can learn \R~quickly, and can accomplish a lot with very little code.
\item Extensibility. Anyone can extend the functionality of \R~by writing code. This may be a simple function for personal use, or a whole new family of statistical procedures in a new package.
\item A huge variety of statistical and computing procedures. This derives from the ease with which \R/\S~can be extended and shared by users around the world.
\item Rapid updates.
\item Replicability and validation. All data analyses should be well documented, and this only happens reliably when the analyses are performed with scripts or programs, as in \R~or SAS. Point-and-click procedures cannot be validated by supervisors, reviewers, or auditors.
\item Getting help from others is easy. Any scripted language can be quickly and easily shared with someone who can help you. I cannot help someone who says ``first I clicked on this, and then I clicked on that \ldots.''
\item Repetitive tasks simplified. Writing code allows you to do anything you want a huge number of times. It also allows very simple updates with new data.
\item High quality graphics. Well-designed publication-quality plots can be produced with ease, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
\item \R~is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs out of the box on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux). It also compiles and runs on Windows 9x/NT/2000 and Mac OS.
\item Accessibility. Go now to www.r-project.org. Type ``\R'' into Google. The \R~Project page is typically the first hit.
\end{itemize}

There is a tremendous amount of free documentation for \R. \R~comes with a selection of manuals under the ``Help'' menu --- explore these first. At the main \R~project web site, see the ``Documentation'' on the left sidebar. The FAQs are very helpful. The ``Other'' category includes a huge variety of items; search in particular for ``Contributed documentation.''\footnote{I find this when I google ``r `contributed documentation'.''} There you will find long (100+ pages) and short tutorials. You will also find two different ``\R~Reference Cards,'' which are useful lists of commonly used functions.\footnote{Try googling `R Reference Card' in quotes.}

\section{The \R~Graphical User Interface (GUI)}
\R~has a very simple but useful graphical user interface (GUI; Fig. \ref{fig:gui}). A few points regarding the GUI:
\begin{itemize}
\item ``You call this a graphical user interface?'' Just kidding --- the GUI is \emph{not} designed for point-and-click modeling.
\item The \R~GUI is designed for package management and updates.
\item The \R~GUI is designed for use with scripts.
\end{itemize}
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{Screenshot1.png}
\caption{The Mac OS X \R~GUI. Color-coded syntax is not visible in this figure.}
\label{fig:gui}
\end{figure}

The \R~GUI does not provide a ``statistics package.'' \R~is a language and programming environment. You can download an \R~package called \texttt{Rcmdr} that provides a point-and-click interface for introductory statistics, if you really, really want to. In my experience, students who plan to use statistics in their research find it more frustrating to learn this interface than to learn to take advantage of the language.

The \R~GUI \emph{is} designed for maintenance. With the \R~GUI you can check for updates, and download any of the hundreds of small packages that extend \R~in hundreds of ways. (A package is not unlike a ``PROC,'' for SAS users --- first you call it, then you use it).

The \R~GUI \emph{is} designed for using scripts. Scripts are text files that contain your analysis; that is, they contain both code to do stuff, and comments about what you are doing and why. These are opened within \R~and allow you to do work and save the code.
\begin{itemize}
\item Scripts are NOT Microsoft Word documents that require Microsoft Word to open them, but rather simple text files, such as one could open in Notepad or SimpleText.
\item Scripts are a written record of everything you do along the way to achieving your results.
\item Scripts are the core of data analysis, and provide many of the benefits of using a command-driven system, whether \R, \textsf{Matlab}, or some other program or environment.
\item Scripts are interactive. I find that because scripts allow me to do anything and record what I do, they are very interactive. They let me try a great variety of different things very quickly. This will be true for you too, once you begin to master the language.
\end{itemize}

The \R~GUI can be used for simple command line use. At the command line, you can add $2 + 2$ to get 4.
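A complete script is nothing more than such commands plus comments. Here is a minimal sketch (the object name and values are arbitrary; any comment and a line or two of code will do).
<<eval=false>>=
# My first script: comments begin with a pound sign
x <- c(4, 8, 15)   # create a small vector
mean(x)            # summarize it; R prints the result
@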
You could also do \texttt{ar(lake.phosphorus)} to perform autoregressive time series analysis on a variable called \texttt{lake.phosphorus}, but you would probably want to do that in a script that you can save and edit, to keep track of the analysis that you are doing.

\section{Where is \R?}
As an Open Source project, \R~is distributed across the web. People all around the world continue to develop it, and much of it is stored on large computer servers (``mirrors'') across the world. Therefore, when you download \R, you download only a portion of it --- the language and a few base packages that help everything run. Hundreds of ``value-added'' packages are available to make particular tasks and groups of tasks easier. We will download one or more of these.

It is useful to have a clear conception of where different parts of \R~reside (Fig. \ref{fig:Rcomman}). Computer servers around the world store identical copies of everything (hence the terms ``archive'' and ``mirrors''). When you open \R, you load into your computer's virtual, temporary RAM more than just the \R~language --- you automatically load several useful packages including ``base,'' ``stats,'' and others. Many more packages exist (about a dozen come with the normal download) and hundreds are available at each mirror. These are easily downloaded through the \R~GUI.
\label{sec:where-r}
\begin{figure}[ht]
\centering
\includegraphics[width=8cm]{RComManage.pdf}
\caption{A conceptual representation of where \R~exists. ``CRAN'' stands for ``Comprehensive \R~Archive Network.'' ``RAM'' (random access memory) is your computer's active brain; \R~keeps some stuff floating in this active memory and this ``stuff'' is the \emph{workspace}.
\label{fig:Rcomman}}
\end{figure}

\section{Starting at the Very Beginning}
To begin with, we will go through a series of steps that will get you up and running using a script file with which to interact with \R, and using the proper working directory. Start here.
\begin{enumerate}
\item Create a new directory (i.e., a folder) in ``Documents'' (Mac) or ``My Documents'' (Windows), \emph{and call it} ``Rwork.'' For now, calling it the same thing as everyone else will just simplify your life. If you put ``Rwork'' elsewhere, adjust the relevant code as you go. For now, keep all of your script files and output files in that directory.
\item Open the \R~GUI in a manner that is appropriate for your operating system and setup (e.g., double-click the desktop icon).
\item Set the \emph{working directory.} You can do this via the \textsf{Misc} menu in Mac OS X or in the \textsf{File} menu in Windows using ``Change dir....'' Set the working directory to ``Rwork.'' (If you have not already made an \texttt{Rwork} directory, do so now --- put it in ``Documents'' or ``My Documents.'')
\item Open a new \R~script (``New Document'') by using the \textsf{File} menu in the \R~GUI. On the first line, type \texttt{\# My first script} with the pound sign. On the next line, type \texttt{setwd(`$\sim$/Documents/Rwork')} if you are on a Mac, or \texttt{setwd(`C:/Documents and Settings/Users/Jane/My Documents/Rwork')} on Windows, assuming you are named ``Jane;'' if not, use the appropriate pathname. Save this file in ``Rwork;'' save it as ``RIntro.R.'' Windows may hide the fact that it is saving it as a ``.txt'' file. I will assume you can prevent this.
\end{enumerate}

You should submit code to \R~directly from the script.
\emph{Use the script to store your ideas as comments (beginning with \#) and your code, and submit code directly from the script file within \R~(see below for how to do that)}. You do \emph{not} need to cut-and-paste. There are slight differences between operating systems in how to submit code from the script file to the command line.
\begin{itemize}
\item In Microsoft Windows, place the cursor on a line in the script file \emph{or} highlight a section of code, and then hit \texttt{Ctrl-R} to submit the code.
\item In Apple Mac OS X, highlight a section of code and then hit \texttt{Command-Return} to submit the code (see the \textsf{Edit} menu).
\end{itemize}
\emph{From this point on, enter all of the code (indicated in typewriter font, and beginning with ``>'') in your script file, save the file with Command-S (Mac) or Ctrl-S (Windows), and then submit the code as described above.} Where a line begins with ``+,'' ignore the plus sign, because this is just used by \R~to indicate continued lines of code. You may enter a single command across more than one line; \R~will simply continue as if you wrote it all on one line.

You can start the help interface with this command.
<<eval=false>>=
help.start()
@
This starts the HTML interface to on-line help (using a web browser available at your machine). You can use help within the \R~GUI, or with this HTML help system.

Find out where you are using \texttt{getwd()} (\emph{get} the \emph{w}orking \emph{d}irectory). Use a comment (beginning with \#) to remind yourself what this does.
<< keep.source=true, results=hide >>=
# Here I Get my Working Directory; that is,
# I find out which folder R is currently operating from.
getwd()
@
The precise output will depend on your computer. You can also \emph{set} the \emph{w}orking \emph{d}irectory using \texttt{setwd()}; if you created a directory called \texttt{Rwork} as specified above, one of the following should work, depending on your operating system. If these both fail, keep trying, or use the menu in the \R~GUI to set the working directory.
<<eval=false>>=
setwd("~/Documents/Rwork")
@
or
<<eval=false>>=
setwd("C:/Documents and Settings/Users/Jane/My Documents/Rwork")
@
On the Mac, a UNIX environment, the tilde-slash ($\sim/$) represents your home directory. \emph{I urge you to use \texttt{setwd} at the beginning of each script file you write so that this script always sets you up to work in a particular, specified directory.}

As you write your script file, remember,
\begin{itemize}
\item Text that begins with ``\#'' will be ignored by \R.
\item Text that does not make sense to \R~will cause \R~to return an error message, but will not otherwise mess things up.
\end{itemize}
Remember that a strength of \R~is that you can provide, to yourself and others, comments that explain every step and every object. To add a comment, simply type one or more ``\#,'' followed by your comment.

Finally, have fun, be amused. Now you are ready for programming in \R.

\emph{R is a language. Use it every day, and you will learn it quickly.}

\chapter{Programming in \R}
This material assumes you have completed Appendix A, the overview of \R. Do the rest of this Appendix in a script. Make comments of your own throughout your script. Open and save a new file (or \emph{script}) in \R. Think of your script as a pad of paper, on which you program and \emph{also on which you write notes to yourself}. See Appendix A for instructions. You will want to take copious notes about what you do. Make these notes in the script file, using the pound sign \#.
Here is an example:
<<>>=
# This will calculate the mean of 10 random standard normal variables.
mean( rnorm( 10 ) )
@
You submit this (as described above) from the script directly to the console: select the code and hit Command-Return on a Mac, or Ctrl-R on Windows.

\section{Help}
You cannot possibly use \R~without using its help facilities. \R~comes with a lot of documentation (see the relevant menu items), and people are writing more all the time (this document is an example!). After working through this tutorial, you could go through the document ``An Introduction to \R'' that comes with \R. You can also browse ``Keywords by Topic'' which is found under ``Search Engine \& Keywords'' in the Help menu. To access help for a specific function, try
<<>>=
?mean
@
or
<<eval=false>>=
help(mean)
@
The help pages have a very regular structure. There is a \emph{name}, a brief \emph{description}, its \emph{usage}, its \emph{arguments}, \emph{details} about particular aspects of its use, the \emph{value} (what you get when you use the function), \emph{references}, \emph{links} to other functions, and last, \emph{examples}.

If you don't know the exact name of the \R~function you want help with, you can try
<<eval=false>>=
help.search("mean")
apropos("mean")
@
These will provide lists of places you can look for functions related to this keyword. Last, a great deal of \R~resides in \emph{packages} online (on duplicate servers around the world). If you are on line, help for functions that you have not yet downloaded can be retrieved with
<<eval=false>>=
RSiteSearch("violin")
RSiteSearch("violin",restrict = c("functions"))
@
To learn more about help functions, combine them!
<<eval=false>>=
help(RSiteSearch)
@

\section{Assignment}
In general in \R, we perform an action, and take the results of that action and \emph{assign} the results to a new object, thereby creating a new object. Here I add two numbers and assign the result to a new object I call \texttt{a}.
<<>>=
a <- 2+3
a
@
Note that I use an arrow to make the assignment --- I make the arrow with a less-than sign, $<$, and a dash. Note also that to reveal the contents of the object, I can type the name of the object. I can then use the new object to perform another action, and assign the result to yet another object.
<<>>=
b <- a+a
@
I can perform two actions on one line by separating them with a semicolon.
<<>>=
a+a; a+b
@
Sometimes the semicolon is referred to as an ``end of line'' operator.

\section{Data Structures}
We refer to a single number as a \emph{scalar}; a scalar is a single real number. Most objects in \R~are more complex. Here we describe some of these other objects: vectors, matrices, data frames, lists, and functions.

\subsection{Vectors}
Perhaps the fundamental unit of \R~is the \emph{vector}, and most operations in \R~are performed on vectors. A vector may be just a column of scalars, for instance; this would be a column vector. Here we create a vector called \texttt{Y}. To enter data directly into \R, we will use \texttt{c()} and create an \R~object, in particular a vector. A vector is, in this case, simply a group of numbers arranged in a row or column. Type into your script
<<>>=
Y <- c(8.3,8.6,10.7,10.8,11,11,11.1,11.2,11.3,11.4)
@
where the arrow is a less-than sign, \texttt{<}, and a dash, \texttt{-}. Similarly, you could use
<<>>=
Y = c(8.3,8.6,10.7,10.8,11,11,11.1,11.2,11.3,11.4)
@
These are equivalent.

\R~operates (does stuff) to \emph{objects}. Those objects may be \emph{vectors}, \emph{matrices}, \emph{lists}, or some other class of object.

\subsubsection{Sequences}
I frequently want to create ordered sequences of numbers.
\R~has a shortcut for sequences of integers, and a slightly longer method that is completely flexible. First, integers:
<<>>=
1:4
4:1
-1:3
-(1:3)
@
Now more complex stuff, specifying either the units of the sequence, or the total length of the sequence.
<<>>=
seq(from=1, to=3, by = .2)
seq(1, 3, by=.2)
seq(1, 3, length=7)
@
I can also fill in with repetitive sequences. Compare these examples carefully.
<<>>=
rep(1, 3)
rep(1:3, 2)
rep(1:3, each=2)
@

\subsection{Getting information about vectors}
Here we can ask \R~to tell us about \texttt{Y}, getting the sum, the mean, the maximum, the length (the number of elements), and a six-number summary.
<<>>=
sum(Y)
mean(Y)
max(Y)
length(Y)
summary(Y)
@
A vector could be character or logical as well, for instance
<<>>=
Names <- c("Sarah", "Yunluan")
Names
b <- c(TRUE, FALSE)
b
@
Vectors can also be dates, complex numbers, real numbers, integers, or factors. For factors, such as experimental treatments, see section \ref{sec:factors}. We can also ask \R~what classes of data these belong to.
<<>>=
class(Y)
class(b)
@
Here we test whether each element of a vector is greater than a particular value or greater than its mean. \emph{When we test an object, we get a logical vector back that tells us, for each element, whether the condition was true or false}.
<<>>=
Y > 10
Y > mean(Y)
@
We test using \texttt{>}, \texttt{<}, \texttt{>=}, \texttt{<=}, \texttt{==}, \texttt{!=} and other conditions. Here we test whether a vector is equal to a number.
<<>>=
Y == 11
@
Here is a test of ``not equal to.''
<<>>=
Y != 11
@
This result turns out to be quite useful, including when we want to \emph{extract} subsets of data.

\subsubsection{Algebra with vectors}
In \R, we can add, subtract, multiply and divide vectors. When we do this, we are really \emph{operating on the elements in the vectors}. Here we add vectors \texttt{a} and \texttt{b}.
<<>>=
a <- 1:3
b <- 4:6
a + b
@
Similarly, when we multiply or divide, we also operate on each pair of elements in the pair of vectors.
<<>>=
a*b
a/b
@
We can also use scalars to operate on vectors.
<<>>=
a + 1
a * 2
1/a
@
What \R~is doing is \emph{recycling} the scalar (the 1 or 2) as many times as it needs to in order to match the length of the vector. Note that if we try to multiply vectors of unequal length, \R~performs the operation but may or may not give a warning. Above, we got no warning message. However, if we multiply a vector of length 3 by a vector of length 2, \R~returns a warning.
<<>>=
a*1:2
@
\R~\emph{recycles the shorter vector} just enough to match the length of the longer vector. The above is the same as
<<>>=
a*c(1,2,1)
@
On the other hand, if we multiply vectors of length 4 and 2, we get no warning, because four is a multiple of two.
<<>>=
1:4 * 1:2
@
Recycling makes the above the same as the following.
<<>>=
1:4 * c(1,2,1,2)
@

\subsection{Extraction and missing values}
We can extract or subset elements of the vector. I extract subsets of data in two basic ways, by
\begin{itemize}
\item identifying which rows (or columns) I want (e.g., the first row), or
\item providing a \emph{logical} vector (of TRUE's and FALSE's) of the same length as the vector I am subsetting.
\end{itemize}
Here I use the first method, using a single integer, and a sequence of integers.
<<>>=
Y[1]
Y[1:3]
@
Now I want to extract all values of \texttt{Y} greater than the average. Although I don't have to, I remind myself what the logical vector looks like, and then I use it.
<<>>=
Y > mean(Y)
Y[ Y > mean(Y) ]
@
Note that I get back all the values of the vector where the condition was TRUE.
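Because \R~treats \texttt{TRUE} as 1 and \texttt{FALSE} as 0 in arithmetic, the same kind of logical test can also count the elements that meet a condition; a small sketch:
<<>>=
sum( Y > mean(Y) )  # how many values exceed the mean
@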
In \R, \emph{missing data} are of the type ``NA.'' This means ``not available,'' and \R~takes this appellation seriously. Thus, if you try to calculate the mean of a vector with missing data, \R~resists doing it, because if there are numbers missing from the set, how could it possibly calculate a mean? If you ask it to do something with missing data, the answer will be missing too. Given that \R~treats missing data as missing data (and not something to be casually tossed aside), there are special methods to deal with such data. For instance, we can test which elements are missing with a special function, \texttt{is.na}.
<<>>=
a <- c(5,3,6, NA)
a
is.na(a)
!is.na(a)
a[ !is.na(a) ]
na.exclude(a)
@
Some functions allow you to remove missing elements on the fly. Here we let a function fail with missing data, and then provide two different ways to get the same thing.
<<>>=
mean(a)
mean(a, na.rm=TRUE)
d <- na.exclude(a)
mean(d)
@
Note that \R~takes missing data seriously. If the fourth element of the set really is missing, I cannot calculate a mean because I don't know what the vector is.

\subsection{Matrices}
A matrix is a two-dimensional set of elements, for which \emph{all elements are of the same type}. Here is a character matrix.
<<>>=
matrix(letters[1:4], ncol=2)
@
Here we make a numeric matrix.
<<>>=
M <- matrix( 1:4, nrow=2 )
M
@
Note that the matrix is filled in by columns, or \emph{column major order}. We could also do it by rows.
<<>>=
M2 <- matrix( 1:4, nrow=2, byrow=TRUE )
M2
@
Here is a matrix with 1s on the diagonal.
<<>>=
I <- diag(1, nrow=2)
I
@
Note that \R~recycles the ``1'' until the specified number of rows and columns are filled. If we do not specify the number of rows and columns, \R~fills in the matrix with what you give it (as it did above). The identity matrix plays a special role in matrix algebra; in many ways it is equivalent to the scalar 1. For instance, the inverse of a matrix, \textbf{M}, is \textbf{M}$^{-1}$, the matrix that satisfies the equality $\mathbf{MM^{-1}}=\mathbf{I}$, where \textbf{I} is the identity matrix. We solve for the inverse using a few different methods, including
<<>>=
Minv <- solve(M)
M %*% Minv
@
QR decomposition is also available (e.g., \texttt{qr.solve()}).

\subsubsection{Extraction in matrices}
I extract elements of matrices in the same fashion as vectors, but specify both rows and columns.
<<>>=
M[1,2]
M[1, 1:2]
@
If I leave either rows or columns blank, \R~returns all rows (or columns).
<<>>=
M[,2]
M[,]
@

\subsubsection{Simple matrix algebra}
Basic matrix algebra is similar to algebra with scalars, but with a few very important differences. Let us define another matrix.
<<>>=
N <- matrix(0:3, nrow=2)
N
@
To perform scalar, or element-wise, operations, we have
\begin{align}
\label{exM}
\vec{A} & = \left( \begin{array}{cc} a & b \\ c & d \end{array} \right); \;
\vec{B} = \left(\begin{array}{cc} m & o\\ n & p\end{array}\right)\\
\vec{AB} &= \left( \begin{array}{cc} am & bo\\ cn & dp \end{array}\right)
\end{align}
The element-wise operation on these two is the default in \R,
<<>>=
M * N
@
where the element in row 1, column 1 in \texttt{M} is multiplied by the element in the same position in \texttt{N}. To perform \emph{matrix multiplication}, recall from Chap.
2 that,
\begin{align}
\label{exM2}
\vec{A} & = \left( \begin{array}{cc} a & b \\ c & d \end{array} \right); \;
\vec{B} = \left(\begin{array}{cc} m & o\\ n & p\end{array}\right)\\
\vec{AB} &= \left( \begin{array}{cc} \left( am + bn \right) & \left(ao+bp \right)\\ \left(cm + dn \right) & \left(co + dp \right) \end{array}\right)
\end{align}
To perform \emph{matrix multiplication} in \R, we use \texttt{\%*\%},
<<>>=
M%*%N
@
Refer to Chapter 2 (or ``matrix algebra'' at Wikipedia) for why this is so. Note that matrix multiplication is not commutative, that is, $\mathbf{NM} \ne \mathbf{MN}$. Compare the previous result to
<<>>=
N%*%M
@
Note that a \emph{vector} in \R~is not defined \emph{a priori} as a column matrix or a row matrix. Rather, it is used as either, depending on the circumstances. Thus, we can either left multiply or right multiply \texttt{M} by a vector of length 2.
<<>>=
1:2%*%M
M%*%1:2
@
If you want to be very, very clear that your vector is really a matrix with one column (a column vector), you can make it thus.
<<>>=
V <- matrix(1:2, ncol=1)
@
Now when you multiply \texttt{M} by \texttt{V}, you will get the expected success and failure, according to the rules of matrix algebra.
<<>>=
M %*% V
try(V %*% M)
@
\R~has formal rules about how it converts vectors to matrices on-the-fly, but it is good to be clear on your own.

Other matrix operations are available. Whenever we add or subtract matrices together, or add a matrix and a scalar, it is always element-wise.
<<>>=
M + N
M + 2
@
The transpose of a matrix is the matrix we get when we substitute rows for columns, and columns for rows. To transpose matrices, we use \texttt{t()}.
<<>>=
t(M)
@
More advanced matrix operations are available as well, for singular value decomposition (\texttt{svd}), eigenanalysis (\texttt{eigen}), finding determinants (\texttt{det}), QR decomposition (\texttt{qr}), Choleski factorization (\texttt{chol}), and related functions. The \texttt{Matrix} package was designed to handle large sparse matrices with aplomb.

\subsection{Data frames}
Data frames are two dimensional, a little like spreadsheets and matrices. All columns have exactly the same number of rows. Unlike matrices, each column can be a different data type (e.g., numeric, integer, character, complex). For instance, the columns of a data frame could contain the names of species, the experimental treatment used, and the dimensions of species traits, as character, factor, and numeric variables, respectively.
<<>>=
dat <- data.frame(species=c("S.altissima", "S.rugosa",
                            "E.graminifolia", "A. pilosus"),
                  treatment=factor( c("Control", "Water", "Control", "Water") ),
                  height = c(1.1, 0.8, 0.9, 1.0),
                  width = c(1.0, 1.7, 0.6, 0.2) )
dat
@
We can extract data from data frames just the way we can with matrices.
<<>>=
dat[2,]
dat[3,4]
@
We can test elements in data frames, as here where I test whether each element of column 2 is ``Water.'' I then use that to extract rows of data that are associated with this criterion.
<<>>=
dat[,2]=="Water"
dat[ dat[,2]=="Water", ]
@
I could also use the subset function
<<>>=
subset(dat, treatment == "Water")
@
There are advantages to using data frames which will become apparent.

\subsubsection{Factors}
\label{sec:factors}
Factors are a class of data; as such they could belong above with our discussion of character and logical and numeric vectors.
I tend, however, to use them in data frames almost exclusively, because I have a data set that includes a bunch of response variables, and \emph{the factors imposed by my experiment}. When defining a factor, \R~by default orders the factor levels in alphabetic order --- we can reorder them as we like. Here I demonstrate each piece of code and then use the pieces to make a factor in one line of code.
<<>>=
c("Control", "Medium", "High")
rep( c("Control", "Medium", "High"), each=3 )
Treatment <- factor( rep( c("Control", "Medium", "High"), each=3 ) )
Treatment
@
Note that \R~orders the factor alphabetically. This may be relevant if we do something with the factor, such as when we plot it (Fig. \ref{fig:strip1}).
<<echo=false, results=hide>>=
quartz(,4,4)
<<>>=
levels(Treatment)
stripchart(1:9 ~ Treatment)
<<echo=false, results=hide>>=
dev.print(pdf, "strip.pdf")
@
\begin{figure}[ht]
\centering
\subfloat[Before]{\includegraphics[width=4cm]{strip.pdf} \label{fig:strip1}}
\subfloat[After]{\includegraphics[width=4cm]{strip2.pdf} \label{fig:strip2}}
\caption{Graphics before and after the factor was releveled to place the factor levels in a logical order.}
\end{figure}
Now we can re-specify the factor, telling \R~the order of the levels we want, taking care to remember that \R~can tell the difference between upper and lower case (Fig. \ref{fig:strip2}). See also the function {\tt relevel}.
<<echo=false, results=hide>>=
quartz(,4,4)
<<>>=
Treatment <- factor( rep( c("Control", "Medium", "High"), each=3 ),
                     levels=c("Control", "Medium", "High") )
levels(Treatment)
stripchart(1:9 ~ Treatment)
<<echo=false, results=hide>>=
dev.print(pdf, "strip2.pdf")
@

\subsection{Lists}
An amazing data structure that \R~boasts is the \emph{list}. A \emph{list} is simply a collection of other objects kept together in a hierarchical structure. Each component of the list can be a completely different class of object. Let's build one.
<<>>=
my.list <- list(My.Y = Y, b = b, Names, Weed.data=dat,
                My.matrix = M2, my.no = 4 )
my.list
@
We see that this list is a set of objects: a numeric vector, a logical vector, a character vector, a data frame, a matrix, and a scalar (a number). Lists can be nested within other lists. Note that if we do not specify a name for a component, we can still extract it using the number of the component. I extract list components in several ways, including by name, and by number (see \texttt{?'['} for more information).
<<>>=
my.list[["b"]]
my.list[[2]]
@
If I use a name, there are a few ways, including
<<>>=
my.list[["b"]]
my.list$b
@
If by number, there are two ways, with one or two brackets. In addition to two brackets, as above, we can use one bracket; this allows for extraction of more than one component of the list.
<<>>=
my.list[1:2]
@
Note that I can extract a subset of one component.
<<>>=
my.list[["b"]][1]
@
If one way of extraction is not working for you, experiment with others.

\subsection{Data frames are also lists}
You can also think of a data frame as a list of columns of identical length. I like to extract columns the same way --- by name.
<<>>=
mean( dat$height )
@

\section{Functions}
A function is a command that does something. You have already been using functions throughout this document. Let's examine functions more closely. Among other things, a function has a name, arguments, and values. For instance,
<<eval=false>>=
help(mean)
@
This will open the help page (again), showing us the \emph{arguments}. The first argument \texttt{x} is the object for which a mean will be calculated. The second argument is \texttt{trim=0}.
If we read about this argument, we find that it will ``trim'' a specified fraction of the most extreme observations of \texttt{x}. \emph{The fact that the argument \texttt{trim} is already set equal to zero means that zero is the default}. If you do not use \texttt{trim}, then the function will use \texttt{trim=0}. Thus, these two are equivalent.
<<>>=
mean(1:4)
mean(1:4, trim=0)
@
\R~is an ``object-oriented'' language. A consequence of this is that the same function name will perform different actions, depending on the \emph{class} of the object.\footnote{\R~has hundreds of built-in data sets for demonstrating things. We use one here called 'warpbreaks.' You can find some others by typing \texttt{data()}.}
<<>>=
class(1:10)
class(warpbreaks)
summary( 1:10)
summary(warpbreaks)
@
In the \texttt{warpbreaks} data frame, summary provides the six-number summary for each numeric or integer column, but provides ``tables'' of the factors, that is, it counts the occurrences of each level of a factor and sorts the levels. When we use summary on a linear model, we get output of the regression,
<<>>=
summary( lm(breaks ~ wool, data=warpbreaks) )
@

\subsection{Writing your own functions}
\label{sec:writing-your-own}
One very cool thing in \R~is that you can write your own functions. Indeed it is the extensibility of \R~that makes it the home of cutting-edge work, because the edge cutters (i.e., leading scientists) can write code that we all can use. People actually write entire \emph{packages}, which are integrated collections of functions, and \R~has been extended with hundreds of such packages available for download at all the \R~mirrors.

Let's make our own function to calculate a mean. Let's further pretend you work for an unethical boss who wants you to show that average sales are higher than they really are. Therefore your function should provide a mean plus 5\%.
<<>>=
MyBogusMean <- function(x, cheat=0.05) {
  SumOfX <- sum(x)
  n <- length(x)
  trueMean <- SumOfX/n
  (1 + cheat) * trueMean
}
RealSales <- c(100, 200, 300)
MyBogusMean(RealSales)
@
Thus a function can take any input and do stuff, including producing graphics, interacting with the operating system, or manipulating numbers. You decide on the arguments of the function, in this case, \texttt{x} and \texttt{cheat}. Note that we supplied a number for \texttt{cheat}; this results in the \texttt{cheat} argument having a \emph{default} value, and we do not have to supply it. If an argument does not have a default, we have to supply it. If there is a default value, we can change it. Now try these.
<<>>=
MyBogusMean(RealSales, cheat=0.1)
MyBogusMean(RealSales, cheat=0)
@

\section{Sorting}
We often like to sort our numbers and our data sets; sorting a single vector is easy. To do anything else is only a little more difficult.
<<>>=
e <- c(5, 4, 2, 1, 3)
e
sort(e)
sort(e, decreasing=TRUE)
@
If we want to sort all the rows of a data frame, keeping records (rows) intact, we can use \texttt{order}. This function is a little tricky, so we explore its use with a vector.
<<>>=
e
order(e)
e[ order(e) ]
@
Here \texttt{order} generates an \emph{index} to properly \emph{order} something. Above, this index is used to tell \R~to select the 4th element of \texttt{e} first --- \texttt{order} puts the number '4' into the first spot, indicating that \R~should put the 4th element of \texttt{e} first. Next, it places '3' in the second spot because the 3rd element of \texttt{e} belongs in the 2nd spot of an ordered vector, and so on. We can use \texttt{order} to sort the rows of a data frame.
Here I order the rows of the data frame according to increasing order of plant heights.
<<>>=
dat
order.nos <- order(dat$height)
order.nos
@
This tells us that to order the rows, we have to use the 2nd row of the original data frame as the first row in the ordered data frame, the 3rd row as the new second row, etc. Now we use this index to select the rows of the original data frame in the correct order to sort the whole data frame.
<<>>=
dat[order.nos, ]
@
We can reverse this too, of course.
<<>>=
dat[ rev(order.nos), ]
@

\section{Iterated Actions: the \texttt{apply} Family and Loops}
\label{sec:iter-acti-apply}
We often want to perform an action again and again and again\ldots, perhaps thousands or millions of times. In some cases, each action is independent --- we just want to do it a lot. In these cases, we have a choice of methods. Other times, each action depends on the previous action. In this case, I always use for-loops.\footnote{There are other methods we could use. These are discussed by others, under various topics, including ``flow control.'' We use ODE solvers for continuous ordinary differential equations.} Here I first discuss methods that work only for independent actions.

\subsection{Iterations of independent actions}
Imagine that we have a matrix or data frame and we want to do the same thing to each column (or row). For this we use \texttt{apply}, to ``apply'' a function to each column (or row). We tell \texttt{apply} what data we want to use, we tell it the ``margin'' we want to focus on, and then we tell it the function. The \emph{margin} is the \emph{side} of the matrix. We describe matrices by their number of rows, then columns, as in ``a 2 by 5 matrix,'' so rows constitute the first margin, and columns constitute the second margin. Here we create a $2 \times 5$ matrix, and take the mean of rows, for the first margin. Then we sum the columns for the second margin.
<<>>=
m <- matrix( 1:10, nrow=2)
m
apply(m, MARGIN=1, mean)
apply(m, MARGIN=2, sum)
@
See \texttt{?rowMeans} for simple, and even faster, operations. Similarly, \texttt{lapply} will ``apply'' a function to each element of a list, or each column of a data frame, and always returns a list. \texttt{sapply} does something similar, but will simplify the result to a less complex data structure if possible. Here we do an independent operation 10 times using {\tt sapply}, defining a function on-the-fly to calculate the mean of a random draw of five observations from the standard normal distribution.
<<>>=
sapply(1:10, function(i) mean( rnorm(5) ) )
@

\subsection{Dependent iterations}
Often the repeated actions depend on previous outcomes, as with population growth. Here we provide an example where we accomplish this with a \emph{for loop}. One thing to keep in mind for \emph{for loops} in \R: computation is fastest if we first make a holder for the output. Here I simulate a random walk, where, for instance, we start with 25 individuals at time = 0, and increase or decrease by some amount that is drawn randomly from a normal distribution, with a mean of zero and a standard deviation of 2. We will round the ``amount'' to the nearest integer (the zero-th decimal place). Your output will differ because it is a random process.
<<>>=
gens <- 10
output <- numeric(gens + 1)
output[1] <- 25
for( t in 1:gens ) output[t + 1] <- output[t] +
    round( rnorm(n=1, mean=0, sd=2), 0)
output
@

\section{Rearranging and Aggregating Data Frames}
\subsection{Rearranging or reshaping data}
We often need to rearrange our data. A common example in ecology is to collect repeated measurements of an experimental unit and enter the data into multiple columns of a spreadsheet, creating a \emph{wide} format. \R~prefers to analyze data in a single column, in a \emph{long} format. Here we use {\tt reshape} to rearrange this. These data are carbon dioxide uptake in 12 individual plants. They are currently structured as longitudinal data; here we rearrange them in the wide format, as if we had recorded seven sequential uptake observations on each plant in different columns. See {\tt ?reshape} for details. Here {\tt v.names} refers to the column name of the response variable, {\tt idvar} refers to the column name for the variable that identifies an individual on which we have repeated measurements, and {\tt timevar} refers to the column name which identifies different observations of {\em the same individual plant}.
<<>>=
summary(CO2)
CO2.wide <- reshape(CO2, v.names="uptake", idvar="Plant",
                    timevar="conc", direction="wide")
names(CO2.wide)
@
This is often how we might record data, with an experimental unit (individual, or plot) occupying a single row. If we import the data in this format, we would typically like to reorganize it in the long format, because most analyses we want to do require this. Here, {\tt v.names} and {\tt timevar} are the names we want to use for some new columns, for the response variable and the identifier of the repeated measurement (typically the latter may be a time interval, but here it is a CO$_2$ concentration). {\tt times} supplies the identifier for each repeated observation.
<<>>=
CO2.long <- reshape(CO2.wide, v.names="Uptake",
                    varying=list(4:10), timevar="Concentration",
                    times=c(95, 175, 250, 350, 500, 675, 1000) )
head(CO2.long)
@
If we wanted to, we could use {\tt order()} to re-sort the data frame, for instance to match the original.
<<>>=
CO2.long2 <- with( CO2.long, CO2.long[order(Plant, Concentration),])
head(CO2.long2)
@
See also the very simple functions {\tt stack} and {\tt unstack}.

\subsection{Summarizing by groups}
We often want to summarize a column of data \emph{by groups} identified in another column. Here I summarize CO$_2$ uptake by the means of each experimental treatment, chilled vs.\ nonchilled. The code below provides the column to be summarized ({\tt uptake}), a vector (or list of vectors) containing the group id's, and the function to use to summarize each subset (means). We calculate the mean CO$_2$ uptake for each group.
<<>>=
tapply(CO2[["uptake"]], list(CO2[["Treatment"]]), mean)
@
We can get fancier, as well, with combinations of groups, here finding the standard deviation for each combination of Type and Treatment.
<<>>=
tapply(CO2[["uptake"]], list(CO2[["Treatment"]], CO2[["Type"]]), sd )
@
We can also define a function on-the-fly to calculate both the mean and standard deviation of each Treatment and Type combination.
<<>>=
tapply(CO2[["uptake"]], list(CO2[["Treatment"]], CO2[["Type"]]),
       function(x) c(mean(x), sd(x)) )
@
See also \texttt{by}, which uses {\tt tapply} to operate on data frames. When we summarize data, as in \texttt{tapply}, we often want the result in a nice neat data frame. The function \texttt{aggregate} does this.
Its use is a bit like {\tt tapply} --- you provide (i) the numeric columns of a data frame, or a matrix, (ii) a list of named factors by which to organize the responses, and then (iii) the function to summarize (or aggregate) the data. Here we summarize both concentration and uptake.
<<>>=
aggregate(CO2[,4:5], list(Plant = CO2[["Plant"]]), mean)
@
A separate package entitled {\tt reshape} supplies some very elegant and intuitive approaches to the sorting, reshaping and aggregating of data frames. I typically use the \texttt{reshape} \emph{package} (with functions \texttt{melt} and \texttt{cast}), rather than the \texttt{reshape} function supplied in the \texttt{stats} package. I do so merely because I find it a little more intuitive. \R~also has strong connections to relational database systems such as MySQL.

\section{Getting Data out of and into the Workspace}
We often want to get data into \R, and we sometimes want to get it out, as well. Here we start with the latter (referred to as \emph{writing} data), and finish with the former (referred to as \emph{reading} data). Here I create a data frame of numbers, and write it to a text file in two different formats. The first is a file where the observations in each row are separated by spaces, and the second separates them by commas.
<<>>=
dat <- data.frame(Name=rep(c("Control","Treatment"), each=5),
                  First=runif(10), Second=rnorm(1))
write.table(dat, file="dat.txt")
write.csv(dat, file="dat.csv")
@
Open these in a spreadsheet such as Calc (in OpenOffice and NeoOffice). We can then read these into \R~using the \texttt{read.*} family of functions.
<<>>=
dat.new <- read.csv("dat.csv")
dat.new2 <- read.table("dat.txt", header=TRUE)
@
These objects will both be data frames. Now let's get a statistical summary and export that.
<<>>=
mod.out <- summary( aov(First ~ Name, data=dat))
mod.out[[1]]
write.csv(mod.out[[1]], "ModelANOVA.csv")
@
Open this in a spreadsheet, such as Calc, in OpenOffice, or in any other application. See also the \texttt{xtable} package for making tables in \LaTeX{} or HTML formats.

\section{Probability Distributions and Randomization}
\R~has a variety of probability distributions built-in. For the normal distribution, for instance, there are four functions:
\begin{description}
\item[\texttt{dnorm}] The probability density function, which creates the widely observed bell-shaped curve.
\item[\texttt{pnorm}] The cumulative distribution function, which we usually use to describe the probability that a random observation is less than or equal to a particular value.
\item[\texttt{qnorm}] The quantile function, which takes probabilities as input and returns quantiles.
\item[\texttt{rnorm}] A random number generator which draws values (quantiles) from a distribution with a specified mean and standard deviation.
\end{description}
For each of these, default parameter values return the standard normal distribution ($\mu=0$, $\sigma=1$), but these parameters can be changed. Here we find the quantiles that bound the central 95\% of the distribution.
<<>>=
qnorm(p=c(0.025,0.975))
@
Next we create a histogram using 20 random draws from a normal distribution with a mean of 11 and a standard deviation of 6; we overlay this with the probability density function (Fig. \ref{fig:myhist}).
<<myhist, fig=true>>=
myplot <- hist(rnorm(20, m=11, sd=6), probability=TRUE)
myplot
lines(myplot$mids, dnorm(myplot$mids, m=11, sd=6) )
@
\begin{figure}[ht]
\centering
\includegraphics[width=5cm]{myhist.pdf}
\caption{Histogram of random numbers drawn from a normal distribution with $\mu=11$ and $\sigma=6$.
The normal probability density function is drawn as well.}
\label{fig:myhist}
\end{figure}

\section{Numerical integration of ordinary differential equations}
\label{sec:numer-integr-ordin}
In order to study continuous population dynamics, we often would like to integrate complex nonlinear functions of population dynamics. To do this, we need to use numerical techniques that turn the infinitely small steps of calculus, $\D x$, into very small, but finite steps, in order to approximate the change in $y$, given the change in $x$, or $\D y /\D x$. Mathematicians and computer scientists have devised very clever ways of doing this very accurately and precisely. In \R, the best package for this is \texttt{deSolve}, which contains several \emph{solvers} for differential equations that perform numerical integration. We will access these solvers (i.e. numerical integrators) using the \texttt{ode} function in the \texttt{deSolve} package. This function, \texttt{ode}, is a ``wrapper'' for the underlying suite of functions that do the work. That is, it provides a simple way to use any one of the small suite of functions.

When we have an ordinary differential equation (ODE) such as logistic growth,\footnote{e.g., $dN/dt = rN(1-\alpha N)$} we say that we ``solve'' the equation for a particular time interval given a set of parameters and initial conditions or initial population size. For instance, we say that we solve the logistic growth model for time at $t=0,\, 1 \ldots \, 20$, with parameters $r=1$, $\alpha=0.001$, and $N_0=10$.

Let's do an example with \texttt{ode}, using logistic growth. We first have to define a function in a particular way. The arguments for the function must be time, a vector of populations, and a vector or list of model parameters.
<<>>=
logGrowth <- function(t, y, p){
  N <- y[1]
  with(as.list(p), {
    dN.dt <- r * N * (1 - a * N)
    return( list( dN.dt ) )
  } )
}
@
Note that I like to convert $y$ into a readable or transparent state variable ($N$ in this case). I also like to use \texttt{with}, which allows me to use the names of my parameters \cite{Petzoldt:2003dp}; this works only if \texttt{p} is a vector with named parameters (see below). Finally, we return the derivative as a list of one component. The following is equivalent, but slightly less readable or transparent.
<<>>=
logGrowth <- function(t, y, p){
  dN.dt <- p[1] * y[1] * (1 - p[2] * y[1])
  return( list( dN.dt ) )
}
@
To solve the ODE, we will need to specify parameters and initial conditions. Because we are using a vector of named parameters, we need to make sure we name them! We also need to supply the time steps we want.
<<>>=
p <- c(r=1, a = 0.001)
y0 <- c(N=10)
t <- 1:20
@
Now you put it all into \texttt{ode}, with the correct arguments. The output is a matrix, with the first column being the time steps, and the remaining being your state variables. First we load the \texttt{deSolve} package.
<<>>=
library(deSolve)
out <- ode(y=y0, times=t, func=logGrowth, parms=p)
out[1:5,]
@
If you are going to model more than one species, \texttt{y} becomes a vector of length 2 or more.
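Before we add more species, it can help to see the solution we just generated; this quick sketch (figure not shown) plots the second column of \texttt{out}, the population size $N$, against the first column, time.
<<eval=false>>=
plot(out[,1], out[,2], type='l', xlab="time", ylab="N")
# N rises from N0 = 10 toward the carrying capacity, 1/a = 1000
@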
Here we create a function for Lotka-Volterra competition, where
\begin{align}
\label{eq:1}
\frac{\D N_1}{\D t} &= r_1N_1\left(1 - \alpha_{11}N_1 - \alpha_{12}N_2\right)\\
\frac{\D N_2}{\D t} &= r_2N_2\left(1 - \alpha_{22}N_2 - \alpha_{21}N_1\right)
\end{align}
<<>>=
LVComp <- function(t, y, p){
  N <- y
  with(as.list(p), {
    dN1.dt <- r[1] * N[1] * (1 - a[1,1]*N[1] - a[1,2]*N[2])
    dN2.dt <- r[2] * N[2] * (1 - a[2,1]*N[1] - a[2,2]*N[2])
    return( list( c(dN1.dt, dN2.dt) ) )
  } )
}
@
Note that \texttt{LVComp} assumes that $N$ and $r$ are vectors, and the competition coefficients are in a matrix. For instance, the function extracts the first element of \texttt{r} for the first species (\texttt{r[1]}); for the intraspecific competition coefficient for species 1, it uses the element of \texttt{a} that is in the first column and first row (\texttt{a[1,1]}). The vector of population sizes, $N$, contains one value for each population \emph{at one time point}. Thus here, the vector contains only two elements (one for each of the two species); it holds only these values, but will do so repeatedly, at each time point. Only the output will contain all of the population sizes through time.

To integrate these populations, we need to specify new initial conditions and new parameters for the two-species model.
<<>>=
a <- matrix(c(0.02, 0.01, 0.01, 0.03), nrow=2)
r <- c(1,1)
p2 <- list(r=r, a=a)
N0 <- c(10,10)
t2 <- c(1,5,10,20)
out <- ode(y=N0, times=t2, func=LVComp, parms=p2)
out[1:4,]
@
The \texttt{ode} function uses a superb ODE solver, \texttt{lsoda}, which is a very powerful, well tested tool, superior to many other such solvers. In addition, it has several bells and whistles that we will not need to take advantage of here, although I will mention one, \texttt{hmax}. This tells \texttt{lsoda} the largest step it can take. Once in a great while, with a very \emph{stiff} ODE (a very wiggly complex dynamic), the solver assumes it can take a bigger step than it should. Setting \texttt{hmax} to a smallish number will limit the size of the step to ensure that the integration proceeds as it should.

One of the other solvers in the \texttt{deSolve} package, \texttt{lsodar}, will also return roots (or equilibria) for a system of ODEs, if they exist. Here we find the roots (i.e. the solutions, or equilibria) for a two-species enemy-victim model.
<<>>=
EV <- function(t, y, p){
  with(as.list(p), {
    dv.dt <- b * y[1]*(1 - .005*y[1]) - a*y[1]*y[2]
    de.dt <- a*e*y[1]*y[2] - s*y[2]
    return( list( c(dv.dt, de.dt) ) )
  } )
}
@
To use \texttt{lsodar} to find equilibria, we need to specify a root-finding function whose inputs are the same as those of the ODE function, and which returns a scalar (a single number) that determines whether the rate of change ($\D y/\D t$) is sufficiently close to zero that we can say that the system has stopped changing, that is, has reached a steady state or equilibrium. Here we sum the absolute rates of change of each species, and then subtract 10$^{-10}$; if that difference is zero, we decide that, for all practical purposes, the system has stopped changing.
<<>>=
rootfun <- function (t,y,p) {
  dstate <- unlist(EV(t,y,p)) # rate of change vector
  return(sum(abs(dstate)) - 1e-10)
}
@
Note that \texttt{unlist} changes the \texttt{list} returned by \texttt{EV} into a simple vector, which can then be summed. Next we specify parameters and time. Here all we want is the root, so we specify that we want the value of $y$ after a really long time ($t=10^{10}$).
The \texttt{lsodar} function will stop sooner than that, and return the equilibrium it finds, and the time step at which it occurred.
<<>>=
p <- c(b = 0.5, a = 0.02, e=0.1, s=0.2)
t <- c(0,1e10)
@
Now we run the function.
<<>>=
out <- ode(y=c(45,200), t, EV, parms=p, rootfun=rootfun,
           method="lsodar")
out[,]
@
Here we see that the steady state population sizes are $V=100$ and $E=12.5$, and that given our starting point, this steady state was achieved at $t=500.8$. Other information is available; see \texttt{?lsodar} after loading the \texttt{deSolve} package.

\section{Numerical Optimization}
\label{sec:numer-optim}
We frequently have a function or a model that we think can describe a pattern or process, but we need to ``play around with'' the numerical values of the constants in order to make the right shape with our function/model. That is, we need to find the value of the constant (or constants) that create the ``best'' representation of our data. This problem is known as \emph{optimization}. Optimization is an entire scientific discipline (or two). It boils down to quickly and efficiently finding parameters (i.e. constants) that meet our criteria. This is what we are doing when we ``do'' statistics. We fit models to data by telling the computer the structure of the model, and asking it to find values of the constants that minimize the residual error.

Once you have a model of the reality you want to describe, the basic steps toward optimization we consider are (i) create an \emph{objective function}, (ii) use a routine to \emph{minimize} (or \emph{maximize}) the objective function through optimal choice of parameter values, and (iii) see if the ``optimal'' parameter values make sense, and perhaps refine and interpret them.

An \emph{objective function} compares the data to the predicted values from the model, and returns a quantitative measure of their difference. One widely used objective function is the \emph{least-squares criterion}; that is, the objective function is the average or the sum of the squared deviations between the model values and the data --- just as in a simple ANOVA. An optimization routine then tries to find model parameters that minimize this criterion.

Another widely used objective function is the likelihood function, or \emph{maximum likelihood}. The likelihood function uses a probability distribution of our choice (often the normal distribution). The objective function then calculates the collective probability of observing those data, given the parameters and fit of the model. In other words, we pretend that the model and the predicted values are true, measure how far off each datum is from the predicted value, and then use a probability distribution to calculate the probability of seeing each datum. It then multiplies all those probabilities to get the \emph{likelihood} of observing those data, given the selected parameters. An optimization routine then tries to find model parameters that maximize this likelihood. In practice, it is more computationally stable to calculate the negative of the sum of the logarithms of the probabilities, and try to minimize that quantity, rather than maximize the likelihood --- but in principle they are the same thing.

Once we have an objective function, an optimization routine makes educated guesses regarding good values for the parameters, until it finds the best values it can, those which minimize the objective function. There are a great variety of optimization routines, and they all have their strengths.
One important tradeoff they exhibit is that the fastest and most accurate methods are sometimes the least able to handle difficult data \cite{Bolker:2008rr}. Below, we rely on a combination to take advantage of the strengths of each type.

Here we introduce two of \R's general purpose optimization functions, \texttt{optimize} and \texttt{optim}, and another, in the \texttt{bbmle} package, \texttt{mle2} \cite{Bolker:2008rr}. The function \texttt{optimize} should be used where we are in search of one parameter; use others when more than one parameter is being optimized. There are many other optimization routines in \R, but we start here\footnote{Indeed, all statistical models are fancy optimization routines.}.

Here we start with one of \R's general optimization functions, the one designed for finding a single parameter. Let us find the mean ($\bar{x}$) of some data through optimization. We will start with data, \texttt{y}, and let our conceptual model of that data be $\mu$, the mean. We then create an \emph{objective function} whose output will get smaller as the parameter of our model approaches the value we want. Roughly speaking, the mean is the value that minimizes the total squared difference between all the data and itself. We will use the least-squares criterion, where the sum of all the squared deviations reaches a minimum when $\mu$ approaches the mean.
<<>>=
y <- c(1, 0:10)
f <- function(y, mu) { sum( (y-mu)^2 ) }
@
Our function, \texttt{f}, subtracts $\mu$ from each value of $y$, squares each of these differences, and then sums these squared differences, to get the sum of squares. Our goal is to minimize this. If we guessed at it by hand, we would get this (Fig. \ref{fig:LS}).
<<LS, fig=true>>=
guesses <- seq(4, 6, by=.05)
LS.criterion <- sapply(guesses, function(mu) f(mu=mu, y=y))
plot(guesses, LS.criterion, type='l')
@
\begin{figure}[ht]
\centering
\includegraphics[width=8cm]{LS}
\caption{Illustration of the least squares criterion. Our objective function returns (i.e. generates) the squared deviations between the fitted model and the data. Optimization minimizes the criterion (``LS.criterion'') and thereby finds the right guess (x axis).}
\label{fig:LS}
\end{figure}
Fig. \ref{fig:LS} shows us that the minimum of the objective function occurs when \texttt{mu} is a little over 4.5. Now let's let \R~minimize our least squared deviations. With \texttt{optimize}, we provide the function first, we then provide a range of possible values for the parameter of interest, and then give it the values of parameters or data used by the function, other than the parameter we want to fit.
<<>>=
(results <- optimize(f, c(0,10), y=y))
@
We see that \texttt{optimize} returns two components in a list. The first is called \texttt{minimum}, which is the parameter value that causes our function \texttt{f} to be at a minimum. The second component, \texttt{objective}, is the value of {\tt f} when \Sexpr{paste('mu =', round(results[[1]],3))}.

Next we demonstrate \texttt{mle2}, a function for maximum likelihood estimation. Maximum likelihood relies on probability distributions to find the probability of observing a particular data set, assuming the model is correct. This class of optimization routines finds the parameters that maximize that probability. Let us solve the same problem as above. For the same data, \texttt{y}, we create a maximum likelihood function to calculate the mean.
In maximum likelihood, we actually minimize the negative logarithm of the likelihood because it is more computationally stable --- the same parameters that minimize the negative log-likelihood also maximize the likelihood. We assume that the data are normally distributed, so the probabilities derive from the normal probability density function.
<<>>=
LL <- function(mu, SD) { -sum( dnorm(y, mean = mu, sd = SD, log = TRUE) ) }
@
This objective function calculates the negative logarithm of the probability density of each datum, given a particular mean and standard deviation, \texttt{mu} and \texttt{SD}. The optimization routine, \texttt{mle2}, then finds the \texttt{mu} and \texttt{SD} that minimize the negative log-likelihood of those data.
<<>>=
library(bbmle)
(fit <- mle2(LL, start = list(mu = 5, SD = 1), control = list(maxit = 10^5)))
@
Another way to put this objective function into \texttt{mle2} is with its formula interface.
<<>>=
mle2(y ~ dnorm(mu, sd = SD), start = list(mu = 1, SD = 2))
@
We can examine the fit more closely, including the probabilities associated with the profile confidence intervals (Fig. \ref{fig:pci}).
<<>>=
summary(fit)
pr <- profile(fit)
@
<<profile,fig=true>>=
par(mar = c(5, 4, 3, 2))
plot(pr)
@
\begin{figure}[ht]
  \centering
  \includegraphics[width=\linewidth]{profile}
  \caption{Profile confidence intervals for various limits, based on \texttt{mle2}.}
  \label{fig:pci}
\end{figure}

Often we have reason to limit parameters to particular bounds. Most commonly, we need to ensure that a parameter is greater than zero, or less than zero, or (less often) between zero and one. Sometimes we have a rationale based on physical or biological constraints that will limit a parameter within particular values. To constrain parameters, we could use a routine that applies constraints directly (see particular optimization methods under \texttt{mle2}, \texttt{nlminb}, and \texttt{optim}). We could also transform the parameters, so that the optimizer works with the transformed version, while the model (for instance, an ODE model) uses the parameters in their original units. For example, we could let the optimizer find the best value of the logarithm of our parameter: the optimizer may then propose any value at all, while the parameter itself remains positive. Another benefit of using logarithms rather than the original scale is computational --- very large and very small numbers are represented more gracefully. An example helps make this clear; see the extended example in Chap.~6, on disease models.
@

\section{Derivatives}
We can use \texttt{deriv} and \texttt{D} to have \R~provide derivatives for us. We supply an expression and the variable to differentiate with respect to; \texttt{D} returns the derivative.
<<>>=
host1 <- expression(R * H * (1 + a * P)^-k)
D(host1, "H")
@
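Whereas \texttt{D} returns the derivative as an expression, \texttt{deriv} can return a function that evaluates both the expression and its gradient. A minimal sketch follows; the argument values are our own, purely for illustration.
<<>>=
## deriv builds a function whose value is host1 evaluated at the arguments,
## with the gradient with respect to H attached as an attribute
host1.grad <- deriv(host1, "H", function.arg = c("R", "H", "a", "P", "k"))
host1.grad(R = 3, H = 10, a = 0.01, P = 20, k = 0.8)
@

\section{Graphics}
\R~is well known for its graphics capabilities, and entire books have been written on the subject(s). For beginners, however, \R~can be frustrating when compared to the point-and-click systems of most graphics ``packages.'' This frustration derives from two issues: first, \R's graphics have a learning curve, and second, \R~requires us to type in, or code, our specifications. The upside is that \R~offers nearly infinite flexibility and total replicability, so that we get exactly the right figure, and the same figure, every time we run the same code.

\subsection{\texttt{plot}}
The most used graphics function is \texttt{plot}. Here I demonstrate several uses. First let's just create the simplest scatterplot (Fig. \ref{fig:eg1}).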
<<Myplot,fig=true>>=
data(trees); attach(trees)
plot(Girth, Height)
@
\begin{figure}[ht]
  \centering
  \subfloat[Simple]{\includegraphics[width=.48\linewidth]{Myplot.pdf}\label{fig:eg1}}
  \subfloat[With Colors]{\includegraphics[width=.48\linewidth]{MyTrees1.pdf}\label{fig:eg2}}\\
  \subfloat[Two Y vars.]{\includegraphics[width=.48\linewidth]{Mymatplot.pdf}\label{fig:eg3}}
  \subfloat[Two Y axes]{\includegraphics[width=.48\linewidth]{MyTrees2Y.pdf}\label{fig:eg4}}
  \caption[Examples of graphics]{See code for graphics parameters used to generate these plots. Fig. \subref{fig:eg2} uses an alternate color specification, hcl (hue, chroma, luminance), which is adjusted for human perception.}
  \label{fig:eg}
\end{figure}
To this we can add a huge variety of variation, using arguments to \texttt{plot}.
\begin{table}[ht]
  \caption{Commonly used arguments to \texttt{plot}. See help pages at \texttt{?plot} and \texttt{?plot.default} for more information.}
  \centering
  \begin{tabular}{lp{9cm}}
    \hline
    Argument & Meaning \\
    \hline
    \texttt{type} & Determines the type of X-Y plot, for example {\bf p}, {\bf l}, {\bf s}, and {\bf n}, for points, lines, stair-step, and none, respectively. ``None'' is useful for setting up a plotting region upon which to elaborate (see example below). Defaults to {\bf p}; see \texttt{?plot.default} for other types. \\
    \texttt{axes} & Indicates whether to plot the axes; defaults to TRUE. Useful if you want more control over each axis by using the \texttt{axis} function separately (see below).\\
    \texttt{pch} & Point character (numeric value, 0--25). This can be a single value for an entire plot, or take on a unique value for each point, or anything in between. Defaults to 1. \emph{To add text more than one character in length, for instance a species name, we can easily add text to a plot at each point (see the next section)}.\\
    \texttt{lty} & Line type, such as solid (1), dashed (2), etc. Defaults to 1.\\
    \texttt{lwd} & Line width (numeric value, usually 0.5--3; default is 1). \\
    \texttt{col} & Color; can be specified by number (e.g., 2) or character (e.g., ``red''). Defaults to 1 (``black''). \R~has tremendous options for color; see \texttt{?hcl}.\\
    \texttt{main, ylab, xlab~} & Text for the main title and axis labels. \\
    \texttt{xlim, ylim} & Limits for the x and y axes; e.g., \texttt{ylim=c(0, 1.5)} sets the limits for the y-axis at zero and 1.5. Defaults are calculated from the data. \\
    \texttt{log} & Indicates which axes should use a logarithmic scale; e.g., \texttt{log = `xy'} causes both axes to use logarithmic scales.\\
    \hline
  \end{tabular}
\end{table}
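To get a feel for how these arguments combine, here is a throwaway example of ours (not one of the book's figures) that uses several of them at once.
<<eval=false>>=
## several arguments from the table, combined in one call
plot(Girth, Height, type = "p", pch = 19, col = "blue",
     main = "Black Cherry Trees", xlab = "Girth (in)", ylab = "Height (ft)",
     xlim = c(8, 22), log = "y")
@

\subsection{Adding points, lines and text to a plot}
After we have started a plot, we may want to add more data or information. Here we set up a new graph without plotting points, and then add light grey vertical lines at each point.
<<>>=
par(mar = c(5, 4, 3, 2))
plot(Girth, Volume, type = 'n', main = "My Trees")
points(Girth, Volume, type = 'h', col = "lightgrey", pch = 19)
@
Now we want to add points for these data, using the tree heights as the plotting symbols. We are going to use an alternate coloring system, designed with human perception in mind (\texttt{hcl}). We scale the colors so that the hue varies between 30 and 300, depending on the height of the tree; I make the symbols slightly transparent (90\% opaque) so they remain legible where they overlap. I also let the size of the numbers vary with height ({\tt cex = 0.5 + hts}). Last, we add the heights themselves as text labels (Fig. \ref{fig:eg2}).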
<<>>=
hts <- (Height - min(Height)) / max(Height - min(Height))
my.colors <- hcl(h = 30 + 270 * hts, alpha = 0.90)
text(Girth, Volume, Height, col = my.colors, cex = 0.5 + hts)
@
<<MyTrees1,fig=true>>=
par(mar = c(5, 4, 3, 2))
plot(Girth, Volume, type = 'n', main = "My Trees")
points(Girth, Volume, type = 'h', col = "lightgrey", pch = 19)
hts <- (Height - min(Height)) / max(Height - min(Height))
my.colors <- hcl(h = 30 + 270 * hts, l = 50, alpha = 0.90)
text(Girth, Volume, Height, col = my.colors, cex = 0.5 + hts)
@
\subsection{More than one response variable}
We often plot more than one response variable on a single axis. We could use {\tt lines} or {\tt points} to add each additional variable, or we could use {\tt matplot} to plot a matrix of variables {\em vs.}\ one predictor (Fig. \ref{fig:eg3}).
<<Mymatplot,fig=true>>=
trees.sort <- trees[order(trees$Girth, trees$Height), ]
matplot(trees.sort$Girth, trees.sort[, 2:3], type = 'b')
text(18, 40, "Volume", col = "darkred")
text(10, 58, "Height")
@
We frequently want to add a second y-axis, with a different scale, to a graph (Fig. \ref{fig:eg4}). The trick we use here is to draw the first plot, and then tell \R~to treat the next command ``\ldots \emph{as if it was on a new device}''\footnote{From the \texttt{par} help page.} while it really is not: we overlay what we just drew with new material, without clearing the old. For our example, let's start with just X and our first Y. Note that we also specify extra margin room on the right hand side, preparing for the second Y axis.
<<>>=
quartz(, 4, 4)
par(mar = c(5, 4, 2, 4))
plot(Girth, Volume, main = "My Trees")
@
Now we try our trick. We draw a new plot ``as if'' it were a new graph, using the same X values and the new Y data; we suppress the axes and labels, and use a different plotting character for clarity.
<<>>=
par(new = TRUE)
plot(Girth, Height, axes = FALSE, bty = "n", xlab = "", ylab = "", pch = 3)
@
Now we put the new Y axis on the fourth side, the right hand side, and add a Y axis label using the function for \emph{marginal text} (Fig. \ref{fig:eg4}).
<<>>=
axis(4)
mtext("Height", side = 4, line = 3)
@
<<MyTrees2Y,fig=true>>=
par(mar = c(5, 4, 2, 4))
plot(Girth, Volume, main = "My Trees")
par(new = TRUE)
plot(Girth, Height, axes = FALSE, bty = "n", xlab = "", ylab = "", pch = 3)
axis(4)
mtext("Height", side = 4, line = 3)
@
\subsection{Controlling Graphics Devices}
When we make a graph with \texttt{plot}, or another function, \R~will typically open a graphics window on the computer screen automatically; if we desire more control, we can use several functions to be more deliberate. We create new graphics ``devices,'' or graphs, in several ways, including with the functions \texttt{windows()} (Microsoft Windows), \texttt{quartz()} (Mac OS), and \texttt{x11()} (the X11 Window System). For instance, to open a ``graphics device'' on a Mac that is 5 inches wide and 3 inches tall, we write
<<>>=
quartz(width = 5, height = 3)
@
To do the same thing on a computer running Windows, we type
<<eval=false>>=
windows(width = 5, height = 3)
@
To control the \emph{par}ameters of the graph --- that is, what it looks like, aside from the data --- we use arguments to the \texttt{par} function. Many of these arguments refer to the \emph{sides} of the graph. These are numbered 1--4 for the bottom X axis, the left side Y axis, the top, and the right side Y axis, respectively. Arguments to \texttt{par} are many (see \texttt{?par}), and include the following.
\begin{description}
\item[\texttt{mar}] controls the width of the margins on each side; units are lines of text; defaults to \texttt{c(5, 4, 4, 2) + 0.1}, so the bottom has the most room, and the right hand side has the least.
\item[\texttt{mgp}] controls the spacing of the axis title, labels, and the axis line itself; units are lines of text; defaults to \texttt{c(3, 1, 0)}, so the axis title sits three lines away from the edge of the plotting region, the axis labels one line away, and the axis line at the edge of the plotting region.
\item[\texttt{tcl}] tick length, as a fraction of the height of a line of text; negative values put the tick marks outside the plotting region, positive values put them inside. Defaults to $-0.5$.
\end{description}
We can build each side of the graph separately by initiating a graph without plotting the axes (\texttt{plot(..., axes = FALSE)}), and then adding each axis separately; for instance, \texttt{axis(1)} adds the bottom axis.

Last, we can use \texttt{layout} to make a graph with several smaller subgraphs (see also the \texttt{mfrow} and \texttt{mfcol} arguments to \texttt{par}, and the function \texttt{split.screen}). The function \texttt{layout} takes a matrix as its argument; the matrix contains a sequence of numbers that tells \R~how to fill the regions of the grid. A graph can fill more than one of these regions if the same number appears in each. Here we create a compound graphic organized on a $2 \times 2$ grid; it has two rows, and is filled in by rows. The first graph sits in the upper left, the second in the upper right, and the third fills the third and fourth spots, i.e. the entire second row. We fill each region with a slightly different plot of the same data (Fig. \ref{fig:par}).
<<>>=
quartz(, 5, 5)
layout( matrix( c(1, 2, 3, 3), nrow = 2, byrow = TRUE) )
plot(Girth, Height)
@
Now we add the second and third plots, each with different settings.
<<>>=
par(mar = c(3, 3, 1, 1), mgp = c(1.6, .2, 0), tcl = .2)
plot(Girth, Height)
par(mar = c(3, 3, 2, 1), mgp = c(1.6, .2, 0), tcl = .2)
plot(Girth, Height, axes = FALSE, xlim = c(8, 22))
axis(1, tcl = -.3)
axis(2, tick = FALSE)
rug(Height, side = 2, col = 2)
title("A Third, Very Wide, Plot")
@
<<echo=false>>=
dev.print(pdf, 'MyTrees4.pdf')
@
\begin{figure}[ht]
  \centering
  \includegraphics[width=12cm]{MyTrees4.pdf}
  \caption{A variety of examples with different graphics \emph{par}ameters.}
  \label{fig:par}
\end{figure}

\subsection{Creating a Graphics File}
Now that you have made this beautiful thing, I suppose you would like to stick it into a manuscript. One way to get graphics out of \R~and into something else (presentation software, a manuscript) is to create a graphics device, and then save it with \texttt{dev.print} in a format that you like, such as PDF, postscript, PNG, or JPEG. For instance, we might do this to save a graphics file in our working directory.
<<eval=false>>=
getwd()
quartz(, 4, 4)
plot(Height, Volume, main = "Tree Data")
dev.print(pdf, "MyTree.pdf")
@
This should save a small PDF figure in your current working directory, the directory returned by {\tt getwd}.
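An alternative to \texttt{dev.print} is to open a file-based device directly, draw into it, and then close it; nothing appears on screen. Here is a minimal sketch of ours (the file name is hypothetical).
<<eval=false>>=
## open a PDF file as the graphics device (hypothetical file name),
## draw into it, then close the device with dev.off()
pdf("MyTree2.pdf", width = 4, height = 4)
plot(Height, Volume, main = "Tree Data")
dev.off()
@
\emph{You will have to find your own way to make graphics files that suits your operating system, your preferred applications, and your personality.}

\section{Graphical displays that show distributions}
Here we take a quick look at ways to reveal the distributions of data. First, two views to see in the Console: a six-number summary of the quantiles and the mean, and the good ol' stem-and-leaf plot, a favorite of computational botanists everywhere.
<<>>=
summary(Girth)
stem(Girth)
@
Here we will create four plots revealing different ways to look at your data, each with a couple of bells and whistles. For kicks, we put them into a single compound figure, in a ``layout'' composed of a matrix of graphs (Fig. \ref{fig:dists}).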
<<>>=
quartz(, 6, 7)
@
<<>>=
layout( matrix( c(1, 2, 2, 3, 4, 4), nrow = 2, byrow = TRUE ) )
plot(1:length(Girth), Girth, xlab = "Order of Sample Collection?")
hist(Girth, prob = TRUE)
rug(Girth)
lines(density(Girth))
boxplot(Girth, main = "Boxplot of Girth")
points(jitter( rep(1, length(Girth)) ), Girth)
qqnorm(log(Girth))
qqline(log(Girth))
title(sub = "Log transformed data")
@
<<echo=false>>=
dev.print(pdf, "Distributions.pdf")
@
\begin{figure}[ht]
  \centering
  \includegraphics[width=\textwidth]{Distributions}
  \caption{Examples of ways to look at the distribution of your data. See \texttt{?hist}, for example, for more information.}
  \label{fig:dists}
\end{figure}

\section{Eigenanalysis}
Performing eigenanalysis in \R~is easy. We use the \texttt{eigen} function, which returns a list with two named components. The first is a vector of eigenvalues and the second is a matrix of the corresponding eigenvectors. These will be numeric if possible, or complex if any of the elements are complex numbers. Here we have a typical demographic stage matrix.
<<>>=
A <- matrix(c(0, .1, 10, .5), nrow = 2)
eig.A <- eigen(A)
str(eig.A)
@
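As a small sketch of how we might use these components (our own illustration): \texttt{eigen} sorts the eigenvalues by decreasing modulus, so here the first eigenvalue is the dominant one, $\lambda_1$, and rescaling the corresponding eigenvector to sum to one gives the stable stage distribution.
<<>>=
## dominant eigenvalue: the asymptotic finite rate of increase, lambda
eig.A$values[1]
## its eigenvector, rescaled to proportions: the stable stage distribution
eig.A$vectors[, 1] / sum(eig.A$vectors[, 1])
@
Singular value decomposition (SVD) is a generalization of eigenanalysis, and is used in \R~for some applications where eigenanalysis was used historically, but where SVD is more numerically accurate (e.g., \texttt{prcomp} for principal components analysis).

\section{Eigenanalysis of demographic versus Jacobian matrices}
Eigenanalyses of demographic and Jacobian matrices are worth comparing. In one sense, they have similar meanings --- they both describe the asymptotic (long-term) properties of a system, either population size (demographic matrix) or a perturbation at an equilibrium (Jacobian matrix). The quantitative interpretation of the eigenvalues differs, however, because the units of the matrix elements differ.

In the case of the stage- (or age-) structured demographic model, the elements of the demographic matrix are discrete per capita increments of change over a specified time interval. This is directly analogous to the finite rate of increase, $\lambda$, in discrete unstructured models. (Indeed, an unstructured discrete growth model is a stage-structured model with one stage.) Therefore, the eigenvalues of a demographic matrix have the same units --- a per capita increment of change. That is why the dominant eigenvalue has to be greater than 1.0 for the population to increase, and less than 1.0 (not merely less than zero) for the population to decline.

In the case of the Jacobian matrix, composed of partial derivatives of continuous-time growth equations, the elements are per capita \emph{instantaneous} rates of change, analogous to $r$. Therefore, values greater than zero indicate increases, and values less than zero indicate decreases. Because these rates are evaluated at an equilibrium, the equilibrium acts like a new zero --- positive values indicate growth away from the equilibrium, and negative values indicate shrinkage back toward the equilibrium. The change they describe is the instantaneous per capita rate of change of each population with respect to the others. The eigenvalues summarizing all of the elements of the Jacobian matrix must therefore be less than zero for a perturbation to decline.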
So, in summary, the elements of a demographic matrix are discrete increments over a real time interval. Therefore its eigenvalues represent relative per capita growth rates over a discrete time interval, and we interpret the eigenvalues with respect to 1.0. On the other hand, the elements of the Jacobian matrix are per capita \emph{instantaneous} rates of change evaluated at an equilibrium. Therefore its eigenvalues represent the per capita instantaneous rates of change of a tiny perturbation at the equilibrium, and we interpret the eigenvalues with respect to 0, which tells us whether the perturbation grows or shrinks.

\section{Symbols used in this book}
\index{Greek symbols}
I am convinced that one of the biggest hurdles to learning theoretical ecology --- and the one that is easiest to overcome --- is being able to ``read'' equations and hear them in your head. This requires being able to pronounce Greek symbols. Few of us learned how to pronounce ``$\alpha$'' in primary school. Therefore, I provide here an incomplete, simplistic, American English pronunciation guide for (some of) the rest of us, for the symbols in this book. Only a few are tricky, and different people will pronounce them differently.
\begin{table}[hb]
  \centering
  \caption{Symbols and their pronunciation; occasional usage applies to lowercase, unless otherwise specified. A few symbols have common variants. Any symbol might be part of any equation; ecologists frequently ascribe other meanings, which have to be defined each time they are used. See also http://en.wikipedia.org/wiki/Greek\_letters}
  \begin{tabular}[c]{llp{10cm}}
    \hline \hline
    Symbol~ & Spelling & Pronunciation; occasional or conventional usage \\
    \hline
    A, $\alpha$ & alpha & al'-fa; point or local diversity (or a parameter in the logseries abundance distribution)\\
    B, $\beta$ & beta & bay'-ta; turnover diversity \\
    $\mathrm{\Gamma}$, $\gamma$ & gamma & gam'-ma; regional diversity \\
    $\mathrm{\Delta}$, $\delta$, $\partial$ & delta & del'-ta; change or difference \\
    E, $\epsilon$, $\varepsilon$ & epsilon & ep'-si-lon; error \\
    $\mathrm{\Theta}$, $\theta$ & theta & thay'-ta (``th'' as in ``thanks''); in neutral theory, biodiversity \\
    $\mathrm{\Lambda}$, $\lambda$ & lambda & lam'-da; eigenvalues, and the finite rate of increase \\
    M, $\mu$ & mu & meeoo, myou; mean \\
    N, $\nu$ & nu & noo, nou \\
    $\mathrm{\Pi}$, $\pi$ & pi & pie; uppercase for the product of the elements of a vector \\
    P, $\rho$ & rho & row (as in ``row a boat''); correlation \\
    $\mathrm{\Sigma}$, $\sigma$, $\varsigma$ & sigma & sig'-ma; standard deviation (uppercase is used for summation)\\
    T, $\tau$ & tau & sounds like what you say when you stub your toe (``Ow!''), but with a ``t'' \\
    $\Phi$, $\phi$ & phi & fie, figh \\
    X, $\chi$ & chi & kie, kigh \\
    $\mathrm{\Psi}$, $\psi$ & psi & sie, sigh \\
    $\mathrm{\Omega}$, $\omega$ & omega & oh-may'-ga; degree of omnivory \\
    \hline
  \end{tabular}
\end{table}