Monday, January 23, 2017

Exploring and Manipulating Vector, Matrix, and Data Frame Types in R

In R, a vector can be created with the c() function, which combines its arguments into a vector (or a list). To explore it, we can create three vectors containing made-up polling results from two networks for a collection of political candidates: one vector of candidate names and one vector of results per network:
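A minimal sketch of what those three statements might look like. Only the Name vector and the candidate Jeb appear in the text; the vector names NetworkA and NetworkB, the placeholder candidates, and every number are assumptions made up for illustration.

```r
# Candidate names: a character vector.
# Only "Jeb" comes from the post; the other names are placeholders.
Name <- c("Jeb", "Candidate B", "Candidate C", "Candidate D")

# Made-up polling results from two networks: numeric vectors.
# NetworkA and NetworkB are hypothetical names.
NetworkA <- c(12, 25, 18, 30)
NetworkB <- c(10, 27, 20, 28)

# class() reports the data type that the Environment pane displays
class(Name)
class(NetworkA)
class(NetworkB)
```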



After executing these three statements we can look in the RStudio Environment pane and confirm that each vector has the intended data type (numeric or character).



Our intention was to create a vector, but can we confirm this? R includes a family of is.*() functions that test the object passed to them and return TRUE or FALSE, so we can try is.vector(), is.data.frame(), and is.list() to see what R returns for each test.
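The three tests can be sketched as follows. NetworkA is a hypothetical vector of made-up values standing in for one of the polling vectors.

```r
# Hypothetical numeric vector of made-up polling results
NetworkA <- c(12, 25, 18, 30)

# Each is.*() test returns a single logical value
is.vector(NetworkA)      # TRUE: c() produced a plain vector
is.data.frame(NetworkA)  # FALSE: it is not a data frame
is.list(NetworkA)        # FALSE: it is an atomic vector, not a list
```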


A large number of is.*() functions come with the base installation of R. The R console’s autocomplete can be used to scroll through them: type “is.” and pause in the console for a scroll window to appear:
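Outside of autocomplete, the same family can probably be listed from code. This is a sketch using apropos() from the utils package, which searches the currently attached packages by regular expression; the variable name is_funs is made up, and the exact list depends on which packages are loaded.

```r
# Find every visible object whose name starts with "is."
is_funs <- apropos("^is\\.")

# Peek at the first few names and count how many were found
head(is_funs)
length(is_funs)
```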

With these three vectors we can use the data.frame() function to create a single data frame that ties each network's polling results to the respective candidate.
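A sketch of that statement, assuming the hypothetical vectors below. The data frame name polling matches the name used later in the text; NetworkA, NetworkB, and all values are made up. stringsAsFactors = FALSE is set explicitly so that Name stays a character column on R versions older than 4.0, where the default differed.

```r
# Hypothetical input vectors (made-up values)
Name     <- c("Jeb", "Candidate B", "Candidate C", "Candidate D")
NetworkA <- c(12, 25, 18, 30)
NetworkB <- c(10, 27, 20, 28)

# Combine the three vectors column-wise into one data frame
polling <- data.frame(Name, NetworkA, NetworkB, stringsAsFactors = FALSE)

# str() gives a compact summary of rows, columns, and column types
str(polling)
```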


As with the vectors, we can quickly confirm that this statement succeeded in the RStudio Environment pane:



The data.frame() function uses the names of the vectors as the column names of the data frame. Inheriting names this way is common in R when functions create new vectors, data frames, and other structures.
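The inherited names can be checked with names(). The sketch below also shows that supplying name = value arguments overrides the inherited names. The vectors and the alternative column names (Candidate, A, B) are hypothetical.

```r
# Hypothetical input vectors (made-up values)
Name     <- c("Jeb", "Candidate B", "Candidate C", "Candidate D")
NetworkA <- c(12, 25, 18, 30)
NetworkB <- c(10, 27, 20, 28)

# Column names are taken from the vector names
polling <- data.frame(Name, NetworkA, NetworkB, stringsAsFactors = FALSE)
names(polling)

# Explicit names replace the inherited ones
renamed <- data.frame(Candidate = Name, A = NetworkA, B = NetworkB,
                      stringsAsFactors = FALSE)
names(renamed)
```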

If we want the mean value for an individual candidate, we cannot simply apply the mean() function to the polling data frame. Each candidate occupies a single row, so we need a function that calculates the mean of a row’s values, and we need to restrict it to the numeric columns so that it does not attempt to use the Name character column. The rowMeans() function meets these needs: we pass it a subset of the data frame, indexed by row number and then by column range. In this data the first row is Jeb and columns 2 and 3 contain the polling values, so rowMeans(polling[1,2:3]) returns Jeb's average.
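A sketch of the single-row case, assuming the hypothetical polling data frame below (made-up values, with Jeb in row 1). The variable name jeb_mean is also made up.

```r
# Hypothetical polling data frame (made-up values)
polling <- data.frame(
  Name     = c("Jeb", "Candidate B", "Candidate C", "Candidate D"),
  NetworkA = c(12, 25, 18, 30),
  NetworkB = c(10, 27, 20, 28),
  stringsAsFactors = FALSE
)

# Row 1, columns 2 to 3: only the numeric polling values for Jeb
jeb_mean <- rowMeans(polling[1, 2:3])
jeb_mean
```

With these made-up numbers the result is (12 + 10) / 2 = 11, returned as a vector named after the row.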



When we leave the row index blank instead of specifying a row number, R uses every row of the data frame. So rowMeans(polling[,2:3]) returns the average of columns 2 and 3 for each row. It is also possible to store these averages back in the polling data frame, creating a new column in the process, with polling$average <- rowMeans(polling[,2:3]).
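The all-rows case and the new column can be sketched like this, again assuming a hypothetical polling data frame with made-up values.

```r
# Hypothetical polling data frame (made-up values)
polling <- data.frame(
  Name     = c("Jeb", "Candidate B", "Candidate C", "Candidate D"),
  NetworkA = c(12, 25, 18, 30),
  NetworkB = c(10, 27, 20, 28),
  stringsAsFactors = FALSE
)

# Blank row index: average columns 2 and 3 for every row
rowMeans(polling[, 2:3])

# Store the per-candidate averages in a new column named "average"
polling$average <- rowMeans(polling[, 2:3])
polling
```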

These average polling results can be plotted with qplot(polling$Name, polling$average), using qplot() from the ggplot2 package, to create a point plot like the one below:


A point plot is not easy to read for data such as this, however, and is not an effective visualization of these descriptive statistics. A bar plot is easier to take in visually and can be created using barplot(polling$average, names.arg=polling$Name):
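A self-contained sketch of the bar plot, assuming a hypothetical polling data frame with made-up values. The ylab argument and the mids variable are additions for illustration; barplot() returns the bar midpoints, which are captured here only so they can be inspected.

```r
# Hypothetical polling data frame (made-up values)
polling <- data.frame(
  Name     = c("Jeb", "Candidate B", "Candidate C", "Candidate D"),
  NetworkA = c(12, 25, 18, 30),
  NetworkB = c(10, 27, 20, 28),
  stringsAsFactors = FALSE
)
polling$average <- rowMeans(polling[, 2:3])

# One bar per candidate, labeled with the candidate's name
mids <- barplot(polling$average,
                names.arg = polling$Name,
                ylab = "Average polling result")
```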

