Data Science and Visualization Explorations: January 2017

Monday, January 23, 2017

Exploring and Manipulating Vector, Matrix, and Data Frame Types in R

In R a vector can be created with the combine function, c(). (Lists, by contrast, are created with list().) To explore c() we can create three vectors containing made-up polling results for a collection of political candidates from two networks:
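The original statements are not reproduced here, so the following is only a sketch of what they might look like. The vector names ABC and CBS, the candidates other than Jeb, and all of the numbers are assumptions made up for illustration.

```r
# Hypothetical stand-ins: the candidate names (other than Jeb) and all
# poll percentages below are invented for illustration.
Name <- c("Jeb", "Donald", "Ted")   # character vector
ABC  <- c(4, 62, 51)                # numeric vector, assumed network 1
CBS  <- c(12, 75, 43)               # numeric vector, assumed network 2

class(Name)   # "character"
class(ABC)    # "numeric"
length(CBS)   # 3
```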

After executing these three statements we can look in the RStudio Environment window and confirm the vectors are associated with the data types (numeric and character) as intended.

Our intention was to create a vector but can we confirm this somehow? The R language includes a family of functions that test whatever is passed to them so we can try is.vector(), is.data.frame(), and is.list() to see what R returns for each of these tests.
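A minimal sketch of these tests, run against a made-up numeric vector (the name ABC and its values are assumptions):

```r
# Made-up polling values for one hypothetical network
ABC <- c(4, 62, 51)

is.vector(ABC)      # TRUE: a plain atomic vector
is.list(ABC)        # FALSE
is.data.frame(ABC)  # FALSE
```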

There is a large list of is.___() functions that come with the base installation of R. They can be scrolled through with the R console’s autocomplete functionality by typing “is.” and pausing in the console for a scroll window to appear:
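Outside of autocomplete, one way to get a similar listing is apropos(), which searches the attached packages for object names matching a regular expression. The variable name is_functions is just an illustrative choice, and the exact list returned depends on which packages are loaded.

```r
# Collect every visible object whose name starts with "is."
is_functions <- apropos("^is\\.")

head(is_functions)                 # first few matches
"is.vector" %in% is_functions      # TRUE
```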

With these three individual vectors we can create a single data frame to tie the network polling results together with the respective candidates with the data.frame() function.
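A sketch of that statement, again using made-up vectors (the names ABC and CBS and all values are assumptions). stringsAsFactors = FALSE is set explicitly because R versions before 4.0, including the 3.3.1 release used for these posts, would otherwise convert the Name column to a factor.

```r
# Hypothetical stand-in vectors with invented values
Name <- c("Jeb", "Donald", "Ted")
ABC  <- c(4, 62, 51)
CBS  <- c(12, 75, 43)

# Tie the three vectors together, one row per candidate
polling <- data.frame(Name, ABC, CBS, stringsAsFactors = FALSE)

dim(polling)             # 3 3
is.data.frame(polling)   # TRUE
```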

And as with the vectors, it is possible to quickly inspect the success of this statement in the R Environment window:

The data.frame() function presumes the names for the data frame elements are simply the names of the vectors. This inheritance naming methodology is common in R as functions create new vectors, data frames, and other structures.
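The inherited names can be checked with names(), and they can be overridden by passing name = value pairs to data.frame(). The replacement column names below (Candidate, NetworkA, NetworkB) are purely illustrative, as are the data.

```r
# Hypothetical stand-in vectors with invented values
Name <- c("Jeb", "Donald", "Ted")
ABC  <- c(4, 62, 51)
CBS  <- c(12, 75, 43)

# Column names are inherited from the vector names
polling <- data.frame(Name, ABC, CBS, stringsAsFactors = FALSE)
names(polling)   # "Name" "ABC" "CBS"

# Explicit names override the inherited ones
renamed <- data.frame(Candidate = Name, NetworkA = ABC, NetworkB = CBS,
                      stringsAsFactors = FALSE)
names(renamed)   # "Candidate" "NetworkA" "NetworkB"
```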

If we want to calculate the mean value for individual candidates we cannot simply apply the mean() function to the polling data frame. Each candidate is represented on a single row, so what we need is a function that calculates the mean of a row’s values, restricted to the numeric columns so it does not attempt to use the Name character string column. The rowMeans() function meets these needs: we pass it a subset of the data frame, selected with a row number and a column range inside square brackets, as in rowMeans(polling[1, 2:3]). In this data the first row is Jeb and columns 2 & 3 contain the polling values.
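As a sketch with made-up numbers (the network column names and every value are assumptions), the single-row case looks like this:

```r
# Hypothetical polling data frame with invented values
polling <- data.frame(Name = c("Jeb", "Donald", "Ted"),
                      ABC  = c(4, 62, 51),
                      CBS  = c(12, 75, 43),
                      stringsAsFactors = FALSE)

# Row 1 (Jeb), columns 2 and 3 (the two numeric poll columns)
jeb_mean <- rowMeans(polling[1, 2:3])
jeb_mean   # (4 + 12) / 2 = 8, returned as a vector named by row label "1"
```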

When we do not specify a row number and instead leave the value blank R presumes the user is seeking to use the entire data frame. So when we execute rowMeans(polling[,2:3]) the function returns the average for columns 2 & 3 of each row. It is also possible to insert these average values back into the polling data frame, creating a new column name in the process, with polling$average <- rowMeans(polling[,2:3]).
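The same idea applied to every row, sketched with the same invented data (column names and values are assumptions):

```r
# Hypothetical polling data frame with invented values
polling <- data.frame(Name = c("Jeb", "Donald", "Ted"),
                      ABC  = c(4, 62, 51),
                      CBS  = c(12, 75, 43),
                      stringsAsFactors = FALSE)

# Blank row index: use all rows; 2:3 keeps only the numeric columns
polling$average <- rowMeans(polling[, 2:3])

polling$average   # 8.0 68.5 47.0
names(polling)    # "Name" "ABC" "CBS" "average"
```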

These average polling results can be plotted with qplot(polling$Name, polling$average), using qplot() from the ggplot2 package, to create a point plot like the one below:

Point plots are not easy to read for data such as this, however, and are not an effective visualization of these descriptive statistics. A bar plot would be easier to take in visually. Such a bar plot can be created using barplot(polling$average, names.arg=polling$Name):
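A self-contained sketch of the bar plot call using base graphics. The data are invented, and the axis label and the variable bar_midpoints are illustrative additions rather than part of the original statement.

```r
# Hypothetical polling data frame with invented values
polling <- data.frame(Name = c("Jeb", "Donald", "Ted"),
                      ABC  = c(4, 62, 51),
                      CBS  = c(12, 75, 43),
                      stringsAsFactors = FALSE)
polling$average <- rowMeans(polling[, 2:3])

# One bar per candidate; barplot() also returns the x position of each bar
bar_midpoints <- barplot(polling$average,
                         names.arg = polling$Name,
                         ylab = "Average poll result")
```

The returned midpoints are handy if text labels need to be placed above the bars later.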

Sunday, January 15, 2017

R Language: Importing Data and Performing Basic Descriptive Statistical Measures

R provides functions for reading in data locally from a file or remotely over a network. The documentation for the read.csv() function is accessible from RStudio by typing help(read.csv) in the Console window or from the Help tab. From here we learn that read.csv is a version of the read.table function and a sister function to read.csv2, read.delim, and read.delim2, all of which read in a file and create a data frame.

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

The file argument is an abstraction that allows for passing more than just a local csv or text file for reading. When we look up ‘file’ in Help the documentation advises “Functions to create, open and close connections, i.e. ‘generalized files’, such as possibly compressed files, URLs, pipes, etc.” This means we can pass read.csv a local file, a URL, a pipe, a socket, or a compressed file (gzip, bzip2, or xz directly, or a member of a ZIP archive through an unz() connection). The ability to directly process compressed files is attractive as it saves disk storage space, especially since text-based csv files often contain a lot of redundancy that compresses efficiently.
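To illustrate the compressed-file case, here is a small round trip through a gzip connection. The data frame, its column names, and the temporary file are all made up for the example.

```r
# Tiny invented data set to write out and read back
sample_rows <- data.frame(id = 1:3, value = c(10.5, 20.25, 30))

# Write the CSV through a gzip connection to a temporary file
gz_path <- tempfile(fileext = ".csv.gz")
gz_con  <- gzfile(gz_path, "w")
write.csv(sample_rows, gz_con, row.names = FALSE)
close(gz_con)

# read.csv() accepts the connection and decompresses while reading
restored <- read.csv(gzfile(gz_path))
restored
```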

Using the American Community Survey data sample for Oregon referenced in the assignment we can directly import that CSV formatted data over HTTP into R and the read.csv function will attempt to understand the formatting and header rows and populate a data frame. To assign the data frame to a variable we can use the ‘<-’ symbol combination or ‘=’. Which to use is a matter of style preference, but the R user community seems to prefer ‘<-’ for assignment so that it is not confused with ‘=’ used for naming function arguments or with the equality check ‘==’.
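The URL of the survey file is not given here, so this sketch reads a few invented CSV rows from a string through read.csv()'s text argument instead; passing a URL string as the first argument would follow the same pattern. The column household_id and all values are assumptions, and only number_children is a column name taken from the data discussed in this post.

```r
# Invented CSV content standing in for the remote file
csv_text <- "household_id,number_children\n1,0\n2,2\n3,0"

acs_sample <- read.csv(text = csv_text)   # preferred assignment style
acs_alt     = read.csv(text = csv_text)   # also valid at the top level

identical(acs_sample, acs_alt)   # TRUE: both forms assign the same object
nrow(acs_sample) == 3            # '==' is the equality check, not '='
```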

Once the file has been read in, the RStudio Environment tab provides a quick way to inspect the data as it was interpreted. From this it can be seen that 7811 rows (observations) of data consisting of 14 variables per row have been assigned to the data frame named ‘acs’, with a preview of each column variable and the data type R determined for it (int, factor, etc):

Clicking directly on the variable from the Environment tab will display the entire data frame in table format in a new tab.

Another function in R to quickly inspect and review a data frame is summary(), responding with basic statistical measures on each column/variable:
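A sketch of summary() on a tiny invented stand-in for the survey data (household_id and all values are assumptions):

```r
# Invented stand-in for the acs data frame
acs_sample <- data.frame(household_id    = 1:6,
                         number_children = c(0, 2, 0, 1, 3, 0))

summary(acs_sample)   # per-column min, quartiles, median, mean, max

# summary() on a single column returns a named vector
children_summary <- summary(acs_sample$number_children)
children_summary[["Mean"]]   # 1
```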

Having confirmed that the data has been read in, that column variable names have been assigned, and that the types for each column seem correct, it is possible to begin exploring and manipulating the data. For example, all of the values for number_children can be accessed by first specifying the data frame variable name followed by a $ and the name of the column/variable:
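With an invented stand-in for the survey data (household_id and the values are assumptions), column access looks like this:

```r
# Invented stand-in for the acs data frame
acs_sample <- data.frame(household_id    = 1:6,
                         number_children = c(0, 2, 0, 1, 3, 0))

# $ extracts one column as a plain vector
children <- acs_sample$number_children
children   # 0 2 0 1 3 0

# Double brackets with the column name give the same vector
identical(children, acs_sample[["number_children"]])   # TRUE
```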

The summary(acs) output already displayed the mean value for number_children but it is also possible to directly calculate it using the mean() function:

It is also possible to directly calculate the mode, variance, and standard deviation:
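A sketch of these calls on a short invented vector standing in for acs$number_children (the values are made up, so the results differ from the real survey data):

```r
# Invented stand-in for acs$number_children
children <- c(0, 2, 0, 1, 3, 0)

mean(children)   # 1
var(children)    # 1.6 (sample variance, divides by n - 1)
sd(children)     # square root of the variance
mode(children)   # "numeric"
```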

The response for mode() was “numeric”, which seems odd. The statistical mode is the value that occurs most often in the dataset, and among the values in the vector acs$number_children there is probably one value that occurs more often than the others. A definitive answer can be found by plotting the vector in a histogram:

From this it can be seen that the value ‘0’ is the most frequently occurring value, making 0 the mode of acs$number_children. So what is going on here? Does the mode() function actually refer to something other than the statistical measure of mode? From within the console window’s autocomplete function it can be seen that the function mode() currently in the R environment will return the type or storage mode of an object, which explains the response of ‘numeric’ when passing it the vector acs$number_children. Looking up mode in the Help window provides further details of this mode() function.
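Alongside the histogram, a frequency table makes the same point numerically. This sketch uses an invented vector in place of acs$number_children, so the counts are illustrative only.

```r
# Invented stand-in for acs$number_children
children <- c(0, 2, 0, 1, 3, 0)

mode(children)          # "numeric": the storage mode, not a statistic

counts <- table(children)        # how often each value occurs
counts
most_common <- names(which.max(counts))
most_common             # "0" (table names are character strings)
```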

It appears that there is no default R language function for the statistical mode, but there is an existing implementation in an R package that can be added to the current environment. A quick search reveals the package DescTools, and the documentation, available from https://cran.r-project.org/web/packages/DescTools/DescTools.pdf , indicates that Mode() calculates the most frequent value of a variable. This package can be installed from the console with install.packages("DescTools") and then needs to be explicitly loaded into the current environment with library(DescTools).

This library’s version of mode requires calling it with Mode(), not to be confused with mode():
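For comparison, the sketch below defines a small base R helper for the statistical mode and only calls DescTools::Mode() if the package happens to be installed. The helper name stat_mode and the sample vector are assumptions for illustration, and the helper is not the DescTools implementation.

```r
# Invented stand-in for acs$number_children
children <- c(0, 2, 0, 1, 3, 0)

# Hypothetical helper: returns every value tied for the highest count
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

stat_mode(children)   # 0

# Capital-M Mode() from DescTools, used only when the package is available
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(DescTools::Mode(children))
}
```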

Monday, January 9, 2017

Setting Up for Introduction to R Programming

1. Open account on GitHub.com and share the location of your repository.

https://github.com/patrick-usf/usf-r

2. Open account on any blog sites (blogger.com, wordpress.com, or blog.usf.edu), and share the site.

3. Download R console and RStudio; Installed R v3.3.1 and RStudio v1.0.136