Sunday, March 26, 2017

Basic Debugging in R

For this post we will look at some basic debugging techniques using RStudio and the R language, working on a provided function that contains at least one bug. The provided code is:

tukey_multiple <- function(x) { 
   outliers <- array(TRUE,dim=dim(x)) 
   for (j in 1:ncol(x)) 
    { 
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j]) 
    } 
outlier.vec <- vector(length=nrow(x)) 
    for (i in 1:nrow(x)) 
    { outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }



Copied and pasted directly into RStudio, the provided function does not run and produces an error message:


The error message gives us some hint as to where the problem is in the code, but it does not include a line number like many programming environments would. We can eventually match up the error in the console window with lines 8 & 9, but curiously RStudio is not showing all of line 9, cutting it off at 'return'. That cut-off is itself a clue: R echoes the code only up to the token the parser could not accept.

This error appears to be due to return(outlier.vec) sitting on the same line as the closing brace of the 'for' loop, with nothing separating the two expressions. As soon as we move return(outlier.vec) to its own line, R accepts the function definition without any errors or warnings:
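The screenshot of the corrected function is not reproduced here, so the following is a sketch of the fix rather than the exact code from the original session. The helper can_parse() is a hypothetical name introduced only for this illustration; it asks R's parser whether a piece of source text is syntactically valid, without running it.

```r
# Hypothetical helper: TRUE if R can parse the text, FALSE on a syntax error.
can_parse <- function(src) {
  !inherits(try(parse(text = src), silent = TRUE), "try-error")
}

# Original layout: return() follows the for loop's closing brace on one line.
broken_src <- "
tukey_multiple <- function(x) {
   outliers <- array(TRUE,dim=dim(x))
   for (j in 1:ncol(x))
    {
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
    }
outlier.vec <- vector(length=nrow(x))
    for (i in 1:nrow(x))
    { outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }
"

# First fix: return() moved to its own line.
fixed_src <- "
tukey_multiple <- function(x) {
   outliers <- array(TRUE,dim=dim(x))
   for (j in 1:ncol(x))
    {
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
    }
outlier.vec <- vector(length=nrow(x))
    for (i in 1:nrow(x))
    { outlier.vec[i] <- all(outliers[i,]) }
    return(outlier.vec)
}
"

broken_ok <- can_parse(broken_src)  # FALSE: syntax error at 'return'
fixed_ok  <- can_parse(fixed_src)   # TRUE
```

Note that this only checks syntax. Whether the function computes the right thing is a separate question.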


This may seem like a simple thing that would not have bothered a language such as C, but in C there is an explicit character that ends each statement, the semicolon (;), and that semicolon is mandatory. The R language also accepts a semicolon to end a statement, but it is optional, and the R interpreter will also accept a newline as the separator (see this reference from the R-Project.org website for details). To verify this we can go back to our original code with 'return(outlier.vec)' on line #9 again, but insert a semicolon between the closing brace of the for loop and return. That is now also interpreted without error by RStudio:
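The separator rule can be seen on a much smaller example than the full function. The sketch below uses a hypothetical helper, can_parse(), and a made-up three-iteration loop; neither comes from the original assignment.

```r
# Hypothetical helper: TRUE if R can parse the text, FALSE on a syntax error.
can_parse <- function(src) {
  !inherits(try(parse(text = src), silent = TRUE), "try-error")
}

# Two expressions on one line with nothing between them: rejected.
no_separator <- can_parse("for (i in 1:3) { total <- i } print(total)")

# Same line, separated by a semicolon: accepted.
with_semicolon <- can_parse("for (i in 1:3) { total <- i }; print(total)")

# Separated by a newline: accepted.
with_newline <- can_parse("for (i in 1:3) { total <- i }\nprint(total)")

c(no_separator, with_semicolon, with_newline)  # FALSE TRUE TRUE
```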



That said, placing the return that way does not help readability. Still, we have identified why the error was occurring, identified two solutions, and with the first correction produced a more readable version of the function 'tukey_multiple()' that R now accepts.

With that complete it would appear that the function is all set. Can we feed it some data to confirm proper execution and correct results? Inspecting the code, it appears that the function expects a two-dimensional array (a matrix) with multiple rows and multiple columns. We can create a test array of 2 columns and 20 rows with random numbers:
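The exact call used in the original session is not preserved, so this is one possible way to build such a test array. The choice of rnorm(), the seed value, and the name test_data are all assumptions made for illustration.

```r
# Assumption: a fixed seed so the "random" numbers are the same on every run.
set.seed(42)

# 40 standard-normal draws arranged as 20 rows by 2 columns.
test_data <- matrix(rnorm(40), nrow = 20, ncol = 2)

dim(test_data)       # 20 rows, 2 columns
head(test_data, 3)   # first three rows
```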



And then feed this to the tukey_multiple() function we just fixed and see what the output is:


Again there are no line numbers to help us out. Good thing this is not a large function with dozens or hundreds of lines to review, but we can do a quick search for 'tukey.outlier' and find it on line #5. It looks like tukey.outlier is a call to a function that is not defined in our session or in any loaded library, and that the result of this call is being logically AND'ed with the value in outliers[,j] before being assigned back to outliers[,j]. To finish testing this function we'd need the tukey.outlier() function along with test data and the matching expected results to confirm the function is outputting valid results.
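To get further without the real helper, we could substitute a stand-in. The tukey.outlier() below is hypothetical: it assumes the usual Tukey rule of flagging values more than 1.5 times the interquartile range beyond the quartiles, which may not match the function the assignment intended. The sketch also swaps && for & on line #5. That change is my own guess at a remaining bug: && is not vectorized, and in R 4.3.0 and later it raises an error when either side has length greater than one, while earlier versions used only the first element of each side.

```r
# Hypothetical stand-in, NOT the assignment's function:
# TRUE where a value lies more than 1.5 * IQR beyond the quartiles.
tukey.outlier <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in 1:ncol(x)) {
    # Element-wise & in place of the original && (assumed fix).
    outliers[, j] <- outliers[, j] & tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in 1:nrow(x)) {
    # A row counts only if it is an outlier in every column.
    outlier.vec[i] <- all(outliers[i, ])
  }
  return(outlier.vec)
}

# Made-up data with a known answer: only row 20 is extreme in both columns.
known_data <- cbind(c(1:19, 100), c(1:19, 200))
result <- tukey_multiple(known_data)
which(result)  # 20
```

Using data where the answer is known in advance makes it possible to check the output, which random numbers alone would not.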

Sunday, March 12, 2017

Visualization and Graphics in R


Research and development in the area of data visualization has been increasing as the volume, variety, and velocity of data collection and aggregation has also been increasing. This research involves study of the human consumer of the visualization and the associated nuances of psychology of how people interpret and react to visual information, as well as the computer and mathematical sciences involved in turning large volumes of data into visual depictions.

In "The Psychology of Visualization" (November 1992), Andrew Csinger of the Department of Computer Science, University of British Columbia, reviewed the literature of that time on visualization processes and display generation. The topics covered include challenges to a user's ability to take in data due to divided attention, the care needed in selecting colors, and the limits of people's ability to perceive visual data. More recently Scientific American published an article summarizing further research on data visualization and human emotions, drawing heavily from the "Atlas of Emotions" created by psychologist Paul Ekman.

There are several libraries and packages available in R for plotting and visualizing data. This post will take into consideration the basic "no-frills" graphics package installed by default with R as well as two of the later packages designed to give the programmer finer tuned control over the visualization but that also incorporate, by default, some of the benefits of the research on improved visualizations that better convey data to the end user.

‘R Graphics’ by Paul Murrell, first published in 2005, is cited in several books as one of the first definitive books on the R Graphics systems. In the second edition published in 2011 the author discusses two distinct graphics systems in R: the traditional graphics system and the grid graphics system. The grid graphics system is more powerful than the traditional system and became the foundation for the lattice package and ggplot2. The lattice and ggplot2 packages use grid’s low-level general purpose graphics system to draw plots.

> library(graphics)

The traditional graphics system includes the capability to produce complete scatterplots, histograms, and boxplots, among others. The traditional graphics functions live in the “graphics” package, which is loaded automatically in the standard installation of R.

plot() will accept a single numeric vector, a factor, or a one-dimensional table to produce a scatterplot or barplot. There is a special method in plot() for agnes objects that produces plots related to agglomerative hierarchical clustering analysis. Additional high level traditional graphics methods for single variables include barplot(), pie(), dotchart(), boxplot(), hist(), stripchart(), and stem().
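As a quick illustration of the single-variable case, the sketch below passes different kinds of objects to plot() and calls a few of the other functions listed. The use of the built-in mtcars dataset and the variable names are my choices for the example, not something taken from Murrell's book.

```r
values <- mtcars$mpg           # a numeric vector
groups <- factor(mtcars$cyl)   # a factor

plot(values)                   # numeric vector: values plotted against their index
plot(groups)                   # factor: barplot of the count in each level
plot(table(mtcars$cyl))        # one-dimensional table: bar-like plot of counts

h <- hist(values)              # histogram; the bin data is returned invisibly
boxplot(values)                # boxplot of the same vector
stripchart(values)             # one-dimensional scatter of the values
stem(values)                   # stem-and-leaf display, printed as text
```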

For plotting data of two variables plot() is also capable of handling a pair of numeric vectors, one numeric vector and a factor, two factors, a list of two vectors or factors, a 2-D table, a matrix, or a data frame with two columns. The plot() method with two variables can directly produce a scatterplot, stripchart, boxplot, spineplot, or a mosaic plot. Additional traditional graphics methods include sunflowerplot(), smoothScatter(), boxplot(), barplot(), dotchart(), stripchart(), spineplot(), cdplot(), fourfoldplot(), assocplot(), and mosaicplot().
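The way plot() picks a display from the types of its arguments is easiest to see by trying a few combinations. This is a small sketch using the built-in mtcars and pressure datasets, which are my choice for illustration.

```r
cyl <- factor(mtcars$cyl)   # number of cylinders as a factor
am  <- factor(mtcars$am)    # transmission type as a factor

plot(mtcars$wt, mtcars$mpg)          # two numeric vectors: scatterplot
plot(cyl, mtcars$mpg)                # factor and numeric vector: one boxplot per level
plot(cyl, am)                        # two factors: spineplot
plot(table(mtcars$cyl, mtcars$am))   # 2-D table: mosaic plot
plot(pressure)                       # data frame with two columns: scatterplot
```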

The traditional graphics library methods can also handle more than two variables and produce additional forms of graphs and visualization. The programmer can also extend the traditional graphics library further on their own.

> library(grid)

The grid graphics system is separate from the traditional graphics system, but it does co-exist in R with the traditional system. That said, the two graphics systems do not interact well together; grid works completely separately from the traditional system. It is possible to create output from both the grid system and the traditional system on the same page, but it is not supported for one system to provide output to the other in any reliable manner. While grid provides the low-level functionality, it is necessary to use a higher-level library to draw complete plots; grid is typically not used by programmers on its own. Deepayan Sarkar’s lattice library is one such high-level package and Hadley Wickham’s ggplot2 is the other, both of which are in popular and common use.
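To give a feel for what "low-level" means here, the following sketch draws a few primitives with grid directly. The layout and the label text are made up for the example; note that nothing in it produces axes, scales, or a legend, which is the work lattice and ggplot2 take on.

```r
library(grid)

grid.newpage()                                      # start a blank page

# A viewport is a rectangular drawing region; this one covers the
# central 80% of the page.
pushViewport(viewport(width = 0.8, height = 0.8))

grid.rect(gp = gpar(col = "grey40"))                # border around the viewport
grid.lines(x = c(0.1, 0.9), y = c(0.1, 0.9))        # a diagonal line

label <- textGrob("drawn with grid", x = 0.5, y = 0.9)  # a graphical object (grob)
grid.draw(label)                                    # render the grob

popViewport()                                       # return to the top level
```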

> library(lattice)

The capabilities in lattice seem very similar to the traditional graphics system’s, so why did the community feel there was a need for an additional library? According to Murrell, lattice incorporated refinements in the presentation of visualizations that took into account research from visual perception experiments. This made the default use of colors, axis labeling, grid tick marks and labeling, and legends easier for users to take in visually and understand compared to the way the traditional graphics system displayed similar plots. The careful selection of default values built in to lattice plots is one of its recognized core features. Combined with the fine-grained control lattice gives the programmer to change these defaults and further customize the appearance of the plot, this makes lattice a significant improvement over the traditional graphics system.
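A minimal lattice sketch follows, assuming the Pressure/Temperature data used later in this post is R's built-in pressure dataset. The title text and the particular type, lty, and pch values are illustrative guesses, not the settings used for the original figures.

```r
library(lattice)

# Default settings: lattice chooses colors, ticks, and labels.
p_default <- xyplot(pressure ~ temperature, data = pressure)

# Overriding some defaults: a title, points joined by a dashed line,
# and a different plotting symbol.
p_custom <- xyplot(pressure ~ temperature, data = pressure,
                   main = "Pressure vs. Temperature",
                   type = "b", lty = 2, pch = 16)

# lattice functions return an object; printing it draws the plot.
print(p_default)
print(p_custom)
```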

> library(ggplot2) 

The double g’s in ggplot2 are a nod to Leland Wilkinson’s book, ‘The Grammar of Graphics.’ The ggplot2 package is built on grid but it is completely separate from both the traditional graphics system and the lattice package. The ggplot2 library, like the lattice library, incorporates features developed from an understanding of how to design plots and visualizations that convey information to a user. Whether a user prefers the ggplot2 or the lattice rendering of the same graph may simply come down to style choices that differ between the two systems, although ggplot2 does offer some further control and enhancements over lattice for arranging complex plots and legends.
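A comparable ggplot2 sketch is shown here, again assuming the built-in pressure dataset and a made-up title. The figures in this post were made with qplot(); that function has been deprecated since ggplot2 3.4.0, so the sketch uses the equivalent ggplot() form instead.

```r
library(ggplot2)

# Default settings: map the two columns to x and y and draw points.
p_default <- ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_point()

# Layers are added with +: here a connecting line and a title.
p_custom <- p_default +
  geom_line() +
  ggtitle("Pressure vs. Temperature")

# As with lattice, printing the object draws the plot.
print(p_default)
print(p_custom)
```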

Plotting one simple set of data with each of the three systems shows the differences between the basic graphics, the lattice library, and the ggplot2 library, with some of the enhancements visible as we progress from one system to the next.

Basic Plot of Pressure/Temperature

lattice plot of Pressure/Temperature (default settings)

lattice xyplot of Pressure/Temperature (added title, dashed line between points, changed point plot type)

ggplot2 qplot of Pressure/Temperature (default settings)

ggplot2 qplot of Pressure/Temperature (added title, line between points)

The above can be viewed as an evolution of improving visualizations, with each one making it easier to interpret the presented data and to estimate values between the recorded data points, thanks to the grid lines and the connecting lines between plotted points. The default use of colors in the lattice plot is more appealing and easier on my eyes compared to the ggplot2 or the basic graphics plot, but that could be a simple personal preference, and ggplot2 does have the option to change colors. The difference is that lattice seems to do a better job of taking color factors into consideration by default. Matters of color, scaling selection, offering of reference grids, and legends all aid in reducing the amount of mental work a user has to do in order to understand and interpret a visualization.


Further References:

Paul Murrell’s University of Auckland homepage, https://www.stat.auckland.ac.nz/~paul/
"Challenges in Visual Data Analysis", DA Keim, F. Mansmann, J. Schneidewind, H. Ziegler; IEEE Xplore, 24 July 2006.