Sunday, March 26, 2017

Basic Debugging in R

For this posting we will take a look at some basic debugging techniques using R Studio and the R language, applied to a provided function containing at least one bug. The provided code is:

tukey_multiple <- function(x) { 
   outliers <- array(TRUE,dim=dim(x)) 
   for (j in 1:ncol(x)) 
    { 
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j]) 
    } 
outlier.vec <- vector(length=nrow(x)) 
    for (i in 1:nrow(x)) 
    { outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }



Copied and pasted directly into R Studio, the provided function does not run, producing an error message:


The error message gives us some hint as to where the problem is in the code, but it does not include a line number the way many programming environments would. We can eventually match up the error in the console window with lines 8 & 9, but curiously R Studio is not showing all of line 9, cutting it off at the keyword 'return.'

This error appears to be due to R rejecting the 'return' keyword placed on the same line as the closing brace of the for loop: as soon as we drop return(outlier.vec) onto its own line, the function is stored into R as a valid function without any errors or warnings:
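For reference (the original post showed the corrected code as a screenshot), the first correction looks roughly like the sketch below. As an aside, note that `&&` compares single values only; the element-wise `&` operator would likely be the intent for combining whole columns, though that is a separate logic question from the parse error discussed in this post.

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in 1:ncol(x)) {
    # note: `&&` operates on single values; element-wise `&` may be
    # what is actually intended here (a logic issue, not a parse error)
    outliers[, j] <- outliers[, j] && tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in 1:nrow(x)) {
    outlier.vec[i] <- all(outliers[i, ])
  }
  return(outlier.vec)   # moved to its own line so the function parses
}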


This may seem like a simple thing that would not necessarily have bothered a language such as C, but C has an explicit character that ends a statement, the semicolon (;), and that semicolon is mandatory. The R language also uses a semicolon to end a statement, but it is optional, and the R interpreter will also accept a newline as separating statements (see this reference from the R-Project.org website for details). To verify this we can go back to our original code with 'return(outlier.vec)' on line #9 again but insert a semicolon between the for loop's closing brace and the return keyword. That version is also interpreted without error by R Studio:



That said, placing the return that way hurts readability, but we have identified why the error was occurring, identified two solutions, and produced in the first correction a more readable and properly executing version of the function 'tukey_multiple()'.

With that complete it would appear that the function is all set. Can we feed it some data to confirm proper execution and correct results? Inspecting the code, the function appears to expect a two-dimensional array of multiple rows and multiple columns. We can create a test array of 2 columns and 20 rows with random numbers:
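A sketch of that test data (the original was shown as a screenshot); rnorm() is one way to fill it with random values:

# 20 rows x 2 columns of random values
test_data <- matrix(rnorm(40), nrow = 20, ncol = 2)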



And then feed this to the tukey_multiple() function we just fixed and see what the output is:
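The exact console output was shown as a screenshot; the call fails with R's standard undefined-function error, roughly:

tukey_multiple(test_data)
# Error in tukey_multiple(test_data) : could not find function "tukey.outlier"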


Again no line numbers to help us out. It is a good thing this is not a large function with dozens or hundreds of lines to review, but we can do a quick search for 'tukey.outlier' and discover it on line #5. It looks like tukey.outlier is a call to a function that is neither defined in our session nor part of any loaded library, and the results of this call are being logically AND'ed with the value in outliers[,j] before being assigned back to outliers[,j]. To finish testing this function we would need the tukey.outlier() function along with test data and the matching expected results to confirm the function is outputting valid results.

Sunday, March 12, 2017

Visualization and Graphics in R


Research and development in the area of data visualization has been increasing as the volume, variety, and velocity of data collection and aggregation have also been increasing. This research involves the study of the human consumer of the visualization, with the associated nuances of how people psychologically interpret and react to visual information, as well as the computer and mathematical sciences involved in turning large volumes of data into visual depictions.

In "The Psychology of Visualization" by Andrew Csigner from the Department of Computer Science, University of British Columbia in November 1992 the author reviewed literature of that time on visualization processes and display generation. Challenges to user's ability to take in data due to divided attention, care in selection of colors, and limits of people's perception ability to take in visual data. More recently Scientific American published an article summarizing further research on data visualization and human emotions, drawing heavily from the "Atlas of Emotions" created by psychologist Paul Ekman.

There are several libraries and packages available in R for plotting and visualizing data. This post will consider the basic "no-frills" graphics package installed by default with R as well as two of the newer packages designed to give the programmer finer-tuned control over the visualization, packages that also incorporate, by default, some of the benefits of the research on improved visualizations that better convey data to the end user.

‘R Graphics’ by Paul Murrell, first published in 2005, is cited in several books as one of the first definitive books on the R Graphics systems. In the second edition published in 2011 the author discusses two distinct graphics systems in R: the traditional graphics system and the grid graphics system. The grid graphics system is more powerful than the traditional system and became the foundation for the lattice package and ggplot2. The lattice and ggplot2 packages use grid’s low-level general purpose graphics system to draw plots.

> library(graphics)

The traditional graphics system includes the capability to produce complete scatterplots, histograms, and boxplots, among others. The traditional graphics functions are within the “graphics” library, which is loaded automatically in the standard installation of R.

plot() will accept a single numeric vector, a factor, or a one-dimensional table to produce a scatterplot or barplot. There is a special method in plot() for agnes objects that produces plots related to agglomerative hierarchical clustering analysis. Additional high level traditional graphics methods for single variables include barplot(), pie(), dotchart(), boxplot(), hist(), stripchart(), and stem().

For plotting data of two variables, plot() is also capable of handling a pair of numeric vectors, a numeric vector and a factor, two factors, a list of two vectors or factors, a 2-D table, a matrix, or a data frame with two columns. The plot() method with two variables can directly produce a scatterplot, stripchart, boxplot, spineplot, or mosaic plot. Additional traditional graphics methods include sunflowerplot(), smoothScatter(), boxplot(), barplot(), dotchart(), stripchart(), spineplot(), cdplot(), fourfoldplot(), assocplot(), and mosaicplot().

The traditional graphics library methods can also handle more than two variables and produce additional forms of graphs and visualization. The programmer can also extend the traditional graphics library further on their own.
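As a sketch, assuming the figures later in this post are based on R's built-in pressure dataset (vapor pressure of mercury versus temperature), a complete default scatterplot from the traditional system takes a single call:

# traditional graphics: a complete scatterplot with default settings
plot(pressure$temperature, pressure$pressure,
     xlab = "Temperature (deg C)", ylab = "Pressure (mm Hg)")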

> library(grid)

The grid graphics system is separate from the traditional graphics system, though the two co-exist in R. That said, they do not interact well together: grid works completely separately from the traditional system, and while it is possible to place output from both systems on the same page, there is no reliable, supported way for one to feed output to the other. Grid provides only low-level functionality, so a higher level library is needed to draw complete plots; grid is typically not used by programmers on its own. Deepayan Sarkar’s lattice library is one such high level package and Hadley Wickham’s ggplot2 is another, both of which are in popular and common use.

> library(lattice)

The capabilities in lattice seem very similar to the traditional graphics system’s, so why did the community feel there was a need for an additional library? According to Murrell, lattice incorporated refinements in the presentation of visualizations that took into account research from visual perception experiments. This made the default use of colors, axis labeling, grid tick marks and labeling, and legends easier for users to take in visually and understand compared to the way the traditional graphics system displayed similar plots. The careful selection of default values built into lattice plots is one of its recognized core features; combined with the fine-grained control lattice gives the programmer to change those defaults and further customize a plot’s appearance, it makes lattice a significant improvement over the traditional graphics system.
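A sketch of calls that would produce plots like the lattice figures shown below (the exact settings used in the originals are assumptions):

# default settings
xyplot(pressure ~ temperature, data = pressure)

# customized: title, dashed connecting line, points and lines together
xyplot(pressure ~ temperature, data = pressure,
       type = c("p", "l"), lty = 2,
       main = "Pressure vs. Temperature")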

> library(ggplot2) 

The double g’s in ggplot2 are a nod to Leland Wilkinson’s book, ‘The Grammar of Graphics.’ The ggplot2 package is built on grid but is completely separate from both the traditional graphics system and the lattice package. The ggplot2 library, like the lattice library, incorporates features developed from an understanding of how to make plots and visualizations convey information to a user most effectively. Some users may prefer ggplot2 renderings of the same graph over lattice, or not, simply based on the style choices that differ between the two systems, and ggplot2 offers some further control and enhancements over lattice for arranging complex plots and legends.
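A sketch of calls that would produce plots like the ggplot2 figures shown below (the exact settings used in the originals are assumptions):

# default settings
qplot(temperature, pressure, data = pressure)

# customized: title and connecting line between points
qplot(temperature, pressure, data = pressure,
      geom = c("point", "line"),
      main = "Pressure vs. Temperature")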

Comparing how these three systems display the same set of simple variable data, we can see the differences below for the basic graphics package, the lattice library, and the ggplot2 library, with some of the enhancements visible as we progress from one system to the next.

Basic Plot of Pressure/Temperature

lattice plot of Pressure/Temperature (default settings)

lattice xyplot of Pressure/Temperature (added title, dashed line between points, changed point plot type)

ggplot2 qplot of Pressure/Temperature (default settings)

ggplot2 qplot of Pressure/Temperature (added title, line between points)

The above can be viewed as an evolution of improving visualizations, with each one making it easier to interpret the presented data and also to extrapolate values between the recorded data points thanks to the grid lines and the connecting lines between plotted points. The default use of colors in the lattice plot is more appealing and easier on my eyes compared to the ggplot2 or the basic graphics plot, but that could be simple personal preference, and ggplot2 does have the option to change colors. The difference is that lattice seems to do a better job of taking color factors into consideration by default. Matters of color, scale selection, reference grids, and legends all help reduce the amount of mental work a user has to do in order to understand and interpret a visualization.


Further References:

Paul Murrell’s University of Auckland homepage: https://www.stat.auckland.ac.nz/~paul/
D. A. Keim, F. Mansmann, J. Schneidewind, and H. Ziegler, “Challenges in Visual Data Analysis,” IEEE Xplore, 24 July 2006.


Sunday, February 26, 2017

File I/O, Transforms, and Filtering

File input and output in the R language is handled by a family of functions that trace back to read.table(). Data files commonly use the comma separated value (CSV) format, but other separator characters are also found in use, such as tab separated and pipe separated, and the read.table() function provides the option to specify custom separator characters when reading or writing files. The read.table() function can also read a file directly over the network via the HTTP protocol when provided a URL.
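As a sketch with hypothetical file names and URL, custom separators and HTTP reads look like this:

# tab-separated and pipe-separated files via the sep argument
grades_tab  <- read.table("grades.tsv", sep = "\t", header = TRUE)
grades_pipe <- read.table("grades.psv", sep = "|",  header = TRUE)

# read.table() also accepts a URL in place of a local file path
remote <- read.table("http://example.com/grades.csv", sep = ",", header = TRUE)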

If the programmer does not want to supply the R script with a filename and path the file.choose() function can be nested within the read.table() or read.csv() function. The file.choose() function will provide an operating system file selection dialog to browse to a file which then gets passed to read.table() or read.csv() for processing.

The partner function to read.table() is write.table(), with the notable limitation that write.table() cannot write a file over HTTP or PUT a file.

To explore file I/O and to experiment with the plyr library we will read in a provided file, determine the average test score broken down by sex for the data set, and then write the resulting table to a new file.

Then we will filter the data set based on names containing the letter 'i' (in either case) and write the subset of data based on that filter to a new file.

hw8 <- read.csv(file.choose())

# 1) import dataset.txt
# 2) Run a mean using Sex as the category (use the plyr package for this
#    operation), then write the resulting output to a file.
# 3) Test the DataSet.txt as a data frame for names whose Name contains the
#    letter i, then create a new data set with those names, and write those
#    names to a file separated by commas (CSV).

library(plyr)

# compute the Grade average per Sex group alongside the original columns
mean_by_sex <- ddply(hw8, "Sex", transform, Grade.Average=mean(Grade))
write.table(mean_by_sex, file="Mean_by_sex.txt")

# keep only the rows whose Name contains an upper- or lower-case 'i'
the_i_s <- subset(hw8, grepl("[iI]", Name))
write.table(the_i_s, file="contains_i.csv", sep=",")

The result of using the ddply() function with transform, applying mean() to the Grade column grouped by the Sex column, is the below table:


Using the R language's grepl() function with a regular expression matching either the upper- or lower-case letter 'i' produces the below table:


These two tables were output to file in different formats, with the average score table written directly as a raw table and the 'i'-filtered table written in CSV format. The contents of these files were viewed to confirm the format as written to disk:





S3 and S4 Object Oriented Principles in R

The R language has three object oriented systems: S3, S4, and Reference Classes (also referred to as R5). A class refers to a type of object and attributes or values describing that specific object and every object must be a member of a class. A method refers to a function associated with an object of a given class. In this post I shall focus only upon S3 and S4 and will not touch upon R5/Reference Classes.

To determine whether a given object is of S3 or S4 type there are two common function calls. The function isS4() will return TRUE if the supplied object is S4 and FALSE if it is anything other than S4. The function otype(), which is part of the pryr package, is more robust since it is not a simple true/false test for S4 but rather returns the object type definitively.
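A self-contained sketch of these tests (the objects churn, inv101, and inv501 used in the screenshot are created later in this post; the definitions below are stand-ins for illustration):

library(pryr)   # provides otype()

setClass("invoice", representation(amount = "numeric"))
inv101 <- new("invoice", amount = 100)   # stand-in S4 object
churn  <- data.frame(id = 1:3)           # stand-in S3 object (a data frame)
inv501 <- c(5, 0, 1)                     # a plain numeric vector

isS4(inv101)    # TRUE
isS4(churn)     # FALSE
isS4(inv501)    # FALSE -- cannot tell an S3 object from a base vector
otype(inv101)   # "S4"
otype(churn)    # "S3"
otype(inv501)   # "base"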



In the above example churn is a valid S3 object, inv101 was a valid S4 object, and inv501 was a simple vector array of three numbers. Since isS4() simply provides a TRUE or FALSE response for whether or not the provided object is S4, it is not able to differentiate between a vector array and an S3 object, while otype() does differentiate.

The base type of an object can be determined with a call to typeof(). Calling typeof() on an S4 object will return 'S4' as well, making it another means to identify S4, but like isS4() it does not positively identify S3 and instead reports a provided S3 object's underlying base type, such as 'list.' Passing typeof() a class member name will return that member's base type:
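Continuing the stand-in sketch from above:

typeof(inv101)    # "S4"
typeof(churn)     # "list" -- the base type underlying an S3 data frame
typeof(churn$id)  # "integer" -- the base type of a specific member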


In their 1988 First International Joint Conference of ISSAC paper (http://stepanovpapers.com/genprog.pdf) titled “Generic Programming,” David Musser and Alexander Stepanov defined generic programming as centering “around the idea of abstracting from concrete, efficient algorithms to obtain generic algorithms that can be combined with different data representations to produce a wide variety of useful software.” In R, a generic function can be called without concern for the specific types of the objects passed to it; there may be specific methods that operate differently depending on the arguments supplied, but the caller does not have to select one. When the function is called, the R interpreter routes the call to the method defined for the class of the objects in the function call.

For example, the functions plot() and print() are generic functions, and the specific methods available to each generic function can be found by calling the methods() function on the generic. Calling methods(plot) reveals the methods available to plot(), which the R interpreter selects among based on the objects and values present in the call. For example, a call to the generic would be routed one way if handed a matrix and a different way if handed a data.frame.
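A minimal sketch of S3 dispatch (the "greeting" class below is invented for illustration):

methods(plot)    # lists the plot.* methods the interpreter can dispatch to

# define a method for a made-up class, then let the generic route to it
print.greeting <- function(x, ...) cat("Hello,", unclass(x), "\n")
g <- structure("world", class = "greeting")
print(g)         # dispatches to print.greeting(): Hello, world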

This abstraction helps simplify aspects of programming and the use of functions, but it can introduce unexpected behavior and incorrect or misleading results. S3 implemented an unchecked form of object oriented design, and S4 was introduced to correct these shortcomings with a stricter design more akin to other existing object oriented systems. Despite the stricter nature of S4, or possibly because of it, S3 remains the most commonly used object oriented system in the R language.

S4 requires formal definition of classes and class inheritance while S3 allows the programmer to turn a data frame or a list into a class by simply adding the class attribute to the data frame. The other substantial difference in S4 is that a generic function can be dispatched to a method based on any number of argument classes and not just one as in S3.

In practice this helps eliminate some of the inconsistencies or misleading results that come from passing incorrect or bad arguments to an S3 object or method. In S3 the programmer can create a class and pass objects to it for processing, but if the programmer passes objects incompatible with the functions within that class, the R interpreter will not necessarily raise any errors or warnings and may simply return NULL values. In S4 the programmer specifically states class definitions for input and output, and if a user passes incorrect data to the class function it will return an error.
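As a quick illustration of that difference, using the churn (S3) and inv101 (S4) objects built in this post:

churn$no_such_column    # S3/list access: silently returns NULL, no warning
# inv101@no_such_slot   # S4 slot access: Error -- no slot of name "no_such_slot"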

For some examples I import a dataset of cellular churn data from a CSV file, assign a class to the data, and then explore some of the attributes and values as S3:
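A sketch of that setup (the file name and class name are assumptions based on the screenshots):

churn <- read.csv("churn.csv")   # ~20,000 observations of cellular churn data
class(churn) <- "customers"      # turn the data frame into an S3 object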



In the below shot of the R Studio values pane we can see that churn is now referred to as "Large customers," where before it was assigned a class it had been shown as a collection of 20,000 observations. Assigning the class also added two new attributes to the data set, attr("class") and attr("row.names").


Executing the commands after the class assignment from the earlier screenshot, we can confirm how specific class members or variables are accessed via the $ operator, along with the results of our tests for S3 or S4 membership and base type:
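Roughly, those commands look like this (the column name is an assumption):

head(churn$Churn)   # class members are accessed with the $ operator
isS4(churn)         # FALSE
otype(churn)        # "S3"
typeof(churn)       # "list"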




To explore S4 objects I created a class and a validity function, then tested the validity checker along with the same S3/S4 membership and base type tests as above:
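A sketch of what that class definition might look like (the original code was shown in a screenshot; the slot types and valid ranges below are assumptions):

# validity function: returns TRUE or a character vector of problems
checkCar <- function(object) {
  errors <- character()
  if (object@Price < 500 || object@Price > 200000)
    errors <- c(errors, "Price outside allowed range")
  if (object@Mileage < 0 || object@Mileage > 500000)
    errors <- c(errors, "Mileage outside allowed range")
  if (object@Year < 1980 || object@Year > 2017)
    errors <- c(errors, "Year outside allowed range")
  if (length(errors) == 0) TRUE else errors
}

setClass("usedcars",
         representation(Price = "numeric", Mileage = "numeric", Year = "numeric"),
         validity = checkCar)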


This code creates a class "usedcars" which utilizes a validity function "checkCar" to make sure supplied values fit within the constraints of Price, Mileage, and Year we have specified. In the next code segment we perform some tests on the class, its slots, and then create an instance of the usedcars class with the formal call to the new() function:
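Roughly (the slot values here are assumptions):

getClass("usedcars")    # shows the slots: Price, Mileage, Year
slotNames("usedcars")

inv101 <- new("usedcars", Price = 12500, Mileage = 42000, Year = 2014)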


The console output for executing these commands is:


The getClass() function returns the 'slots' of the class "usedcars", which can be useful for making sure that when values are being passed for insertion into a new instance of "usedcars" that the correct slot names are utilized. inv101 receives the new instance of "usedcars" and since there were no error messages returned to the console we can be confident that the correct data types were passed and the values were in the allowed ranges.

In the below code we test for S4 and object type, test the validity detection function of the "usedcars" class, and directly access values using the @ operator (which differs from the S3 $ operator):
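A sketch of those tests (the out-of-range values are assumptions):

isS4(inv101)    # TRUE
otype(inv101)   # "S4"
inv101@Price    # direct slot access with @, rather than S3's $

# each of these raises a validity error, so the object is never created:
inv102 <- new("usedcars", Price = 9999999, Mileage = 42000, Year = 2014)
inv103 <- new("usedcars", Price = 9999999, Mileage = 42000, Year = 1905)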



The integrity check successfully caught the out-of-specification attempts to assign values when creating inv102 and inv103 above. The price for inv102 was entered too high in error, and both the price and year values for inv103 were in error.



Monday, February 13, 2017

Matrix Operations



In the R language the transpose of a given matrix is returned by the t() function. For the given 6x17 matrix 'A', calling A.transpose <- t(A) produces the 17x6 matrix below:


For the given 6x167 matrix 'B' the transpose is 167x6 and is too large to present in full here:





Multiplying a matrix by a vector is a direct operation, requiring only that the supplied vector be of a matching dimension:
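It is worth noting, as a sketch, that R offers two distinct multiplications here: `*` is element-wise and recycles the vector down the columns, while `%*%` is the linear-algebra matrix-vector product:

A <- matrix(1:100, nrow = 6)   # 6x17 (1:100 is recycled to fill 102 cells, with a warning)
v <- rep(c(9, 4), c(8, 9))     # a length-17 vector, matching ncol(A)

A.elementwise <- A * v         # element-wise: v recycles down the columns
A.matmul      <- A %*% v       # matrix product: (6x17) %*% (17x1) -> 6x1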



To calculate the inverse of a matrix, the matrix needs to be square, so the given A and B matrices will not work and we will need to make a square matrix. We can use R to generate a square matrix, named 'C', of 10x10 dimensions populated by random numbers and then calculate the inverse of that matrix:


In the 'MASS' library, however, there is also an implementation of the Moore-Penrose generalized inverse of a matrix that can be used on non-square matrices. If we import the MASS library and then call this generalized inverse function, ginv(), we can calculate a generalized inverse of the A or B matrix:
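A quick sketch verifying the defining Moore-Penrose property on A:

library(MASS)
A <- matrix(1:100, nrow = 6)       # 6x17, as given (recycling warning expected)
A.ginv <- ginv(A)                  # 17x6 pseudoinverse
all.equal(A %*% A.ginv %*% A, A)   # TRUE: A %*% ginv(A) %*% A recovers A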


To calculate the determinant of a matrix, the matrix once again needs to be square. For this we can use the same code we used to generate the 10x10 matrix C in order to generate a new matrix, D, of dimensions 10x10. The function det() will return the determinant:


Full Code:

# Find the values of transpose of a matrix, multiplying a matrix by a 
# vector, inverse of a matrix, determinant of a matrix by using the 
# following values:

#  A=matrix(1:100, nrow=6) 
#  B=matrix(1:1000, nrow=6)

# 1. transpose of a matrix
# 2. multiply by a vector
# 3. inverse of a matrix
# 4. determinant of a matrix

A <- matrix(1:100, nrow=6)  
B <- matrix(1:1000, nrow=6)

# matrix transpose can be performed with the 't()' function
A.transpose <- t(A)
B.transpose <- t(B)

# multiplying a matrix by a vector works when the vector's length evenly
# divides the number of elements in the matrix; `*` is element-wise and
# recycles the vector down the columns. A has 6x17 = 102 elements, so the
# length-17 vector recycles cleanly while the length-18 vector below
# triggers a recycling warning.
A.vectormult <- A * c(9,9,9,9,9,9,9,9,4,4,4,4,4,4,4,4,4)
A.vectormult_fail <- A * c(9,9,9,9,9,9,9,9,4,4,4,4,4,4,4,4,4,5)

# the inverse of a matrix is found with the solve() function, but it
# can only be calculated on square matrices, so neither A nor B will
# work as neither is square and we will need to create a square matrix.
# To create a 10x10 matrix of random values we can use rnorm(), which
# generates random deviates on a normal distribution. Each value gets
# multiplied by 10 to create values larger than 1 or 2, converted to an
# absolute value to make all our numbers positive, and finally truncated
# to whole integers to be more presentable. The replicate() function
# makes 10 such vectors into a final 10x10 matrix.

C <- replicate(10,as.integer(abs(rnorm(10)*10)))
C.inverse <- solve(C)

# for non-square matrices it is possible to calculate the inverse
# using the Moore-Penrose generalized inverse of a matrix included
# in the MASS library and called via the ginv() function:

library(MASS)
ginv(A)


# the determinant of a matrix can be found with the det() function.
# Determinants, as with the inverse of a matrix, also require a square
# matrix. We can create a new 10x10 matrix to be called D and then
# calculate the determinant of D:

D <- replicate(10,as.integer(abs(rnorm(10)*10)))
D.det <- det(D)

Tuesday, February 7, 2017

Working with Statistics


We are provided a sample of data consisting of patient blood pressure measurements and three binary variables reflecting three different doctors' assessments of each patient's condition. The goal is to generate boxplots to visually describe the relationship between blood pressure and each level of doctor's assessment, along with a histogram of the patient blood pressure measurements.

The table of CSV-formatted data was pasted into a file and bad characters removed. The below R code reads in the CSV data, creates the three boxplots (one for each level of doctor's review), and the histogram of the blood pressure values:

# The following data was collected by the local hospital. This data set contains 
# 5 variables based on 8 patients. In addition to the measurements of the patients
# checking in to the hospital that night, this data provides the patients' histories
# regarding the frequency of their visits to the hospital in the last 12 months.
# This data displays the measurement of blood pressure, first assessment by general 
# doctor (bad=1, good =0) titled "first," the second assessment by external doctor
# (called "second"), and the last row provides the head of the emergency unit's 
# decision regarding immediate care for the patient based on the values 0 or 1 (low = 0, high =1).
# Create a side-by-side boxplot and histogram. Discuss the outcome of your findings.
#
# "Freq","bloodp","first”, "second”, ”finaldecision”
# "0.6","103","bad","low","low”
# "0.3","87","bad","low","high”
# "0.4","32","bad","high","low”
# "0.4","42","bad","high","high"
# "0.2","59","good","low","low”
# "0.6","109","good","low","high”
# "0.3","78","good","high","low”
# "0.4","205","good","high","high”
# "0.9","135",”NA","high","high"
# "0.2","176",”bad","high","high”

# Here is Clarification hint:
# Frequency <- c(0.6,0.3,0.4,......
# BP <- c(103,87,32,42,.....
# First <- c(1,1,1,.....
# Second <- c(0,0,1,1,... 
# FinalDecision <- c(0,1,0,1,...
# Your objective is to generate the code for the boxplots – Patients' BPs & MDs' Ratings:

patients <- read.csv(file.choose())

boxplot(bloodp ~ first, data=patients)
boxplot(bloodp ~ second, data=patients)
boxplot(bloodp ~ finaldecision, data=patients)

hist(patients$bloodp)

The histogram of the patient blood pressure values generated by the last line of R code above is presented below:




R automatically decided to create five bins or breaks for the blood pressure data, creating ranges of 50 points each. The boxplots for the first, second, and final decision doctor evaluations are:

Blood Pressure / First Assessment
Blood Pressure / Second Assessment
Blood Pressure / Final Decision

The first boxplot has an inverted order compared to the other two plots due to the different words used in the first assessment than in the second and final assessments (bad/good vs. low/high); R orders the factor levels, and therefore the boxes, alphabetically. Beyond that quirk, looking at the data we can see that the median blood pressure value for patients of concern (bad/high) rises as we move from the first, to the second, to the final decision. The median value for good/low patients likewise moves lower from the first, to the second, to the final decision.
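As a sketch, the three plots could also be drawn side by side on a single device, which is closer to the "side-by-side boxplot" the assignment asked for:

par(mfrow = c(1, 3))   # one row, three plotting columns
boxplot(bloodp ~ first,         data = patients, main = "First Assessment")
boxplot(bloodp ~ second,        data = patients, main = "Second Assessment")
boxplot(bloodp ~ finaldecision, data = patients, main = "Final Decision")
par(mfrow = c(1, 1))   # restore the default layout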

Wednesday, February 1, 2017

Writing R Functions: Monogram Frequency Count

For a first function in the R language I created a simple function to perform a monogram frequency count on a string and return a table containing the counts. Monogram frequency counts are used as a basic diagnostic technique in written language identification and cryptanalysis. A provided string of text, either plaintext or ciphertext, can reveal some basic characteristics when the frequency of occurrence for each individual letter is determined.

The frequency of occurrence for each letter varies from one language to another, although the results are sensitive to the amount of text provided since small text segments may skew results due to quirks of subject matter or an author's word choice. When performing analysis on ciphertext the frequency counts can help indicate whether the ciphertext came from a substitution or a transposition system. In a substitution system the letter frequencies will change. For example, in English the most commonly used letters are E, A, R, I, O, T, N, S according to an analysis of the Concise Oxford Dictionary (11th Edition revised, 2004), with these 8 letters comprising 61% of the letters in the dictionary.

If the frequency distribution of a given set of ciphertext roughly matches the frequency distribution of English, and the plaintext is believed to be in English, then the order of the letters has likely changed but not their meaning, so it was likely a transposition cipher that created the ciphertext.

If the frequency distribution of a given set of ciphertext is different than the distribution of English and the plaintext is believed to be in English then the meaning of each letter has been changed. A plaintext 'E' could now be a 'J' in ciphertext if the distribution shows 'J' is the most frequently occurring letter and to decipher the message we could start by changing 'J's back to 'E's as we attempt to recover the substitution alphabet and decipher the entire message.

For example, taking some text from a current CNN article:

THERES A PURGE OF SPIES UNDERWAY IN MOSCOW WHERE TWO HIGH-RANKING RUSSIAN SECURITY SERVICE AGENTS A CYBERSECURITY EXPERT AND A FOURTH MAN HAVE BEEN CHARGED WITH TREASON FOR PASSING ALONG SECRETS TO AMERICAN INTELLIGENCE ACCORDING TO A LAWYER DEFENDING ONE OF THE MEN THE MEN WERE CHARGED WITH TREASON IN FAVOR OF THE UNITED STATES SAID IVAN PAVLOV THE LAWYER FOR ONE OF THE DEFENDANTS SO FAR THE COUNTERINTELLIGENCE RAID IS TARGETING COMPUTER SECURITY PROFESSIONALS MEN ONCE TRUSTED WITH RUSSIAN GOVERNMENT SECRETS ABOUT HACKING OPERATIONS THE CRACKDOWN COMES SHORTLY AFTER THE US INTELLIGENCE OFFICIALS IN OCTOBER OFFICIALLY ACCUSED RUSSIA OF USING HACKERS TO TRY STEERING THE PRESIDENTIAL ELECTION TO DONALD TRUMP AMERICAN OFFICIALS HAVE NEVER STATED THAT RUSSIAN GOVERNMENT INSIDERS GAVE THEM INFORMATION THAT LED TO THAT ACCUSATION

Using a substitution cipher that shifts the alphabet 13 characters generates the below ciphertext:

GURERF N CHETR BS FCVRF HAQREJNL VA ZBFPBJ JURER GJB UVTU-ENAXVAT EHFFVNA FRPHEVGL FREIVPR NTRAGF N PLOREFRPHEVGL RKCREG NAQ N SBHEGU ZNA UNIR ORRA PUNETRQ JVGU GERNFBA SBE CNFFVAT NYBAT FRPERGF GB NZREVPNA VAGRYYVTRAPR NPPBEQVAT GB N YNJLRE QRSRAQVAT BAR BS GUR ZRA GUR ZRA JRER PUNETRQ JVGU GERNFBA VA SNIBE BS GUR HAVGRQ FGNGRF FNVQ VINA CNIYBI GUR YNJLRE SBE BAR BS GUR QRSRAQNAGF FB SNE GUR PBHAGREVAGRYYVTRAPR ENVQ VF GNETRGVAT PBZCHGRE FRPHEVGL CEBSRFFVBANYF ZRA BAPR GEHFGRQ JVGU EHFFVNA TBIREAZRAG FRPERGF NOBHG UNPXVAT BCRENGVBAF GUR PENPXQBJA PBZRF FUBEGYL NSGRE GUR HF VAGRYYVTRAPR BSSVPVNYF VA BPGBORE BSSVPVNYYL NPPHFRQ EHFFVN BS HFVAT UNPXREF GB GEL FGRREVAT GUR CERFVQRAGVNY RYRPGVBA GB QBANYQ GEHZC NZREVPNA BSSVPVNYF UNIR ARIRE FGNGRQ GUNG EHFFVNA TBIREAZRAG VAFVQREF TNIR GURZ VASBEZNGVBA GUNG YRQ GB GUNG NPPHFNGVBA
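(As an aside, R can apply this ROT13 shift directly with chartr(), as a sketch; the plaintext variable name is assumed:)

rot13 <- function(x) chartr("A-Za-z", "N-ZA-Mn-za-m", x)
ciphertext <- rot13(plaintext)   # assumes the article text is stored in `plaintext`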

Creating a simple R language function to tally the letter frequency would certainly be better than trying to count them ourselves:

freq.monogram <- function(x) {
  # Calculates the alphabetic monogram frequency for an input
  # string and returns a table containing the results. All
  # alphabetic characters are normalized to uppercase.
  #
  # Args:
  #   x: String to be processed
  #
  # Returns:
  #   Table containing the monogram frequency counts for each
  #   alphabetic character in the string.
  
  monogram <- vector()
  x <- toupper(x)
  for (i in 1:nchar(x)) {
    current <- substr(x,i,i)
    if( current %in% LETTERS) {
      monogram <- append(monogram, current)
    }
  }
  return(table(monogram))
}
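An equivalent vectorized sketch that avoids the character-by-character loop, followed by a sample call (again assuming the article text is stored in `plaintext`):

freq.monogram2 <- function(x) {
  # split into single characters, keep the alphabetic ones, and tabulate
  chars <- strsplit(toupper(x), "")[[1]]
  table(chars[chars %in% LETTERS])
}

freq.monogram(plaintext)   # returns the monogram counts as a table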


And now we can simply pass this new function, freq.monogram(), a string and it will perform the counts, returning the values in a table which we could then plot to see if it matches what we expect for English (or some other language). The results for the original CNN text (plaintext) reflect an expected frequency distribution, with the highest frequency letters being E, T, N, A, R, I, S, O:



The results for the ciphertext created with the substitution system do not match English, with the highest frequency letters being R, G, N, A, V, E, F, B: