Data Science and Visualization Explorations: February 2017

Sunday, February 26, 2017

File I/O, Transforms, and Filtering

File input and output in the R language has many functions that trace to read.table(). Data files commonly use comma separated value (csv) format, but other separator characters such as tab and pipe are also in use, and the read.table() function provides the option to specify a custom separator character when reading a file (as does write.table() when writing one). The read.table() function can also read a file directly over a network via HTTP when provided a URL.
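As a minimal sketch of the custom separator option described above, the following writes a small pipe-separated file to a temporary location and reads it back. The data frame and its column names are made up for illustration only.

```r
# Hypothetical three-row data set used only for illustration
scores <- data.frame(Name  = c("Ann", "Bob", "Cid"),
                     Grade = c(91, 78, 85))

# Write the data pipe-separated to a temporary file
path <- tempfile(fileext = ".txt")
write.table(scores, file = path, sep = "|", row.names = FALSE, quote = FALSE)

# Read it back, telling read.table() about the header row and the separator
roundtrip <- read.table(path, header = TRUE, sep = "|",
                        stringsAsFactors = FALSE)
print(roundtrip)
```

The same sep argument accepts "\t" for tab separated files or "," for csv.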

If the programmer does not want to supply the R script with a filename and path the file.choose() function can be nested within the read.table() or read.csv() function. The file.choose() function will provide an operating system file selection dialog to browse to a file which then gets passed to read.table() or read.csv() for processing.

The partner function to read.table() is write.table(), with the notable exception that write.table() cannot write a file over HTTP (it cannot PUT a file to a URL).

To explore file I/O and to experiment with the plyr library we will read in a provided file, determine the average test score broken down by sex for the data set, and then write the resulting table to a new file.

Then we will filter the data set based on names containing the letter 'i' (in either case) and write the subset of data based on that filter to a new file.

# 1) import dataset.txt
hw8 <- read.csv(file.choose())

# 2) Run a mean using Sex as the category (use plyr package for this operation),
# then write the resulting output to a file.
library(plyr)
mean_by_sex <- ddply(hw8, "Sex", transform, Grade.Average = mean(Grade))
write.table(mean_by_sex, file = "Mean_by_sex.txt")

# 3) test the DataSet.txt as a dataframe for names whose name contains the
# letter i, then create a new data set with those names. Write those names
# to a file separated by commas (CSV)
the_i_s <- subset(hw8, grepl("[iI]", Name))
write.table(the_i_s, file = "contains_i.csv", sep = ",")

The result of using ddply() with transform, applying the mean function to the Grade column grouped by the Sex column, is the table below:

Using the R language implementation of grep with a regular expression matching either the upper or lower case letter 'i' produces the table below:

These two tables were written to file in different formats, however: the average score table was output directly as a raw table and the "i" filtered table was output in csv format. The contents of these files were viewed to confirm the format as written to disk:
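The two steps above can be sketched on a small made-up data frame using base R only. This is an alternative to the plyr call, not the code that produced the tables: ddply() with transform keeps every row and adds the group mean as a new column, whereas aggregate() returns one row per group. The data values here are hypothetical.

```r
# Hypothetical stand-in for the provided data set
hw <- data.frame(Name  = c("Raul", "Ivy", "Lisa", "Tom"),
                 Sex   = c("Male", "Female", "Female", "Male"),
                 Grade = c(80, 90, 70, 60),
                 stringsAsFactors = FALSE)

# One row per Sex holding the mean Grade
mean_by_sex <- aggregate(Grade ~ Sex, data = hw, FUN = mean)

# Rows whose Name contains an upper or lower case 'i'
the_i_s <- subset(hw, grepl("i", Name, ignore.case = TRUE))

# Write the filtered rows as csv and read the raw lines back
# to confirm the format as written to disk
out <- tempfile(fileext = ".csv")
write.csv(the_i_s, file = out, row.names = FALSE)
print(mean_by_sex)
print(readLines(out))
```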

S3 and S4 Object Oriented Principles in R

The R language has three object oriented systems: S3, S4, and Reference Classes (also referred to as R5). A class refers to a type of object and attributes or values describing that specific object and every object must be a member of a class. A method refers to a function associated with an object of a given class. In this post I shall focus only upon S3 and S4 and will not touch upon R5/Reference Classes.

There are two common function calls that can be used to determine whether a given object is of S3 or S4 type. The function isS4() will return TRUE if the supplied object is S4 and FALSE if it is anything other than S4. The function otype(), which is part of the pryr package, is more robust since it is not a simple TRUE/FALSE test for S4 but rather returns the object type definitively.

In the above example churn is a valid S3 object, inv101 is a valid S4 object, and inv501 is a simple vector of three numbers. Since isS4() simply provides a TRUE or FALSE response for whether or not the provided object is S4, it is not able to differentiate between a vector and an S3 object, while otype() does differentiate.

The base type of an object can be found with a call to typeof(). Calling typeof() with an S4 object itself will return "S4", making it another means to determine if an object is S4, but like isS4() it does not positively identify S3 and instead indicates that a provided S3 object is of type "list". Passing typeof() the class member name will return the base type:
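The checks described above can be sketched with base R alone. The class names and objects below are hypothetical stand-ins for churn, inv101, and inv501, and the is.object() test is only a common approximation for spotting S3 objects (an object with a class attribute that is not S4), not a reproduction of what otype() does internally.

```r
library(methods)

# S3: a list with a class attribute (class name is hypothetical)
s3_obj <- structure(list(id = 1L), class = "customers")

# S4: a formally defined class (hypothetical) with one numeric slot
setClass("Inventory", representation(price = "numeric"))
s4_obj <- new("Inventory", price = 9999)

# A plain numeric vector, neither S3 nor S4
vec <- c(1, 2, 3)

for (obj in list(s3_obj, s4_obj, vec)) {
  cat(class(obj), "| isS4:", isS4(obj),
      "| is.object:", is.object(obj),
      "| typeof:", typeof(obj), "\n")
}
```

Here isS4() is FALSE for both the S3 object and the vector, while is.object() separates the two.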

In their 1988 First International Joint Conference of ISSAC paper (http://stepanovpapers.com/genprog.pdf) titled “Generic Programming” David Musser and Alexander Stepanov defined generic programming as centering “around the idea of abstracting from concrete, efficient algorithms to obtain generic algorithms that can be combined with different data representations to produce a wide variety of useful software.” In R, a generic function can be called without concern for the class of the objects passed to it. There may be specific methods which operate differently for different classes, or when additional arguments are supplied, but the caller simply calls the generic function. When the function is called the R interpreter attempts to route the call to a method defined for the class of the objects in the function call.

For example, the functions plot() and print() are generic functions, and the specific methods available to each generic function can be listed by calling the methods() function with the generic function. Calling methods(plot) reveals the methods available to plot(), one of which the R interpreter selects based on the class of the object supplied. For example, a generic would take one set of actions if handed a matrix, a different set of actions if handed a data.frame, etc.
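A minimal sketch of this routing, using a made-up generic named describe() (not part of base R): UseMethod() looks at the class of the first argument and hands the call to the matching method, falling back to the default method when none matches.

```r
# A hypothetical generic: UseMethod() dispatches on the class of 'x'
describe <- function(x, ...) UseMethod("describe")

describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows")
}

describe.matrix <- function(x, ...) {
  paste("matrix of", nrow(x), "x", ncol(x))
}

describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(data.frame(a = 1:3))     # routed to describe.data.frame
describe(matrix(1:6, nrow = 2))   # routed to describe.matrix
describe("text")                  # falls through to describe.default
```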

This abstraction helps simplify aspects of programming and use of functions but it can introduce unexpected behavior and incorrect or misleading results. S3 implemented an unchecked form of object oriented design and S4 was introduced to correct these shortcomings with a stricter design more akin to other existing object oriented systems. Despite the stricter nature of S4, or possibly because of this stricter nature, S3 remains the most commonly used object oriented system in the R language.

S4 requires formal definition of classes and class inheritance while S3 allows the programmer to turn a data frame or a list into a class by simply adding the class attribute to the data frame. The other substantial difference in S4 is that a generic function can be dispatched to a method based on any number of argument classes and not just one as in S3.
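As a rough illustration of dispatch on more than one argument, the sketch below defines a hypothetical S4 generic named combine() with methods selected by the classes of both x and y. The generic and its behavior are invented for this example.

```r
library(methods)

# Hypothetical generic dispatching on the classes of both arguments
setGeneric("combine", function(x, y) standardGeneric("combine"))

setMethod("combine", signature("numeric", "numeric"),
          function(x, y) x + y)

setMethod("combine", signature("character", "character"),
          function(x, y) paste(x, y))

setMethod("combine", signature("character", "numeric"),
          function(x, y) paste(x, format(y)))

combine(1, 2)        # numeric, numeric
combine("a", "b")    # character, character
combine("n =", 3)    # character, numeric
```

An S3 generic would only have looked at the class of x when choosing among these.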

In practice this helps eliminate some of the inconsistencies or misleading results that would come from passing incorrect or bad arguments to an S3 object or method. In S3 the programmer can create a class and pass objects to the class for processing, but if the programmer passes objects incompatible with the functions within that class the R interpreter will not necessarily raise any errors or warnings and may simply return NULL values. In S4 the programmer specifically states class definitions for input and output, and if a user passes incorrect data to the class function it will return an error.

For some examples I import a dataset of cellular churn data read in from a CSV file, assign a class to the data, and then explore some of the attributes and values as S3:

In the screenshot of the RStudio values below we can see that churn is now referred to as "Large customers", while before it had been assigned a class it had been a collection of 20,000 observations. Calling class(churn) also added two new attributes to the data set, attr("class") and attr("row.names").

As we execute the commands after class(churn) from the earlier screenshot we can confirm how specific class members or variables are accessed via the $ operator and the results from our tests for being S3 or S4 and base type:
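The S3 steps above can be approximated on a tiny made-up data frame. The column names, the values, and the class name "customers" are assumptions for illustration; the sketch prepends the new class so that data frame behavior is kept, which may differ from what was done with the real churn data.

```r
# Hypothetical stand-in for the churn data
churn <- data.frame(tenure = c(12, 30, 5),
                    left   = c(TRUE, FALSE, TRUE))

# Assign a new S3 class (name is an assumption), keeping data.frame behavior
class(churn) <- c("customers", class(churn))

# A method for the existing generic summary()
summary.customers <- function(object, ...) {
  cat("customers:", nrow(object), "rows, churn rate",
      mean(object$left), "\n")
  invisible(object)
}

churn$tenure          # columns are reached with the $ operator
attributes(churn)     # names, class, and row.names
summary(churn)        # dispatches to summary.customers

isS4(churn)           # FALSE: this is an S3 object
typeof(churn)         # the base type underneath is a list
```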

To explore S4 objects I created a class and a validity function, then tested the validity checker and ran tests similar to those above for S3 or S4 membership and base types:

This code creates a class "usedcars" which utilizes a validity function "checkCar" to make sure supplied values fit within the constraints of Price, Mileage, and Year we have specified. In the next code segment we perform some tests on the class, its slots, and then create an instance of the usedcars class with the formal call to the new() function:

The console output for executing these commands is:

The getClass() function returns the 'slots' of the class "usedcars", which can be useful for making sure that when values are being passed for insertion into a new instance of "usedcars" that the correct slot names are utilized. inv101 receives the new instance of "usedcars" and since there were no error messages returned to the console we can be confident that the correct data types were passed and the values were in the allowed ranges.

In the below code we test for S4 and object type, test the validity detection function of the "usedcars" class, and directly access values using the @ referential operator (which differs from the S3 $ operator):

The integrity check successfully caught the out of specification attempts to assign values when creating inv102 and inv103 above. The price for inv102 was entered too high in error, and both the price and year values for inv103 were in error.
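The S4 workflow described above can be sketched as follows. The class name, validity function name, and slot names follow the text, but the allowed ranges and the sample values are hypothetical, chosen only so that one object passes and one fails.

```r
library(methods)

# Validity function: returns TRUE or a character vector of problems.
# The allowed ranges below are hypothetical.
checkCar <- function(object) {
  problems <- character()
  if (object@Price < 0 || object@Price > 100000)
    problems <- c(problems, "Price out of range")
  if (object@Mileage < 0)
    problems <- c(problems, "Mileage must be non-negative")
  if (object@Year < 1980 || object@Year > 2017)
    problems <- c(problems, "Year out of range")
  if (length(problems) == 0) TRUE else problems
}

setClass("usedcars",
         representation(Price = "numeric", Mileage = "numeric",
                        Year = "numeric"),
         validity = checkCar)

getClass("usedcars")  # lists the slots and their types

inv101 <- new("usedcars", Price = 8500, Mileage = 92000, Year = 2009)
inv101@Price          # slots are reached with @, not $
isS4(inv101)

# An out-of-range Price makes new() raise an error
inv102 <- try(new("usedcars", Price = 850000, Mileage = 40000, Year = 2012),
              silent = TRUE)
inherits(inv102, "try-error")
```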

Monday, February 13, 2017

Matrix Operations

In the R language the transpose of a given matrix is returned by the t() function. For the given 6x17 matrix 'A', calling A.transpose <- t(A) returns the 17x6 matrix below:

For the given 6x167 matrix 'B' the transpose is 167x6 and is too large to present in full here:

Multiplying a matrix by a vector is a simple and direct action, simply requiring that the length of the supplied vector match the number of columns of the matrix:
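A small sketch of the dimension requirement, using a made-up 2x3 matrix. It also contrasts the matrix product operator %*% with plain *, which multiplies element by element and recycles the vector rather than checking dimensions.

```r
M <- matrix(1:6, nrow = 2)      # 2 x 3 matrix, filled column by column
v <- c(1, 0, 2)                 # length matches ncol(M)

M %*% v                         # matrix product: a 2 x 1 column
M * c(10, 100)                  # element-wise: vector recycled down the columns

# A vector of the wrong length is non-conformable for %*%
bad <- try(M %*% c(1, 2), silent = TRUE)
inherits(bad, "try-error")
```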

To calculate the inverse of a matrix the matrix needs to be square, so the given A and B matrices will not work and we will need to make a square matrix. We can use R to generate a square matrix, named 'C', of 10x10 dimensions populated by random numbers and then calculate the inverse of that matrix:
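As a quick check on what solve() returns, the sketch below uses a small fixed matrix (chosen for illustration, not taken from the assignment) and confirms that multiplying by the inverse gives the identity. It also shows that a square matrix is not enough on its own: a singular matrix has no inverse.

```r
S <- matrix(c(2, 1, 1, 3), nrow = 2)   # small invertible example
S.inverse <- solve(S)

# The product with the inverse should be the identity, up to rounding
all.equal(S %*% S.inverse, diag(2))

# A singular matrix (second column is twice the first) has no inverse
singular <- matrix(c(1, 2, 2, 4), nrow = 2)
failed <- try(solve(singular), silent = TRUE)
inherits(failed, "try-error")
```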

In the 'MASS' library, however, there is also an implementation of the Moore-Penrose generalized inverse of a matrix that can be used on non-square matrices. If we import the MASS library and then call this generalized inverse function, ginv(), we can calculate the generalized inverse of the A or B matrix:
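To give a feel for what a generalized inverse is, here is a hand-rolled sketch built on the singular value decomposition in base R. The helper name pinv() and the tolerance are assumptions made for this example; in practice ginv() from MASS is the function to call.

```r
# Hand-rolled pseudoinverse via the SVD (illustrative helper, not MASS code).
# Singular values below a relative tolerance are dropped.
pinv <- function(X, tol = sqrt(.Machine$double.eps)) {
  s <- svd(X)
  keep <- s$d > tol * s$d[1]
  s$v[, keep, drop = FALSE] %*%
    (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

R <- matrix(as.numeric(1:6), nrow = 2)   # 2 x 3, so solve() does not apply
R.pinv <- pinv(R)

dim(R.pinv)                              # transposed shape: 3 x 2
all.equal(R %*% R.pinv %*% R, R)         # a defining Moore-Penrose property
```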

To calculate the determinant of a matrix the matrix, once again, needs to be square. For this we can use the same code we used to generate the 10x10 C matrix in order to generate a new matrix, D, of dimensions 10x10. The function det() will return the determinant:
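A short sketch on a fixed 2x2 matrix (values chosen for illustration) where the determinant is easy to verify by hand. It also shows the related determinant() function, which returns the log of the modulus and the sign rather than the value itself.

```r
S <- matrix(c(2, 1, 1, 3), nrow = 2)
S.det <- det(S)                 # 2*3 - 1*1 = 5

# determinant() returns the log-modulus and the sign separately
d <- determinant(S, logarithm = TRUE)
rebuilt <- as.numeric(d$sign * exp(d$modulus))

# A zero determinant signals a singular matrix with no inverse
singular <- matrix(c(1, 2, 2, 4), nrow = 2)
singular.det <- det(singular)

print(c(S.det, rebuilt, singular.det))
```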

Full Code:

# Find the values of transpose of a matrix, multiplying a matrix by a
# vector, inverse of a matrix, determinant of a matrix by using the
# following values:

# A=matrix(1:100, nrow=6)
# B=matrix(1:1000, nrow=6)

# 1. transpose of a matrix
# 2. multiply by a vector
# 3. inverse of a matrix
# 4. determinant of a matrix

A <- matrix(1:100, nrow=6)
B <- matrix(1:1000, nrow=6)

# matrix transpose can be performed with the 't()' function
A.transpose <- t(A)
B.transpose <- t(B)

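# multiplying a matrix by a vector uses the matrix product operator %*%
# (plain * multiplies element-wise and recycles the vector); the vector
# length must match the number of columns of the given matrix
A.vectormult <- A %*% c(9,9,9,9,9,9,9,9,4,4,4,4,4,4,4,4,4)
# an 18-element vector does not conform to the 17 columns of A and raises
# an error, caught here with try() so the rest of the script still runs
A.vectormult_fail <- try(A %*% c(9,9,9,9,9,9,9,9,4,4,4,4,4,4,4,4,4,5))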

# the inverse of a matrix is found with the solve() function, but it
# can only be calculated on square matrices, so neither A nor B will
# work and we will need to create a square matrix. To create a 10x10
# matrix of random values we can use rnorm(), which generates random
# deviates on a normal distribution. Each deviate is multiplied by 10
# to create a value larger than 1 or 2, converted to an absolute value
# to make all our numbers positive, and finally truncated to a whole
# integer to be more presentable. The replicate() function makes 10
# such vectors into a final 10x10 matrix.

C <- replicate(10,as.integer(abs(rnorm(10)*10)))
C.inverse <- solve(C)

# for non-square matrices it is possible to calculate a generalized
# inverse using the Moore-Penrose generalized inverse of a matrix
# included in the MASS library and called via the ginv() function:

library(MASS)
ginv(A)

# the determinant of a matrix can be found with the det() function,
# a convenience wrapper around determinant(). Determinants, as with
# the inverse of a matrix, also require a square matrix. We can
# create a new 10x10 matrix to be called D and then calculate the
# determinant of D:

D <- replicate(10,as.integer(abs(rnorm(10)*10)))
D.det <- det(D)

Tuesday, February 7, 2017

Working with Statistics

The provided sample of data consists of patient blood pressure measurements and binary variables reflecting three different doctors' assessments of each patient's condition. The goal is to generate a boxplot to help visually describe the relationship between the blood pressure value and each level of a doctor's assessment, along with a histogram of the patient blood pressure measurements.

The table of CSV formatted data was pasted into a file and bad characters removed. The below R code reads in the CSV data, creates the three boxplots (one for each doctor's assessment), and the histogram of the blood pressure values:

# The following data was collected by the local hospital. This data set contains
# 5 variables based on 8 patients. In addition to the measurements of the patients
# checking in to the hospital that night, this data provides the patients' histories
# regarding the frequency of their visits to the hospital in the last 12 months.
# This data displays the measurement of blood pressure, first assessment by general
# doctor (bad=1, good =0) titled "first," the second assessment by external doctor
# (called "second"), and the last row provides the head of the emergency unit's
# decision regarding immediate care for the patient based on the values 0 or 1 (low = 0, high =1).
# Create a side-by-side boxplot and histogram. Discuss the outcome of your findings.
#
# "Freq","bloodp","first”, "second”, ”finaldecision”
# "0.6","103","bad","low","low”
# "0.3","87","bad","low","high”
# "0.4","32","bad","high","low”
# "0.4","42","bad","high","high"
# "0.2","59","good","low","low”
# "0.6","109","good","low","high”
# "0.3","78","good","high","low”
# "0.4","205","good","high","high”
# "0.9","135",”NA","high","high"
# "0.2","176",”bad","high","high”

# Here is Clarification hint:
# Frequency <- c(0.6,0.3,0.4,......
# BP <- c(103,87,32,42,.....
# First <- c(1,1,1,.....
# Second <- c(0,0,1,1,...
# FinalDecision <- c(0,1,0,1,...
# Your objectives is to generate the code for Code and Boxplots – Patients BPs & MD’s Ratings:

patients <- read.csv(file.choose())

boxplot(bloodp ~ first, data=patients)
boxplot(bloodp ~ second, data=patients)
boxplot(bloodp ~ finaldecision, data=patients)

hist(patients$bloodp)

The histogram of the patient blood pressure values generated by the last line of R code above is presented below:

R automatically decided to create five bins or breaks for the blood pressure data, creating ranges of 50 points each. The boxplots for the first, second, and final decision doctor evaluations are:

Blood Pressure / First Assessment

Blood Pressure / Second Assessment

Blood Pressure / Final Decision

The first boxplot has an inverted order compared to the other two plots due to the different words used in the first assessment than the second and final assessments (bad/good vs low/high). R places the plots into alphabetical order. Beyond that quirk, looking at the data we can see that the mean blood pressure value for patients of concern (bad/high) rises as we move from first, to second, to final decision. The mean value for good/low patients likewise moved lower from first, to second, to final decisions.
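The bin choices and the group means discussed above can be checked numerically. The sketch below retypes the values from the data listing in the code comments as plain vectors (the vector names are my own) and uses hist() with plot = FALSE and tapply() to get the numbers without drawing anything.

```r
bloodp <- c(103, 87, 32, 42, 59, 109, 78, 205, 135, 176)
first  <- c("bad", "bad", "bad", "bad", "good", "good", "good", "good",
            NA, "bad")
second <- c("low", "low", "high", "high", "low", "low", "high", "high",
            "high", "high")
final  <- c("low", "high", "low", "high", "low", "high", "low", "high",
            "high", "high")

# Bin edges and counts chosen by hist(), without drawing the plot
h <- hist(bloodp, plot = FALSE)
h$breaks
h$counts

# Mean blood pressure for each level of each assessment
# (tapply() drops the patient whose first assessment is NA)
mean.first  <- tapply(bloodp, first, mean)
mean.second <- tapply(bloodp, second, mean)
mean.final  <- tapply(bloodp, final, mean)
print(list(first = mean.first, second = mean.second, final = mean.final))
```

Note that the line inside each box of a boxplot marks the median, so the means printed here will not line up exactly with the plotted lines.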

Wednesday, February 1, 2017

Writing R Functions: Monogram Frequency Count

For a first function in the R language I created a simple function to perform a monogram frequency count on a string and return a table containing the counts. Monogram frequency counts are used as a basic diagnostic technique in written language identification and cryptanalysis. A provided string of text, either plaintext or ciphertext, can reveal some basic characteristics when the frequency of occurrence for each individual letter is determined.

The frequency of occurrence for each letter varies from one language to another, although the results are sensitive to the amount of text provided since small text segments may skew results due to quirks of a subject or word choice of an author. For example, in English the most commonly used letters are E, A, R, I, O, T, N, S according to an analysis of the Concise Oxford Dictionary (11th Edition revised, 2004), with these 8 letters comprising 61% of the letters in the dictionary. When performing analysis on ciphertext the frequency can help indicate if the ciphertext is from a substitution or a transposition system. In a substitution system the letter frequencies will change, while in a transposition system they will not.

If a frequency distribution of a given set of ciphertext roughly matches the frequency distribution of English and the plaintext is believed to be in English then the order of the letters has likely changed but not their meaning so it is likely a transposition cipher that created the ciphertext.

If the frequency distribution of a given set of ciphertext is different than the distribution of English and the plaintext is believed to be in English then the meaning of each letter has been changed. A plaintext 'E' could now be a 'J' in ciphertext if the distribution shows 'J' is the most frequently occurring letter and to decipher the message we could start by changing 'J's back to 'E's as we attempt to recover the substitution alphabet and decipher the entire message.

For example, taking some text from a current CNN article:

THERES A PURGE OF SPIES UNDERWAY IN MOSCOW WHERE TWO HIGH-RANKING RUSSIAN SECURITY SERVICE AGENTS A CYBERSECURITY EXPERT AND A FOURTH MAN HAVE BEEN CHARGED WITH TREASON FOR PASSING ALONG SECRETS TO AMERICAN INTELLIGENCE ACCORDING TO A LAWYER DEFENDING ONE OF THE MEN THE MEN WERE CHARGED WITH TREASON IN FAVOR OF THE UNITED STATES SAID IVAN PAVLOV THE LAWYER FOR ONE OF THE DEFENDANTS SO FAR THE COUNTERINTELLIGENCE RAID IS TARGETING COMPUTER SECURITY PROFESSIONALS MEN ONCE TRUSTED WITH RUSSIAN GOVERNMENT SECRETS ABOUT HACKING OPERATIONS THE CRACKDOWN COMES SHORTLY AFTER THE US INTELLIGENCE OFFICIALS IN OCTOBER OFFICIALLY ACCUSED RUSSIA OF USING HACKERS TO TRY STEERING THE PRESIDENTIAL ELECTION TO DONALD TRUMP AMERICAN OFFICIALS HAVE NEVER STATED THAT RUSSIAN GOVERNMENT INSIDERS GAVE THEM INFORMATION THAT LED TO THAT ACCUSATION

Using a substitution cipher that shifts the alphabet 13 characters generates the below ciphertext:

GURERF N CHETR BS FCVRF HAQREJNL VA ZBFPBJ JURER GJB UVTU-ENAXVAT EHFFVNA FRPHEVGL FREIVPR NTRAGF N PLOREFRPHEVGL RKCREG NAQ N SBHEGU ZNA UNIR ORRA PUNETRQ JVGU GERNFBA SBE CNFFVAT NYBAT FRPERGF GB NZREVPNA VAGRYYVTRAPR NPPBEQVAT GB N YNJLRE QRSRAQVAT BAR BS GUR ZRA GUR ZRA JRER PUNETRQ JVGU GERNFBA VA SNIBE BS GUR HAVGRQ FGNGRF FNVQ VINA CNIYBI GUR YNJLRE SBE BAR BS GUR QRSRAQNAGF FB SNE GUR PBHAGREVAGRYYVTRAPR ENVQ VF GNETRGVAT PBZCHGRE FRPHEVGL CEBSRFFVBANYF ZRA BAPR GEHFGRQ JVGU EHFFVNA TBIREAZRAG FRPERGF NOBHG UNPXVAT BCRENGVBAF GUR PENPXQBJA PBZRF FUBEGYL NSGRE GUR HF VAGRYYVTRAPR BSSVPVNYF VA BPGBORE BSSVPVNYYL NPPHFRQ EHFFVN BS HFVAT UNPXREF GB GEL FGRREVAT GUR CERFVQRAGVNY RYRPGVBA GB QBANYQ GEHZC NZREVPNA BSSVPVNYF UNIR ARIRE FGNGRQ GUNG EHFFVNA TBIREAZRAG VAFVQREF TNIR GURZ VASBEZNGVBA GUNG YRQ GB GUNG NPPHFNGVBA
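A 13 character shift like the one above can be sketched in a few lines with chartr(). The helper name rot13() is my own; characters outside A-Z (spaces, hyphens) pass through unchanged.

```r
# Map A-M onto N-Z and N-Z onto A-M; other characters pass through
rot13 <- function(x) {
  plain.alphabet  <- paste(LETTERS, collapse = "")
  cipher.alphabet <- paste(c(LETTERS[14:26], LETTERS[1:13]), collapse = "")
  chartr(plain.alphabet, cipher.alphabet, x)
}

rot13("THERES A PURGE OF SPIES")
rot13(rot13("THERES A PURGE OF SPIES"))   # applying it twice restores the text
```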

Creating a simple R language function to tally the letter frequency would certainly be better than trying to count them ourselves:

freq.monogram <- function(x) {
  # Calculates the alphabetic monogram frequency for an input
  # string and returns a table containing the results. All
  # alphabetic characters are normalized to uppercase.
  #
  # Args:
  #   x: String to be processed
  #
  # Returns:
  #   Table containing the monogram frequency counts for each
  #   alphabetic character in the string.

  monogram <- vector()
  x <- toupper(x)
  # seq_len() avoids the 1:0 trap when x is an empty string
  for (i in seq_len(nchar(x))) {
    current <- substr(x, i, i)
    if (current %in% LETTERS) {
      monogram <- append(monogram, current)
    }
  }
  return(table(monogram))
}

And now we can simply pass this new function, freq.monogram(), a string and it will perform the counts, returning the values in a table which we could then plot to see if it matches what we expect for English (or some other language). The results for the original CNN text (plaintext) reflect an expected frequency distribution with the highest frequency letters being E, T, N, A, R, I, S, O:

The results for the ciphertext created with a substitution system do not match English, with the highest frequency letters being R, G, N, A, V, E, F, B:
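To rank letters from most to least frequent, the returned table can be passed through sort(). The sketch below is self-contained, so it uses a compact vectorized counter and a shift helper of my own naming (count.letters() and rot13()) on a short made-up message rather than the article text.

```r
# Compact, vectorized letter count (hypothetical helper name)
count.letters <- function(x) {
  chars <- strsplit(toupper(x), "")[[1]]
  table(chars[chars %in% LETTERS])
}

# 13 character alphabet shift (hypothetical helper name)
rot13 <- function(x) {
  chartr(paste(LETTERS, collapse = ""),
         paste(c(LETTERS[14:26], LETTERS[1:13]), collapse = ""),
         x)
}

plain  <- "ATTACK AT DAWN"
cipher <- rot13(plain)

# Most frequent letters first
plain.counts  <- sort(count.letters(plain), decreasing = TRUE)
cipher.counts <- sort(count.letters(cipher), decreasing = TRUE)
print(plain.counts)
print(cipher.counts)
```

The counts themselves are unchanged by the substitution; only the letters they are attached to move, which is what the frequency comparison picks up.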