Wednesday, November 30, 2016

Some Visualizations of The 2016 Election

The outcome of the 2016 Presidential Election surprised most political analysts and pollsters. While it had been considered a close race (with, as it turned out, the popular vote conflicting with the Electoral College results), the conventional wisdom among the experts was that Hillary Clinton would win the election. On election night Donald Trump won by an Electoral College landslide, prompting many of those same experts to question how and why their polls and analysis had incorrectly predicted the outcome.


Obtaining The Data

Analysis cannot be performed without data, so the first step was to obtain election results. Emil Kirkegaard, a self-described statistician from Denmark, posted a dataset soon after the election at https://github.com/Deleetdk/USA.county.data containing county-level election results combined with other county-level demographic data, one row per county, in CSV format. After taking a look at the demographic data, there was one set of county data missing that I wanted to include in my analysis: population density. That data, from the 2010 Census, is available from the US Government's Census website at http://www.census.gov/quickfacts/meta/long_LND110210.htm.


Merging and Cleaning The Data

The dataset from Mr. Kirkegaard needed to be merged with the file downloaded from the Census website. The Census data required the most effort in terms of converting and cleaning, followed by merging row-wise with the election results data and removing Census rows without a matching county in the election results. The election results file, for example, did not have county data for Alaska, requiring the deletion of all Alaska county rows from the Census data. The resulting merged and cleaned dataset contains 3111 observations with 129 variables, with some variables (or columns) containing duplicate data.


Importing Into R for Analysis and Visualization

The free and open source R language provides a powerful yet simple means of importing and manipulating datasets such as this one without having to purchase a commercial solution such as Tableau. After importing the data in CSV format we can confirm that R properly interpreted most of the columns without any guidance from us. The exception is the "trump_win" column, which was interpreted as an integer but which we instead want treated as a category, or "factor", within R.
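A quick sketch of that import and type check (the same steps appear in the full source code at the end of this post):

# read in the merged county-level CSV
election <- read.csv(file.choose())

# str() lists each column along with the type R assigned to it
str(election)

# trump_win was read in as an integer; treat it as a factor instead
election$trump_win <- as.factor(election$trump_win)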


Starting Basic: One State's Data

Before we dive in too deep, and before we've explored the full dataset, let's take one state, Maryland, and perform some very basic visualizations with R. Maryland is a solid "Blue" state, having voted for the Democratic presidential candidate in the vast majority of elections over the past century. Does that mean that every county in Maryland votes "Blue"? And how strongly "Blue" is each county? To explore this we will take the number of votes for Clinton, subtract the number of votes for Trump, and plot the results by county on a bar chart.


The results indicate that three counties voted strongly for Clinton, but more individual counties went for Trump than for Clinton (16 counties versus 7), generally by smaller margins.


Baseline Test Plot With Maps

Having tested the waters with a single state we can now stretch a bit further. To begin with, let's plot the counties won by each candidate. Note that this does not map directly to Electoral College wins, since a county may cover a large area yet, because of its low population density, not have a proportionate impact on the Electoral College vote. In R we can use the factor data in the variable $trump_win to quickly plot a map of the election results in our dataset. When the results are plotted we can also quickly see the counties for which we are missing data, but overall it looks like we are in decent shape.


We have now tested our ability to take a single binary value ($trump_win) and plot it county-wise on a map of the United States. We can now move on to trying to explore some other values and ideas using R and this visualization capability.


Voter Defections or Failure to Show at the Polls

A common theme in the media and among much of the general public was that both candidates were disliked to a greater degree than in any other recent election. Did this keep voters away from the polls? Or, in some measures a worse thought, did registered voters for one party switch and vote for the other party's candidate? The latter is harder to determine from raw numbers. Statistics are generally available for overall turnout across all registered voters, but can we use this dataset to get a sense of Democrats who stayed away or voted for the Republican candidate, and vice versa?

To attempt this we will take the number of votes for each candidate and subtract the number of registered voters for that party as of 2012. This is not going to be a very good measure: media reports indicated that large numbers of people registered to vote before the 2016 election, so the 2012 registration counts should, in theory, be lower than the actual 2016 counts, which are not in our dataset.

We will start with the Democrats and use a choropleth technique, creating six bins for the difference between the votes for Clinton and the number of registered Democrats in each county. The plot below is a bit difficult to understand due to how R created the bins across the range of values and the colors it selected. The bright purple and blue areas reflect the largest increases in votes for the Democrat after subtracting out the number of registered Democrats from 2012. This would imply increased voter registrations in 2016 not captured in our data and/or Republicans switching sides and voting for the Democratic candidate in those counties. It's worth noting the strong purple segment running straight up the center of the country from Texas to Nebraska, as well as large segments of Utah and Florida.

The olive green and pink areas reflect a negative value after subtracting the number of registered Democrats from the votes for the Democratic candidate. It is curious that large segments show a negative difference in California, Washington state, and many counties in the Northeast, areas generally considered strongly Democratic.


What happens when we do the same analysis for the Republicans? That stretch of land from Texas up to Nebraska has an olive green vein running through it, but note that R assigned a less extreme range of values to this color for this data. So, while olive green is not a "good" value, it does not represent as strong a negative value as it did on the Democratic map. That said, it does look like the Democrats strengthened in the Midwest counties even though they ultimately lost the Electoral College votes for those states in 2016.


Party Differences by County

Another piece of conventional wisdom from this election is that the country is more polarized, and that those living in densely populated urban areas are living in a different America than those in more rural areas. Does this mean that rural and urban counties are more heavily weighted toward one party than the other? One (probably less-than-ideal) way to explore this is to subtract the number of registered voters of one party from the registered voters of the other and map the difference:


In the above plot I subtracted the number of 2012 registered Democrats from the 2012 registered Republicans, meaning that counties from olive green through reddish-pink are more Democratic, while counties from green through blue to purplish-pink are more Republican.

A better measure would be to use a percentage of the registered voters:


This is based on the percentage of registered Republicans in 2012: roughly, counties from green through purplish-pink are more than 50% registered Republican, while olive green and reddish-pink counties are more than 50% registered Democrat.

What Role Does Population Density Play?

It seems that as areas become more densely populated they tend to vote Democratic. Can we actually see that influence in the data? This is why I imported the additional county data from the census.gov website, as it includes population density per square mile of land area. Is it possible to combine that population density figure with the voting data to create some sort of metric? As a test in R, I took the population density figure and divided it by the difference between the number of votes for the Republican candidate and the number of votes for the Democratic candidate. The theory was that in high-density areas that voted Democratic the result would remain large in magnitude or turn negative, while in less dense areas the result would become very small, since a small density figure is being divided by whatever margin remains after subtracting the Democratic votes from the Republican votes.


This plot does show a number of counties with negative values, and a good number of counties are olive green, hovering right around zero. In fact most of the values are very near zero, with only the far ends of the range being far removed from it. The skew is so severe that a box plot of this value in R is barely recognizable as a boxplot:
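For reference, that boxplot comes from a one-line call on the computed ratio (a sketch; votedensity is the column built in the source code at the end of this post):

boxplot(choropleth$votedensity, main = "Population Density / Vote Margin")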


This is not a good measure of anything in its current form, but it could be explored further by taking other factors into account or using a better normalization method on the values. What if, instead of a range of six categories, we simply went with a binary value and used population density itself as the deciding factor, comparing that to the vote results? We can do this by reducing the number of categories to two and then using R to calculate the accuracy of our "prediction" against the actual votes.




The resulting accuracy is basically the same as a random guess, so it would appear that a blind application of population density is insufficient. But there is a lot of other demographic data in this dataset that could be fed into a regression or classifier algorithm to possibly generate a more accurate prediction.
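As a rough sketch of what that next step might look like, a logistic regression can be fit with glm(). This is only an illustration using the population density column already in the dataset, not a tuned model; the additional demographic variables would be swapped in the same way.

# keep only rows with both a vote outcome and a density value
dat <- na.omit(data.frame(
  win = election$trump_win,
  density = as.numeric(election$Population.Density.per.square.mile.of.land.area)))

# logistic regression predicting a county Trump win from density alone
fit <- glm(win ~ density, data = dat, family = binomial)

# predicted probabilities above 0.5 count as a predicted Trump win
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
mean(pred == as.numeric(as.character(dat$win)))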



library(maps)
library(ggplot2)


# read in the datafile
election <- read.csv(file.choose())

# change the trump_win value to be a factor
election$trump_win <- as.factor(election$trump_win)

# change the 'name' variable to 'county_name'
colnames(election)[2] <- "county_name"

# change the values in 'county_name' to all lower case for
# easy merge with the map data later
election$county_name <- tolower(election$county_name)

# change the 'ST' variable to 'state' for easy merge with map data later
colnames(election)[11] <- "state"
county_map <- map_data("county")
names(county_map) <- c("long", "lat", "group", "order", "state_name", "county_name")
county_map$state <- state.abb[match(county_map$state_name, tolower(state.name))]
county_map$state_name <- NULL
state_map <- map_data("state")

# combine county and state borders
choropleth <- merge(county_map, election, by = c("state", "county_name"))
choropleth <- choropleth[order(choropleth$order), ]

# to help ggplot with an inheritance issue due to column name
# confusion introduced by having two columns named 'lat' before the merge
choropleth$lat <- choropleth$lat.x

###
# we now have the general structure laid out for the data frames
# to be used later but before we get too into that let's pick a
# state and do a basic look at the voting statistics by county.
# we will take Maryland, a solid Democrat voting state and take
# a look at how "blue" or not it is

maryland <- subset(election, state == "MD")
ggplot(maryland, aes(county_name, results.clintonh - results.trumpd)) +
  geom_bar(position = "dodge", stat = "identity", aes(group = county_name)) +
  coord_flip()

# we've subtracted the number of Trump votes from the number of
# Clinton votes and plotted that on a bar chart to see how strongly
# each county voted

# now let's look at a national level map
# to begin with let's plot the counties won by each candidate
choropleth$plotv <- choropleth$trump_win

# plot the map of the election results by county
ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv)) + scale_fill_manual(values = alpha(c("blue", "red"))) +
  geom_polygon(data = state_map, color = "white", fill = NA) +
  ggtitle("2016 Election Results by County") +
  labs(fill = "Party")


###
# Question 1: Votes that seemed to switch sides
# Starting with Republican votes
###

choropleth$votes_v_regist <- choropleth$results.trumpd - choropleth$rep12

# Create a scale for use with a Brewer color scheme with 6 levels
choropleth$plotv <- cut_number(choropleth$votes_v_regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)

###
# And now with the Democratic votes
###

choropleth$votes_v_regist <- choropleth$results.clintonh - choropleth$dem12

# Create a scale for use with a Brewer color scheme with 6 levels
choropleth$plotv <- cut_number(choropleth$votes_v_regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)


###
# What sort of difference is there in the # of registered
# voters for each party?
###

choropleth$regist <- choropleth$rep12 - choropleth$dem12

# Create a scale for use with a Brewer color scheme with 6 levels
choropleth$plotv <- cut_number(choropleth$regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)

# repeat above but only using the percentage of 2012 Republicans


choropleth$regist <- choropleth$rep12_frac

# Create a scale for use with a Brewer color scheme with 6 levels
choropleth$plotv <- cut_number(choropleth$regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)

###
# influence of population density on vote outcome
###

choropleth$votedensity <- as.numeric(choropleth$Population.Density.per.square.mile.of.land.area) / (choropleth$results.trumpd - choropleth$results.clintonh)

# Create a scale for use with a Brewer color scheme with 6 levels
choropleth$plotv <- cut_number(choropleth$votedensity, 6)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)

# the resulting map looks a lot like the map of the voting outcome so perhaps
# this isn't really telling us much. what does it look like on a 6 level scale
# with only the population density?


choropleth$votedensity <- as.numeric(choropleth$Population.Density.per.square.mile.of.land.area)

# Create a scale for use with a Brewer color scheme with 2 levels
# because six was too busy to try to understand and 2 will also
# be easier to compare to the election results map
choropleth$plotv <- cut_number(choropleth$votedensity, 2)

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +
  geom_polygon(data = state_map, color = "white", fill = NA)

##
# what is our accuracy if we just use pop density to predict vote direction?
##

choropleth$predictions <- as.numeric(choropleth$plotv) - 1
misclassified <- mean(choropleth$predictions != choropleth$trump_win)
print(paste('Accuracy is ', 1 - misclassified))

# so in reality it doesn't look better than chance relying simply on the
# density, but I do wonder if the very small counties in some of the midwest
# states skew the result and whether we should find some better normalizing methods



Sunday, November 13, 2016

From Bad Visualization to Better

Taking a second look at the quality of the design and presentation from an earlier post, I will seek to improve one visualization from the 2 November 2016 posting "Attempts at Visualizing a Large Dataset" (https://dataviz2016.blogspot.com/2016/11/attempts-at-visualizing-large-dataset.html). Several of the visualizations in that post are of reduced value or utility, largely because each tries to convey too much information and spans widely differing scales.


The visualization I will address is the pie chart "1996 Registration Orgs." This was displayed as a pie chart in order to meet other requirements, but there are many problems with the use of pie charts in general, and this visualization in particular is a terrible use of one. There are too many categories to display (21 different values), one category consumes the vast majority of the pie, it is not possible to read the names of many of the categories, the distribution and selection of colors makes it difficult to differentiate the categories, and the difference in magnitude between several categories makes it impossible to judge their actual values or even the scale of the difference between them. In short, a pie chart is the wrong visualization type to use.


To create a more useful and readable visualization of the different organizations with a 1996 registration start date I will change this from a pie chart to a bar chart, add a chart title, label the X and Y axes, and plot in a flat two-dimensional perspective to prevent any misleading impressions caused by the mind's interpretation of angles, area, and scale.

Removing organizations that had an index (or x) value but a count (y value) of zero reduces the clutter in the plot; if an organization has no counts there is not necessarily a need to plot it, since its absence from the chart can also convey a zero value. Assigning a different color to each organization type also helps differentiate the groups. But when this is plotted, the sizable difference in magnitude for "PRIVATE" is a problem, as it makes all the other values unreadable and impossible to compare.


Changing the y-scale to a log scale reduces this effect as shown below.


Now we can see all the values, sorted in descending order and on a log scale so that PRIVATE can be included with all the other organization types. Using the Evergreen and Emery Data Visualization Checklist to compare the rating scores of the above horizontal bar chart against the prior pie chart (presented again below for single-screen comparison) reflects an improvement on just about every guideline.



Associated R Source Code:


library(ggplot2)

florida <- read.csv(file.choose())

# How many registered entities have registered addresses outside
# of Florida?

outofstate <- subset(florida, !(STATE_CODE %in% c("FL")))
# ggplot can count up how many different values there are for STATE_CODE
# and then plot the results in a bar chart for us
ggplot(outofstate, aes(outofstate$STATE_CODE)) + geom_bar(stat="count") + ggtitle("Out of State Registrations") + xlab("State")

# Are there any trends, bursts, or drop-offs for the registration
# start dates? What do they look like if we plot them with a line
# graph? This would be a plot of time series data binned by date

# fix the date format from levels to a date format understood by R
florida$START_DATE <- as.Date(florida$START_DATE, format="%d-%b-%y")

# now let's limit ourselves to 1970-2016 because there is some odd
# data in this set as well as 72,620 records without a valid START_DATE
# and we have to plot the y-axis on a log scale because of an
# unusually large number of values in the late 1990's
ggplot(florida, aes(florida$START_DATE)) + geom_line(stat="count") + scale_x_date(limits = c( as.Date(c("1970-01-01")), as.Date(c("2016-12-31")))) + coord_trans(y="log10")


# let's isolate that strange year in the mid/late 1990s with the surge
# of what appears to be new registrations

# our earlier ggplot was a bit more complicated because it directly accessed the full data
# but we can be clearer in our code and effort if we strip some details down
# so let's create a vector with only the year of the START_DATE

isolate <- format(florida$START_DATE, "%Y")

# and convert the year from a character string to an integer
isolate <- as.numeric(isolate)

# now let's create a summary table with counts by year
year_counts <- as.data.frame(table(isolate))

# and the top six years for entries is
head(year_counts[rev(order(year_counts$Freq)),])

# telling us that 1996 has 21,094 START_DATEs
# with the year isolated what type of organization or entity is this? let's
# plot ORG_TYPE as a pie chart and take a look at the results for this year

stage1 <- subset(florida, (START_DATE > as.Date("1995-12-31")))
ninetysix <-  subset(stage1, ( START_DATE < as.Date("1997-01-01")))

# we now have a dataset containing only START_DATEs in 1996 in 'ninetysix'

org_types <- as.data.frame(table(ninetysix$ORG_TYPE))
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# this does not pie chart well due to PRIVATE being almost 40 times
# bigger than the next largest but what if we just zero it out

#org_types$Freq[17] <- 0
#pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# remove zero values
cleaned <- subset(org_types, (Freq > 0))
colnames(cleaned) <- c("Org.Type", "Number")

# change the sort order within the plot such that they are in order and
# not randomly placed making it easier to differentiate between the variables

disp <- ggplot(cleaned, aes(x=reorder(Org.Type, Number), y=Number, fill=Org.Type)) + 
        geom_bar(stat='identity', position='identity') + ggtitle("1996 Registered Organization Types") + labs(x="Organization",y="Number of Orgs") + coord_flip()
disp

# plot using a log(y) scale

disp <- ggplot(cleaned, aes(x=reorder(Org.Type, Number), y=Number, fill=Org.Type)) + 
  geom_bar(stat='identity', position='identity') + scale_y_log10() + ggtitle("1996 Registered Organization Types") + labs(x="Organization",y="log(Number of Orgs)") + coord_flip()
disp

Saturday, November 5, 2016

Basic Animations with R Shiny, the Animation Library, and ImageMagick


In this post I take a look at the use of animation in R with the Shiny and Animation libraries. The animations are created from a series of individual graphs plotted by R and then stitched together to form an animation.

The below code sample was obtained from the Analysis Programming blog ( http://alstatr.blogspot.com/2014/02/r-animating-2d-and-3d-plots.html ):


library(animation)

saveGIF({
  for(i in 1:150){
    x <- seq(-6 + (i * 0.05), 6 + (i * 0.05), length= 100)
    y <- x
    f <- function(x, y) { sin(x) + cos(y) }
    z <- outer(x, y, f)
    persp(x, y, z, theta = 45 + (i * 0.5), phi = 35, expand = 0.4, col = "lightblue")
  }
}, interval = 0.1, ani.width = 550, ani.height = 550)


The saveGIF() function has some non-R dependencies in order to convert a series of PNG format still images created by the animation routine into an animated GIF. The dependency was resolved by installing ImageMagick ( http://imagemagick.org/script/binary-releases.php ) and then restarting R Studio. 


The resulting animated GIF of the sin(x) + cos(y) surface is below. Incrementing the theta argument in the persp() call rotates the view of the graph from frame to frame.

Having found a working piece of sample code and generated an animated GIF from a series of PNG plots, I decided to modify the code, replacing the function with something different, to see whether the modified version would also work.

In the interests of simplicity and not requiring external data sets I decided to animate a series of histogram plots of generated random numbers and plot a normal line over the histogram. Using unweighted random numbers does not make for an interesting histogram, however, so I also had to create a random set of weights to be used when generating the random numbers. 

The sample() function is well suited to these tasks: the mix vector is assigned 25 random numbers between 1 and 100, drawn with replacement. The mix variable is then used to create the dist vector, which is simply each value divided by 100 to obtain the probabilities used in the next sample() call.

I generated only 25 random numbers for the probabilities because the values plotted in the histogram can fall between 1 and 25, so we need exactly 25 probabilities to pass to sample() as the probability distribution vector when generating 100 values between 1 and 25.

saveGIF({
  for(i in 1:100){
    # draw 25 random weights and convert them to probabilities for the values 1 to 25
    mix <- sample(1:100, 25, replace = T)
    dist <- mix / 100
    # draw 100 values between 1 and 25 using those probabilities
    set <- sample(1:25, 100, replace = T, prob = dist)

    # plot a histogram of the draws and overlay a fitted normal curve
    h <- hist(set, breaks=10, density=10, col="darkblue")
    xfit <- seq(min(set), max(set))
    yfit <- dnorm(xfit, mean=mean(set), sd=sd(set))
    yfit <- yfit*diff(h$mids[1:2])*length(set)
    lines(xfit, yfit, col="red", lwd=2)
  }
}, interval = 0.3, ani.width = 600, ani.height = 600)


This set of 100 random values is then plotted as a histogram with a normal curve fitted to the plot. All of this code is wrapped in a for-loop that iterates 100 times, creating 100 PNG images, which are then passed to the saveGIF() function, which in turn calls on ImageMagick to create the animated GIF. I slowed the saveGIF interval slightly, to 0.3, as the prior value of 0.1 flipped through the images a little too quickly for a set of histograms compared to a rotating surface plot of sin and cos.




Wednesday, November 2, 2016

Attempts at Visualizing a Large Dataset


In this post we take a look at three visualizations from a large data set of 257,567 rows with 24 variables per row, consisting of a mix of numeric, categorical, and geographic data. The task is to create at least one bar chart, line graph, and pie chart, and we will be performing these tasks with the R language.

The data set is a CSV formatted file containing organizational registration data for the state of Florida. We read it in and then create a new subset of the data to explore the question of how many organizations are not listed with a Florida address. With that subset stored in "outofstate" we can plot it directly with ggplot() as a bar chart.




The first and largest column on the far left is for values that had no state listed for their registration address. The highest frequency state was GA followed by TX.

Next we take a look at the data as a time series based on the registered START_DATE variable. We have to reinterpret the START_DATE field as a formatted date value, and it is apparent that while data exists from the 1930s through to future dates, the majority of registrations fall between the 1980s and the current day. For our plot we will limit the date range to 1970-2016 and create a line plot of the number of registrations over that window. The first attempt at the plot was impossible to read because of the scaling of the y-axis, caused by one value that was massively larger than all the others. To reduce the impact of that one value on the rest of the plot we changed the y-axis to a log scale.


Looking at this plot, the number of registrations in the data set in the 1970s was relatively light and consistent, really took off in the 1980s, and became more regular in frequency and volume in the late 1980s, but at some point in the mid-to-late 1990s there is that one year with a significantly larger value that prompted us to use the log scale for the y-axis. Which year was that? With these plot options and scaling it is not possible to tell, so we'll use R to take a look, and it looks like it is 1996.

With the year isolated, what is the makeup of the organization types registering in 1996? We will create a new data set containing only rows for 1996 using two subset instructions, then another data set containing only the ORG_TYPE values and their counts for 1996. This data set can be used directly with pie() to create a pie chart.



Once again we have one value that is ruining the scale and the ability to visualize the organizations; PRIVATE is far and away the most frequent value for registered entities in 1996. If we remove PRIVATE from the data set we should be able to get a better sense of the other values present.


This allows us to get a better sense of the organizations that were not PRIVATE, although the way R displays the labels could be improved through the use of a legend to the side of the pie chart instead of attempting to label the slices directly.

R Source Code:

# Assignment:
#  Based on the data, create three different types of visualizations 
#  that include: Bar Chart, Line Graph, and Pie chart. Under each type 
#  of visualization, provide a short summary and discuss what attributes 
#  the visualization provides for understanding the data
#

library(ggplot2)

florida <- read.csv(file.choose())

# How many registered entities have registered addresses outside
# of Florida?

outofstate <- subset(florida, !(STATE_CODE %in% c("FL")))
# ggplot can count up how many different values there are for STATE_CODE
# and then plot the results in a bar chart for us
ggplot(outofstate, aes(outofstate$STATE_CODE)) + geom_bar(stat="count") + ggtitle("Out of State Registrations") + xlab("State")

# Are there any trends, bursts, or drop-offs for the registration
# start dates? What do they look like if we plot them with a line
# graph? This would be a plot of time series data binned by date

# fix the date format from levels to a date format understood by R
florida$START_DATE <- as.Date(florida$START_DATE, format="%d-%b-%y")

# now let's limit ourselves to 1970-2016 because there is some odd
# data in this set as well as 72,620 records without a valid START_DATE
# and we have to plot the y-axis on a log scale because of an
# unusually large number of values in the late 1990's
ggplot(florida, aes(florida$START_DATE)) + geom_line(stat="count") + scale_x_date(limits = c( as.Date(c("1970-01-01")), as.Date(c("2016-12-31")))) + coord_trans(y="log10")


# let's isolate that strange year in the mid/late 1990s with the surge
# of what appears to be new registrations

# our earlier ggplot was a bit more complicated because it directly accessed the full data
# but we can be clearer in our code and effort if we strip some details down
# so let's create a vector with only the year of the START_DATE

isolate <- format(florida$START_DATE, "%Y")

# and convert the year from a character string to an integer
isolate <- as.numeric(isolate)

# now let's create a summary table with counts by year
year_counts <- as.data.frame(table(isolate))

# and the top six years for entries is
head(year_counts[rev(order(year_counts$Freq)),])

# telling us that 1996 has 21,094 START_DATEs
# with the year isolated what type of organization or entity is this? let's
# plot ORG_TYPE as a pie chart and take a look at the results for this year

stage1 <- subset(florida, (START_DATE > as.Date("1995-12-31")))
ninetysix <-  subset(stage1, ( START_DATE < as.Date("1997-01-01")))

# we now have a dataset containing only START_DATEs in 1996 in 'ninetysix'

org_types <- as.data.frame(table(ninetysix$ORG_TYPE))
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# this does not pie chart well due to PRIVATE being almost 40 times
# bigger than the next largest but what if we just zero it out

org_types$Freq[17] <- 0
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

Saturday, October 29, 2016

Exploring the ggplot2() Library in R

The R language's ggplot2 library includes a number of basic and intricate plotting and graphics capabilities to aid in visualizing data. Included within the ggplot2 library are also a number of datasets to help demonstrate and explore those capabilities.


Using the included diamonds data set it is possible to explore several of the graph and plot types built into the ggplot2 library. The basic structure is a call to ggplot, passing it the dataset, the specific variables to plot, the type of graph or plot, and options. The following commands plot histograms, frequency polygons, line plots, and statistical smoothing.

Starting with a regular histogram on the number of diamonds based on carat weight:
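Something along these lines produces that histogram (a sketch using the diamonds data bundled with ggplot2, with default binning):

ggplot(diamonds, aes(carat)) + geom_histogram()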



Adding the factor variable "clarity" to the same histogram of carat counts, ggplot colorizes the clarity values within each carat bin.
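A sketch of that stacked histogram:

ggplot(diamonds, aes(carat, fill = clarity)) + geom_histogram()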



We can use clarity as our factor variable in geom_freqpoly(), as well, for a slightly different view of the data compared to the histogram:
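A sketch of that frequency polygon, with clarity mapped to the line color:

ggplot(diamonds, aes(carat, colour = clarity)) + geom_freqpoly()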




For the next graph we can scatter plot the carat weight against price and include a fitted line in the scatter plot by adding a call to stat_smooth():
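A sketch of that plot, with the smoothing layer added by stat_smooth():

ggplot(diamonds, aes(carat, price)) + geom_point() + stat_smooth()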






While reviewing the documentation for the ggplot2 library, the map_data() function caught my eye. Bringing up the help for map_data within R Studio offers details on the arguments along with a small sample R script plotting a set of sample data for the fifty states. However, the map rendered by the sample script covers only the continental 48 states.


Also within this set of libraries are the data and code needed for plotting individual states and the major cities in those states. After creating a subset of data for Florida and Florida cities it was simple to plot a map of Florida and those major cities.


Passing the lat and long values from this fl_cities dataset to ggplot, adding borders() details to draw the county borders for the state of Florida, specifying that each city should be plotted as a single point, providing a title of “Major Florida Cities”, and finally labeling the X and Y axes Longitude and Latitude results in a “graph” that is a map of the cities, counties, and state.
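A sketch of the plot described above, assuming fl_cities was built from the us.cities data that ships with the maps package:

library(ggplot2)
library(maps)

# subset the city list that ships with the maps package down to Florida
fl_cities <- subset(us.cities, country.etc == "FL")

ggplot(fl_cities, aes(long, lat)) +
  borders("county", "florida") +
  geom_point() +
  ggtitle("Major Florida Cities") +
  xlab("Longitude") + ylab("Latitude")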




Tuesday, October 18, 2016

Introduction to Open Source R for Visualization


In this exercise I downloaded and installed the R language and R Studio in order to perform some basic analysis and visualization on a set of nine numbers. To get started I installed R base version 3.3.1 downloaded from https://cran.rstudio.com/bin/windows/base/ and then R Studio version 0.99.903, as downloaded from www.rstudio.com/products/rstudio/download.

To assign values to variables in R either the symbol ‘=’ or the two-character combination ‘<-’ can be used. For this exercise the values being assigned are 10, 20, 30, 40, 50, 60, 70, 80, and 81. This set of values can be passed to our variable by using the R function c(). The c() function is a generic function that combines, or concatenates, its arguments, so by assigning c(10, 20, 30, 40, 50, 60, 70, 80, 81) to our variable we store the concatenated series of numbers as an array.


To experiment with R through R Studio, instructions can be entered directly into the Console:
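For example, the assignment described above can be typed at the console prompt (the variable name "values" matches the Environment window described below):

values <- c(10, 20, 30, 40, 50, 60, 70, 80, 81)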

We can also look in the Environment window of R Studio to see that the numbers are assigned to variable “values” and that these numbers are stored as type numeric in an array indexed 1 to 9:


R makes it easy to dig right into analysis and visualization; we can create a pie chart simply by calling the function pie() with our variable, values:



The labels in this pie chart are the index values of the array, which is not easy to read or understand. We can create our own labels by calling the function names() on our variable and assigning it the labels we want. For simplicity we will use the values themselves as the labels and then call pie(values) again for a new plot:



We can also plot the variable “values” as a bar chart by calling the function barplot():



We can now take these instructions and store them in an R script:
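Collected together, the script amounts to something like this sketch of the steps above:

# assign the nine values to a variable
values <- c(10, 20, 30, 40, 50, 60, 70, 80, 81)

# use the values themselves as the labels and draw the pie chart
names(values) <- values
pie(values)

# plot the same values as a bar chart
barplot(values)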