Data Science and Visualization Explorations: Some Visualizations of The 2016 Election

The outcome of the 2016 Presidential Election surprised most political analysts and pollsters. The race had been considered close, but the conventional wisdom among the experts was that Hillary Clinton would win. On election night Donald Trump won the Electoral College by a comfortable margin even though, as it turned out, the popular vote went the other way, prompting many of those same experts to question how and why their polls and analysis had incorrectly predicted the outcome.

Obtaining The Data

Analysis cannot be performed without data, so our first step was to obtain election results. Emil Kirkegaard, a self-described statistician from Denmark, posted a dataset soon after the election at https://github.com/Deleetdk/USA.county.data containing county-level election results combined with other county-level demographic data in CSV format, one row per county. After taking a look at the demographic data, there was one county-level measure missing that I wanted to include in my analysis: population density. This is available from the US Government’s Census website at http://www.census.gov/quickfacts/meta/long_LND110210.htm and is based on data from 2010.

Merging and Cleaning The Data

The dataset from Mr. Kirkegaard needed to be merged with the file downloaded from the Census website. The Census data required the most effort to convert and clean. It was then merged row-wise with the election results data, and Census rows with no matching county in the election results were removed. The election results file, for example, did not have county data for Alaska, which required deleting all of the Alaska county rows from the Census data. The resulting merged and cleaned dataset contains 3111 observations of 129 variables, with some variables (or columns) containing duplicate data.
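The row-wise merge described above can be sketched in base R. This is only a minimal illustration: the rows, the values, and the column names (state, county_name, density) are made up and are not taken from the actual files. By default merge() keeps only rows whose keys appear in both data frames, which is what drops Census rows with no matching election row.

```r
# Toy stand-ins for the two files; all values and column names are hypothetical
election_toy <- data.frame(
  state       = c("MD", "MD", "TX"),
  county_name = c("howard", "garrett", "travis"),
  trump_win   = c(0, 1, 0),
  stringsAsFactors = FALSE
)
census_toy <- data.frame(
  state       = c("MD", "MD", "TX", "AK"),
  county_name = c("howard", "garrett", "travis", "juneau"),
  density     = c(1100, 46, 1030, 12),
  stringsAsFactors = FALSE
)

# Inner join on state + county: the Alaska row has no match and is dropped
merged_toy <- merge(election_toy, census_toy, by = c("state", "county_name"))

# Census rows that found no partner in the election data
merged_keys <- paste(merged_toy$state, merged_toy$county_name)
census_keys <- paste(census_toy$state, census_toy$county_name)
unmatched   <- census_toy[!(census_keys %in% merged_keys), ]

print(nrow(merged_toy))
print(unmatched$state)
```

Listing the unmatched rows before discarding them is a cheap way to check whether a county was dropped because it is truly absent or only because its name is spelled differently in the two files.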

Importing Into R for Analysis and Visualization

The free and open source R language provides a powerful and simple means of importing and manipulating datasets such as this without having to purchase a commercial solution such as Tableau. After importing the data in CSV format we can confirm that R properly interpreted most of the columns without any guidance from us. The exception is the “trump_win” column, which was interpreted as an integer but which we want R to treat as a category, or “factor”.
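As a small sketch of that type check and conversion, the snippet below reads a tiny in-memory CSV instead of the real file. The rows are hypothetical; only the trump_win column name comes from the dataset.

```r
# Small in-memory CSV standing in for the real file (hypothetical rows)
csv_text <- "name,trump_win\nhoward,0\ngarrett,1\ntravis,0"
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv sees only the digits 0 and 1, so the column arrives as an integer
class_before <- class(toy$trump_win)

# Convert it to a factor so R treats it as a category
toy$trump_win <- as.factor(toy$trump_win)
class_after <- class(toy$trump_win)

# str() gives a compact view of how every column was interpreted
str(toy)
```

Running str() on the full data frame right after import is a quick way to spot any other columns that were read as a type you did not expect.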

Starting Basic: One State's Data

Before we dive in too deep, and before we've explored the dataset, let's take one state, Maryland, and perform some very basic visualizations in R. Maryland is a solid "Blue" state, having voted for a Democratic President in the vast majority of elections of the past century. Does that mean that every county in Maryland votes "Blue"? And how strongly "Blue" is each county? To explore this we will take the number of votes for Clinton, subtract the number of votes for Trump, and plot the results by county on a bar chart.

The results indicate that three counties voted strongly for Clinton, but more counties (16 for Trump vs 7 for Clinton) had slightly more votes for Trump than for Clinton.
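The pattern above, where one candidate carries more counties while the other wins far more votes, can be illustrated with a few made-up numbers. The counties and vote counts below are hypothetical and are not Maryland's actual results.

```r
# Hypothetical vote counts for five made-up counties
toy <- data.frame(
  county  = c("a", "b", "c", "d", "e"),
  clinton = c(90000, 4000, 3000, 5000, 2500),
  trump   = c(30000, 5000, 4500, 6000, 2000)
)

# Positive margin = more Clinton votes, negative = more Trump votes
toy$margin <- toy$clinton - toy$trump

# Number of counties carried by each candidate
counties_won <- table(ifelse(toy$margin > 0, "clinton", "trump"))

# Net margin across all counties combined
net_margin <- sum(toy$margin)

print(counties_won)
print(net_margin)
```

In this toy case Trump carries three of the five counties, yet the combined margin favors Clinton by a wide amount because one large county dominates the total.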

Baseline Test Plot With Maps

Having tested the dataset waters by zeroing in on a single state we can now stretch a bit further. To begin with let’s plot the counties won by each candidate. Note that this does not directly map to the Electoral College wins since some counties may have a large area but not have a proportionate impact on the Electoral College vote due to their population density. In R we can use the factor data in the variable $trump_win to quickly plot a map of the election results in our data set. When the results are plotted we can also quickly see the counties for which we are missing data but overall it looks like we are in decent shape.

We have now tested our ability to take a single binary value ($trump_win) and plot it county-wise on a map of the United States. We can now move on to trying to explore some other values and ideas using R and this visualization capability.

Voter Defections or Failure to Show at the Polls

A common theme in the media and among much of the general public was that both candidates were disliked to a greater degree than in any other recent election. Did this keep voters away from the polls? Or, in some ways a worse thought, did voters registered with one party switch and vote for the other party's candidate? Statistics on overall turnout across all registered voters are generally available, but the latter question is harder to answer from raw numbers. Can we use this dataset to get a sense of Democrats who stayed away or voted for the Republican candidate, and vice versa?

To attempt this we will take the number of votes for each candidate and subtract the number of registered voters for that party as of 2012. This is not going to be a very good measure: media reports indicated that large numbers of people registered to vote before the 2016 election so, in theory, the 2012 numbers should be lower than the actual 2016 numbers, which are not in our dataset.

We will start with the Democrats and use a choropleth map with six bins of values representing the difference between the votes for Clinton and the number of registered Democrats in each county. The plot below is a bit difficult to understand because of how the bins were created (cut_number sizes each bin to hold roughly the same number of observations, so the bins do not span equal ranges of values) and because of the colors R selected. The bright purple and blue areas reflect the largest increases in votes for a Democrat after subtracting out the number of registered Democrats from 2012. This would imply increased voter registrations in 2016 not captured in our data and/or Republicans switching sides and voting for the Democratic candidate in those counties. It's worth noting the strong purple segment running straight up the center of the country from Texas to Nebraska, as well as large segments of Utah and Florida.

The olive green and pink areas reflect a negative value after subtracting the number of registered Democrats from the votes for the Democratic candidate. It is curious that large segments seem to show a negative turnout in California, Washington state, and a lot of the counties in the Northeast, areas generally considered strongly Democratic.

What happens when we do the same analysis for the Republicans? That segment of land from Texas up to Nebraska has an olive green vein running through it, but note that R assigned less extreme values to this color for this data. So while olive green is not a "good" value, it does not represent as strong a negative value as it did on the Democratic map. That said, it does look like the Democrats strengthened in the Midwest counties even though they ultimately lost the Electoral College votes for those states in 2016.
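Because the bin boundaries have such a large effect on how these maps read, here is a rough base-R sketch of the two common ways to bin a skewed variable. It uses made-up values, and quantile() plus cut() stand in for the equal-count behavior of ggplot2's cut_number(); the variable names are hypothetical.

```r
# Skewed toy values: most are small, one is very large (hypothetical)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000)

# Equal-count bins: break points placed at quantiles of the data
q_breaks    <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_count <- cut(x, breaks = q_breaks, include.lowest = TRUE)

# Equal-width bins: the range is split into 3 intervals of the same width
equal_width <- cut(x, breaks = 3)

print(table(equal_count))
print(table(equal_width))
```

With equal-count bins every color covers the same number of observations but the value ranges differ widely from bin to bin and from map to map, which is why the same color can mean different things on the Democratic and Republican plots. With equal-width bins a single outlier can push almost everything into one bin.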

Party Differences by County

Another piece of conventional wisdom about this election is that the country is more polarized and that those living in the more densely populated urban areas are living in a different America than those in the more rural areas. Does this mean that rural and urban counties are more heavily weighted to one party than the other? One probably less-than-ideal way to explore this is to subtract the number of registered voters of one party from the registered voters of the other and look at the difference:

In the above plot I subtracted the number of 2012 registered Democrats from the number of 2012 registered Republicans, meaning that the olive green to reddish-pink counties are more Democratic and everything from green to blue to purplish-pink is more Republican.

A better measure would be to use a percentage of the registered voters:

This is based on the percentage of registered Republicans in 2012, so roughly green through purplish-pink are more than 50% registered Republicans while olive green and reddish-pink are more than 50% registered Democrats.
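To see why a percentage is the better measure, compare the raw difference with a share for two made-up counties of very different sizes. The numbers are hypothetical, and the share here is computed over the two major parties only, which is an assumption and may not match how the dataset's rep12_frac column is defined.

```r
# Two hypothetical counties: one very large, one very small
toy <- data.frame(
  county = c("big", "small"),
  rep12  = c(400000, 600),
  dem12  = c(500000, 400)
)

# Raw difference: dominated by the size of the county
toy$diff <- toy$rep12 - toy$dem12

# Two-party Republican share: comparable across counties of any size
toy$rep_share <- toy$rep12 / (toy$rep12 + toy$dem12)

print(toy)
```

The raw difference makes the large county look far more partisan than the small one, while the share shows that the small county actually leans more heavily toward one party.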

What Role Does Population Density Play?

It seems that as areas become more densely populated they tend to vote Democratic. Can we actually see that influence in the data? This is why I imported the additional county data from the census.gov website, as it included population density per square mile of land. Is it possible to take that population density figure and combine it with the voting data to create some sort of metric? As a test in R, I took the population density figure and divided it by the vote margin, that is, the number of votes for the Republican candidate minus the number of votes for the Democratic candidate. The theory was that in high-density areas that voted Democratic the values would remain large (the divisor being small) or would turn negative, while in less dense areas the values would become very small because a small density would be divided by the larger number remaining after subtracting the Democratic votes from the Republican votes.
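As a small worked example of the ratio just described, the sketch below applies it to three made-up counties. All of the values are hypothetical; only the form of the calculation (density divided by Trump votes minus Clinton votes) follows the text.

```r
# Three hypothetical counties with very different profiles
toy <- data.frame(
  county  = c("dense_dem", "rural_rep", "tossup"),
  density = c(5000, 20, 300),
  trump   = c(20000, 3000, 10005),
  clinton = c(80000, 1000, 10000)
)

# Vote margin: positive means more Trump votes
toy$margin <- toy$trump - toy$clinton

# The density-over-margin ratio
toy$votedensity <- toy$density / toy$margin

print(toy[, c("county", "margin", "votedensity")])
```

In this toy case the dense Democratic county ends up with a small negative value, the rural Republican county with a small positive one, and the near-tied county with by far the largest value. The ratio is driven mostly by how close the margin is to zero, which is one reason such a metric can cluster near zero with a few extreme values at either end.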

This plot does show a number of counties with negative values, and a good number of counties are olive green, which represents values hovering right around zero. In fact a lot of the values are very near zero, with only the far ends of the value range being far removed from zero. It's so bad that a box plot of this value in R is not even recognizable as a box plot:

This is not a good measure of anything in its current form, but it could be explored further by taking into account some other factors or by using a better normalizing method on the values. What if we didn't try to make a range of six categories but instead went with a binary value, using population density itself as the deciding factor, and compared that to the vote results? We could do this by reducing the number of categories to 2 and then using R to calculate the accuracy of our results against the actual votes.

The resulting accuracy is basically the same as a random guess. It would appear that a blind application of population density is insufficient. But there is a lot of other demographic data in this dataset that could be integrated into some regression or classifier algorithms to possibly generate a more accurate prediction.
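One way to put an accuracy figure like this in context is to compare it with the accuracy of always guessing the more common outcome. The sketch below does that on ten made-up counties. The data, the median split, and the direction of the prediction (denser half predicted as a Clinton win) are all assumptions chosen for illustration, not the calculation used for the result above.

```r
# Hypothetical outcomes for ten counties: 1 = Trump win, 0 = Clinton win
actual  <- factor(c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0), levels = c(0, 1))
density <- c(5, 8, 12, 20, 30, 45, 60, 90, 400, 2500)

# Predict a Clinton win (0) for the denser half and a Trump win (1) otherwise
predicted <- factor(ifelse(density > median(density), 0, 1), levels = c(0, 1))

# Share of counties where the prediction matches the outcome
accuracy <- mean(predicted == actual)

# Accuracy of always guessing the most common outcome
baseline <- max(table(actual)) / length(actual)

print(accuracy)
print(baseline)
```

When one candidate wins most counties, the majority-class baseline can sit well above 50%, so a density-only rule has to beat that figure, not a coin flip, to be considered useful.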

library(maps)

library(ggplot2)

# read in the datafile

election <- read.csv(file.choose())

# change the trump_win value to be a factor

election$trump_win <- as.factor(election$trump_win)

# change the 'name' variable to 'county_name'

colnames(election)[2] <- "county_name"

# change the values in 'county_name' to all lower case for

# easy merge with the map data later

election$county_name <- tolower(election$county_name)

# change the 'ST' variable to 'state' for easy merge with map data later

colnames(election)[11] <- "state"

county_map <- map_data("county")

names(county_map) <- c("long", "lat", "group", "order", "state_name", "county_name")

county_map$state <- state.abb[match(county_map$state_name, tolower(state.name))]

county_map$state_name <- NULL

state_map <- map_data("state")

# merge the county polygons with the election data by state and county name

choropleth <- merge(county_map, election, by = c("state", "county_name"))

choropleth <- choropleth[order(choropleth$order), ]

# to help ggplot with an inheritance issue due to column name

# confusion introduced by having two columns named 'lat' before the merge

choropleth$lat <- choropleth$lat.x

###

# we now have the general structure laid out for the data frames

# to be used later but before we get too into that let's pick a

# state and do a basic look at the voting statistics by county.

# we will take Maryland, a solid Democrat voting state and take

# a look at how "blue" or not it is

maryland <- subset(election, state == "MD")

ggplot(maryland, aes(county_name, results.clintonh - results.trumpd)) +

geom_bar(position = "dodge", stat = "identity", aes(group = county_name)) +

coord_flip()

# we've subtracted the number of Trump votes from the number of

# Clinton votes and plotted that on a bar chart to see how strongly

# each county voted

# now let's look at a national level map

# to begin with let's plot the counties won by each candidate

choropleth$plotv <- choropleth$trump_win

# plot the map of the election results by county

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv)) + scale_fill_manual(values = alpha(c("blue", "red"))) +

geom_polygon(data = state_map, color = "white", fill = NA) +

ggtitle("2016 Election Results by County") +

labs(fill = "Party")

###

# Question 1: Votes that seemed to switch sides

# Starting with Republican votes

###

choropleth$votes_v_regist <- choropleth$results.trumpd - choropleth$rep12

# Bin the values into 6 groups holding roughly equal numbers of rows (cut_number)

choropleth$plotv <- cut_number(choropleth$votes_v_regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

###

# And now with the Democratic votes

###

choropleth$votes_v_regist <- choropleth$results.clintonh - choropleth$dem12

# Bin the values into 6 groups holding roughly equal numbers of rows (cut_number)

choropleth$plotv <- cut_number(choropleth$votes_v_regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

###

# What sort of difference is there in the # of registered

# voters for each party?

###

choropleth$regist <- choropleth$rep12 - choropleth$dem12

# Bin the values into 6 groups holding roughly equal numbers of rows (cut_number)

choropleth$plotv <- cut_number(choropleth$regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

# repeat above but only using the percentage of 2012 Republicans

choropleth$regist <- choropleth$rep12_frac

# Bin the values into 6 groups holding roughly equal numbers of rows (cut_number)

choropleth$plotv <- cut_number(choropleth$regist, 6)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

###

# influence of population density on vote outcome

###

choropleth$votedensity <- as.numeric(choropleth$Population.Density.per.square.mile.of.land.area) / (choropleth$results.trumpd - choropleth$results.clintonh)

# Bin the values into 6 groups holding roughly equal numbers of rows (cut_number)

choropleth$plotv <- cut_number(choropleth$votedensity, 6)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

# the resulting map looks a lot like the map of the voting outcome so perhaps

# this isn't really telling us much. what does it look like when we bin

# only the population density?

choropleth$votedensity <- as.numeric(choropleth$Population.Density.per.square.mile.of.land.area)

# Split the values into 2 groups of roughly equal size (cut_number)

# because six was too busy to try to understand and 2 will also

# be easier to compare to the election results map

choropleth$plotv <- cut_number(choropleth$votedensity, 2)

ggplot(choropleth, aes(long, lat, group = group)) +

geom_polygon(aes(fill = plotv), color = alpha("white", 1/2), size = 0.2) +

geom_polygon(data = state_map, color = "white", fill = NA)

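# what is our accuracy if we just use pop density to predict vote direction?

# predictions: 0 = lower-density half, 1 = higher-density half, compared

# against trump_win (0/1) for every polygon row in the merged data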

choropleth$predictions <- as.numeric(choropleth$plotv) - 1

misclassified <- mean(choropleth$predictions != choropleth$trump_win)

print(paste('Accuracy is ', 1 - misclassified))

# so in reality it doesn't look better than chance relying simply on the

# density but I do wonder if the very small counties in some of the midwest

# states skew the result and we should find some more normalizing methods?

Wednesday, November 30, 2016
