Wednesday, November 2, 2016

Attempts at Visualizing a Large Dataset


In this post we at a look at three visualizations from a large data set of 257,567 rows with 24 variables per row consisting of a mix of numeric, categorical, and geo data. The task is to create at least one bar chart, line graph, and pie chart and we will be performing these tasks with the R language.

The data set is in the form of a CSV formatted file containing organizational registration data for the state of Florida that we read in and then create a new subset of data to explore the question how many organizations are not listed with a Florida address. With that subset of data stored in "outofstate" we can directly plot this with ggplot() as a bar chart.




The first and largest column on the far left is for values that had no state listed for their registration address. The highest frequency state was GA followed by TX.

Next we take a look at the data as a time series based on the registered START_DATE variable. We have to reinterpret the START_DATE field as a formatted date value and it's apparent that while data exists for the 1930's through to future dates the majority of the registrations are in the 1980s through to current day. For our plot we will limit the date range to 1970-2016 and create a line plot on the number of registrations each year between 1970 and 2016. The first attempt at the plot was impossible to read because of the scaling of the y-axis due to one value that was massively larger than all other values. To reduce the impact of this one value on the scale of the rest of the plot we changed the y-axis to be on log scale.


Looking at this plot the number of registrations in the data set in the 1970s was relatively light and consistent but really took off in the 1980s, became more regular in frequency and volume in the late 1980s, but at some point in the mid-to-late 1990s is where we have that one year with a significantly larger value that prompted us to use the log scale for the y-axis. Which year was that? With the plot options and scaling it is not possible to tell so we'll use R to take a look and ir looks like it is 1996.

With the year isolated what is the make up of the organization types registering in 1996? We will create a new data set containing only rows for 1996 with two subset instructions, then another data set containing only the ORG_TYPE and counts of those values for 1996. This data set can be directly used with pie() to create a pie chart.



Once again we have one value that is ruining the scale and ability to visualize the organizations; PRIVATE is far and away the most frequent value for registered entities in 1996. Let's remove PRIVATE from the data set we should be able to get some better sense of the other values present.


This allows us to get a better sense of the organizations that were not PRIVATE, although the way R displays the labels could be improved through the use of a legend to the side of the pie chart instead of attempting to label the slices directly.

R Source Code:

# Assignment:
#  Based on the data, create three different types of visualizations 
#  that include: Bar Chart, Line Graph, and Pie chart. Under each type 
#  of visualization, provide a short summary and discuss what attributes 
#  the visualization provides for understanding the data
#

library(ggplot2)

florida <- read.csv(file.choose())

# How many registered entities have registered addresses outside
# of Florida?

outofstate <- subset(florida, !(STATE_CODE %in% c("FL")))
# ggplot can count up how many different values there are for STATE_CODE
# and then plot the results in a bar chart for us
ggplot(outofstate, aes(outofstate$STATE_CODE)) + geom_bar(stat="count") + ggtitle("Out of State Registrations") + xlab("State")

# Are there any trends, bursts, or drop-offs for the registration
# start dates? What do they look like if we plot them with a line
# graph? This would be a plot of time series data binned by date

# fix the date format from levels to a date format understood by R
florida$START_DATE <- as.Date(florida$START_DATE, format="%d-%b-%y")

# now let's limit ourselves to 1970-2016 because there is some odd
# data in this set as well as 72,620 records without a valid START_DATE
# and we have to plot the y-axis on a log scale because of an
# unusually large number of values in the late 1990's
ggplot(florida, aes(florida$START_DATE)) + geom_line(stat="count") + scale_x_date(limits = c( as.Date(c("1970-01-01")), as.Date(c("2016-12-31")))) + coord_trans(y="log10")


# let's isolate that strange year in the mid/late 1990s with the surge
# of what appears to be new registrations

# our earlier ggplot was a bit more complicated because it directly accessed the full data
# but we can be clearer in our code and effort if we strip some details down
# so let's create a vector with only the year of the START_DATE

isolate <- as.Date(florida$START_DATE, format = "%Y-%m-%d")

# and convert it from a date format to an integer
isolate <- as.numeric(isolate)

# now let's create a summary table with counts by year
year_counts <- as.data.frame(table(isolate))

# and the top six years for entries is
head(year_counts[rev(order(year_counts$Freq)),])

# telling us that 1996 has 21,094 START_DATEs
# with the year isolated what type of organization or entity is this? let's
# plot ORG_TYPE as a pie chart and take a look at the results for this year

stage1 <- subset(florida, (START_DATE > as.Date("1995-12-31")))
ninetysix <-  subset(stage1, ( START_DATE < as.Date("1997-01-01")))

# we now have a dataset containing only STATE_DATEs in 1996 in 'ninetysix'

org_types <- as.data.frame(table(ninetysix$ORG_TYPE))
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# this does not pie chart well due to PRIVATE being almost 40 times
# bigger than the next largest but what if we just zero it out

org_types$Freq[17] <- 0
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

No comments:

Post a Comment