Data Science and Visualization Explorations: From Bad Visualization to Better

Taking a second look at the design and presentation of a data subset from an earlier post, I will try to improve one visualization from the 2 November 2016 posting “Attempts at Visualizing a Large Dataset” (https://dataviz2016.blogspot.com/2016/11/attempts-at-visualizing-large-dataset.html). Several of the visualizations in that post are of limited value, largely because they try to convey too much information and because the values being plotted differ so widely in scale.

The visualization I will address is the pie chart “1996 Registration Orgs.” It was displayed as a pie chart in order to meet other requirements, but there are many problems with pie charts in general, and this visualization in particular is a terrible use of one. There are too many categories to display (21 different values); one category consumes the vast majority of the pie; the names of many categories cannot be read; the selection and distribution of colors make it difficult to tell the categories apart; and the difference in magnitude between several categories makes it impossible to judge their actual values or even the scale of the difference between them. In short, a pie chart is the wrong visualization type to use.
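
To make the magnitude problem concrete, here is a small sketch in base R. The counts are invented for illustration (they are not the values from the Florida dataset), and the names `toy_counts`, `share`, and `angle_deg` are hypothetical. The point is only how little of the circle is left for the smaller categories once one category dominates.

```r
# Hypothetical counts, NOT taken from the Florida data
toy_counts <- c(PRIVATE = 20000, TYPE_A = 500, TYPE_B = 40, TYPE_C = 5)

# Share of the whole, and the angle each slice would get in a pie chart
share <- toy_counts / sum(toy_counts)
angle_deg <- 360 * share

# Print the slice angles in degrees
round(angle_deg, 2)
```

With these made-up numbers the three smaller categories together get fewer than 10 of the 360 degrees, so their slices and labels collapse on top of one another.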

To create a more useful and readable visualization of the different organizations with a 1996 registration start date, I will change the pie chart to a bar chart, add a chart title, label the X and Y axes, and plot in a flat two-dimensional perspective to prevent any misleading impressions caused by the mind's interpretation of angles, area, and scale.

Removing organizations that had an index (or x) value but counts (y values) of zero reduces the clutter in the plot; if an organization type has no counts there is no real need to plot it, since its absence from the chart already conveys a zero value. Assigning a different color to each organization type also helps differentiate the groups. But when this is plotted, the sizable difference in magnitude for "PRIVATE" is a problem: it makes all the other values unreadable and impossible to compare.
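
The filtering step can be tried in isolation on a toy table. This is a minimal sketch with made-up counts and hypothetical names (`toy`, `kept`), assuming a two-column data frame shaped like the output of `as.data.frame(table(...))`:

```r
# Made-up frequency table; column names mirror as.data.frame(table(x))
toy <- data.frame(
  Var1 = c("PRIVATE", "TYPE_A", "TYPE_B", "TYPE_C", "TYPE_D"),
  Freq = c(20000, 500, 0, 40, 0)
)

# Keep only the organization types that actually occur
kept <- subset(toy, Freq > 0)

# Sort descending so the largest category comes first
kept <- kept[order(kept$Freq, decreasing = TRUE), ]

# Ratio of the largest count to the second largest
kept$Freq[1] / kept$Freq[2]
```

On a linear axis, a ratio of this size leaves the smaller bars almost invisible next to the largest one, which is the readability problem described above.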

Changing the y-scale to a log scale reduces this effect as shown below.
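
Why a log scale helps can also be seen numerically. The following is a sketch with invented counts (not the dataset's values) and hypothetical variable names: taking log10 turns differences of several orders of magnitude into differences of a few units, so every category gets a visible bar.

```r
# Invented counts spanning several orders of magnitude
toy_counts <- c(20000, 500, 40, 5)

# Spread of the values on the linear scale
linear_ratio <- max(toy_counts) / min(toy_counts)

# The same values after a log10 transform
log_values <- log10(toy_counts)
log_ratio <- max(log_values) / min(log_values)

linear_ratio
round(log_values, 2)
```

The ordering of the categories is preserved, but the largest value is now only a few times the smallest rather than thousands of times larger. In ggplot2, `scale_y_log10()` applies this transform to the axis while keeping the tick labels in the original units.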

Now we can see all the values, sorted in descending order, on a log scale that lets PRIVATE be included with all the other organization types. Rating the above horizontal bar chart and the prior pie chart (presented below again for single-screen comparison) against the Evergreen and Emery Data Visualization Checklist shows an improvement in just about every guideline.

Associated R Source Code:

library(ggplot2)

florida <- read.csv(file.choose())

# How many registered entities have registered addresses outside
# of Florida?

outofstate <- subset(florida, !(STATE_CODE %in% c("FL")))
# ggplot can count up how many different values there are for STATE_CODE
# and then plot the results in a bar chart for us
# (refer to the column by name inside aes() rather than with outofstate$...)
ggplot(outofstate, aes(x = STATE_CODE)) +
  geom_bar(stat="count") +
  ggtitle("Out of State Registrations") +
  xlab("State")

# Are there any trends, bursts, or drop-offs for the registration
# start dates? What do they look like if we plot them with a line
# graph? This would be a plot of time series data binned by date

# fix the date format from levels to a date format understood by R
florida$START_DATE <- as.Date(florida$START_DATE, format="%d-%b-%y")

# now let's limit ourselves to 1970-2016 because there is some odd
# data in this set as well as 72,620 records without a valid START_DATE
# and we have to plot the y-axis on a log scale because of an
# unusually large number of values in the late 1990's
ggplot(florida, aes(x = START_DATE)) +
  geom_line(stat="count") +
  scale_x_date(limits = c(as.Date("1970-01-01"), as.Date("2016-12-31"))) +
  coord_trans(y="log10")

# let's isolate that strange year in the mid/late 1990s with the surge
# of what appears to be new registrations

# our earlier ggplot was a bit more complicated because it directly accessed the full data
# but we can be clearer in our code and effort if we strip some details down
# so let's create a vector with only the year of the START_DATE
# (START_DATE was already converted to a Date above, so format() can pull out the year)

isolate <- format(florida$START_DATE, "%Y")

# and convert it from a character string to an integer
isolate <- as.integer(isolate)

# now let's create a summary table with counts by year
year_counts <- as.data.frame(table(isolate))

# and the top six years by number of entries are
head(year_counts[order(year_counts$Freq, decreasing = TRUE), ])

# telling us that 1996 has 21,094 START_DATEs
# with the year isolated what type of organization or entity is this? let's
# plot ORG_TYPE as a pie chart and take a look at the results for this year

stage1 <- subset(florida, (START_DATE > as.Date("1995-12-31")))
ninetysix <- subset(stage1, ( START_DATE < as.Date("1997-01-01")))

# we now have a dataset containing only START_DATEs in 1996 in 'ninetysix'

org_types <- as.data.frame(table(ninetysix$ORG_TYPE))
pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# this does not pie chart well due to PRIVATE being almost 40 times
# bigger than the next largest but what if we just zero it out

#org_types$Freq[17] <- 0
#pie(org_types$Freq, main="1996 Registration Orgs", labels=org_types$Var1)

# remove zero values
cleaned <- subset(org_types, (Freq > 0))
colnames(cleaned) <- c("Org.Type", "Number")

# change the sort order within the plot such that they are in order and
# not randomly placed making it easier to differentiate between the variables

# fill is mapped to Org.Type so each organization type gets its own color
disp <- ggplot(cleaned, aes(x=reorder(Org.Type, Number), y=Number, fill=Org.Type)) +
  geom_bar(stat='identity', position='identity') +
  ggtitle("1996 Registered Organization Types") +
  labs(x="Organization", y="Number of Orgs") +
  coord_flip()
disp

# plot using a log(y) scale

# scale_y_log10() takes no data argument; it transforms the axis and keeps
# the tick labels in the original units
disp <- ggplot(cleaned, aes(x=reorder(Org.Type, Number), y=Number, fill=Org.Type)) +
  geom_bar(stat='identity', position='identity') +
  scale_y_log10() +
  ggtitle("1996 Registered Organization Types") +
  labs(x="Organization", y="Number of Orgs (log scale)") +
  coord_flip()
disp


Sunday, November 13, 2016
