Saturday, October 29, 2016

Exploring the ggplot2 Library in R

The R ggplot2 library includes a number of basic and intricate plotting and graphics capabilities to aid in visualizing data. The library also includes a number of datasets to help demonstrate and explore these capabilities.


Using the included diamonds dataset it is possible to explore several of the graph and plot types built into the ggplot2 library. The basic structure is a call to ggplot(), passing it the dataset, the specific variables to plot, the type of graph or plot, and options. The following commands plot histograms, frequency polygons, line plots, and statistical smoothing.

Starting with a regular histogram of the number of diamonds by carat weight:
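Assuming ggplot2 is installed, a sketch along these lines produces such a histogram (the binwidth of 0.5 is an assumed value, not taken from the original plot):

```r
library(ggplot2)

# Histogram of diamond counts by carat weight from the built-in
# diamonds dataset; binwidth here is an assumption
p <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
print(p)
```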



Adding the factor variable "clarity" to the same histogram of carat vs. count, ggplot2 colorizes the clarity values within each carat bin.
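Mapping clarity to the fill aesthetic is one way to get this stacked, colorized histogram (binwidth again assumed):

```r
library(ggplot2)

# Same carat histogram, with bars filled by the factor variable clarity
p <- ggplot(diamonds, aes(x = carat, fill = clarity)) +
  geom_histogram(binwidth = 0.5)
print(p)
```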



We can use clarity as our factor variable in geom_freqpoly(), as well, for a slightly different view of the data compared to the histogram:
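Swapping the geom and mapping clarity to colour gives one line per clarity level rather than stacked bars, roughly:

```r
library(ggplot2)

# Frequency polygon: one line per clarity level instead of stacked bars
p <- ggplot(diamonds, aes(x = carat, colour = clarity)) +
  geom_freqpoly(binwidth = 0.5)
print(p)
```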




For the next graph we can scatter plot carat weight vs. price and include a fitted line by adding a call to stat_smooth():
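A minimal sketch of that combination; the alpha setting is an assumption added to reduce overplotting in the large dataset:

```r
library(ggplot2)

# Scatter plot of carat vs. price with a fitted smoothing line
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +  # alpha assumed, to tame overplotting
  stat_smooth()
print(p)
```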






Reviewing the documentation for the ggplot2 library, the map_data() function caught my eye. Bringing up the Help for map_data within RStudio offers details on the arguments and a small sample R script plotting a set of sample data for the fifty states. However, the map rendered by the sample script covers only the continental 48 states.
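The help-page example boils down to something like the following sketch; map_data("state") returns boundary polygons for the lower 48 only, which explains the rendering described above (the maps package must be installed, since map_data() pulls its polygons from there):

```r
library(ggplot2)
library(maps)  # map_data() draws its boundary polygons from this package

# State boundary polygons; covers the continental 48 states only
states_map <- map_data("state")

p <- ggplot(states_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
print(p)
```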


This set of libraries also includes the data and code needed for plotting individual states and the major cities in those states. After creating a subset of data for Florida and its cities, it was simple to plot a map of Florida and those major cities.


Passing the lat and long columns of this fl_cities dataset to ggplot(), adding the borders details for plotting the state and county boundaries of Florida, plotting each city as a single point, and providing a title of “Major Florida Cities” along with X and Y axis labels of Longitude and Latitude results in a “graph” that is a map of the cities, counties, and state.
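The fl_cities name comes from the post; assuming it was subset from the us.cities dataset in the maps package, the plot can be sketched as:

```r
library(ggplot2)
library(maps)  # supplies the us.cities dataset and the map boundaries

# Subset the major-cities dataset to Florida
# (country.etc holds the state abbreviation)
fl_cities <- subset(us.cities, country.etc == "FL")

p <- ggplot(fl_cities, aes(x = long, y = lat)) +
  borders("county", "florida", colour = "grey70") +  # county borders
  borders("state", "florida") +                      # state border
  geom_point() +                                     # one point per city
  ggtitle("Major Florida Cities") +
  xlab("Longitude") + ylab("Latitude")
print(p)
```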




Tuesday, October 18, 2016

Introduction to Open Source R for Visualization


In this exercise I downloaded and installed the R language and RStudio in order to perform some basic analysis and visualization on a set of nine numbers. To get started I installed R base version 3.3.1, downloaded from https://cran.rstudio.com/bin/windows/base/, and then RStudio version 0.99.903, downloaded from www.rstudio.com/products/rstudio/download.

To assign values to variables in R, either the symbol ‘=’ or the two-character operator ‘<-’ can be used. For this exercise the values being assigned are 10, 20, 30, 40, 50, 60, 70, 80, and 81. This set of values can be assigned to our variable using the R function c(), a generic function that combines or concatenates its arguments, so assigning c(10, 20, 30, 40, 50, 60, 70, 80, 81) to our variable stores the concatenated series of numbers as a vector.
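In code, the assignment and a quick inspection look like this:

```r
# Combine the nine numbers into a vector and assign it to a variable;
# either '<-' or '=' works for the assignment
values <- c(10, 20, 30, 40, 50, 60, 70, 80, 81)

length(values)  # 9 elements, indexed 1 to 9
class(values)   # "numeric"
```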


To experiment with R through R Studio instructions can be entered directly into the Console:

We can also look in the Environment window of RStudio to see that the numbers are assigned to the variable “values” and that these numbers are stored as type numeric in a vector indexed 1 to 9:


R makes it easy to dig right into analysis and visualization; we can create a pie chart simply by calling the function pie() with our variable, values:



The labels in this pie chart are the index values of the vector, which is not easy to read or to understand. We can create our own labels by assigning them through the names() function, using names(values) with our variable as the argument and assigning it the labels we want. For simplicity we will use the values themselves as the labels and then call pie(values) again for a new plot:



We can also plot the variable “values” as a bar chart by calling the function barplot():



We can now take these instructions and store them in an R script:
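Collected into a single script, the steps above look roughly like this:

```r
# Assign the nine values, label each element with its own value,
# then draw the pie and bar charts
values <- c(10, 20, 30, 40, 50, 60, 70, 80, 81)
names(values) <- values   # use the values themselves as labels

pie(values)      # pie chart labeled with values instead of indices
barplot(values)  # the same data as a bar chart
```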






Saturday, October 15, 2016

plot.ly for Basic Statistical Measures




This assignment used plot.ly to produce some basic statistical measures on a small dataset of essentially five variables with eight observations. I noticed an error in the Girls data table: the value for Total was reported lower than it actually was. I was not planning on keeping these total observations anyway, since they did not add any value, but curiously the corrected totals were almost identical to the totals for Boys, with the first total off by only 0.1.



After thinking about the data provided and the types of results I could generate with plot.ly, I decided to merge the data into a single table and create a variable for Girls/Boys as a 0/1 category that we'd use as our dependent variable.

Goals   Grades   Popular   Sports   Girl/Boy
4       49       24        19       0
5       50       36        22       0
6       69       38        28       0
4       46.1     26.9      18.9     1
5       54.2     31.6      22.2     1
6       67.7     39.5      27.8     1

The primary benefit was that this made it easier to perform a column correlation between each variable and the Girl/Boy column to see whether any variable correlated positively or negatively with the group’s sex. As there are only three values for each variable from each sex, there is not much data to work with, and direct visual inspection shows that Goals, for example, has identical values for Girls and Boys, so there is nothing in that variable to differentiate between the sexes.

The other variables have some variation between the sexes. Performing a column correlation of Grades with sex we obtain a tiny number:





Running the same column correlation on Popular with sex results in a zero value, showing that even though the values themselves differ between the two sexes, the differences were not sufficient to show a correlation. A column correlation of Sports with sex results in a small negative value of -0.004495.
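The same column correlations can be reproduced in R from the merged table (assuming the Girls are the rows coded 0 and the Boys are coded 1):

```r
# Re-create the merged table from the post (Girls coded 0, Boys 1 assumed)
grades  <- c(49, 50, 69, 46.1, 54.2, 67.7)
popular <- c(24, 36, 38, 26.9, 31.6, 39.5)
sports  <- c(19, 22, 28, 18.9, 22.2, 27.8)
sex     <- c(0, 0, 0, 1, 1, 1)

cor(grades, sex)            # effectively zero (tiny floating-point residue)
cor(popular, sex)           # zero
round(cor(sports, sex), 6)  # -0.004495
```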

A weakness of taking a single column at a time and testing it for correlation to another column is that we are not taking the full set of data into consideration. That, combined with the small sample size, makes it difficult to generate any results of significance.

If seeking to determine the most important variable, it would be beneficial to have a better understanding of what the observations are measuring and on what sort of scale. We can see that the values for the Goals observations are the smallest and the observations for Grades are the largest. It also appears that the combined totals for each row of observations range from 95.9 to 141, leaving it unclear what sort of scoring scale the participants were using as they assigned these values to what was most important to them.

Taking the totals for each set of observations and looking at the calculated min, max, mean, median, standard deviation, and variance we have the following values:



The variable with an interesting difference between the sexes was the Popularity measure, but as the statistical measures below reveal, the reported values average out to the same mean. Since there are only three observations per sex, however, the median is not necessarily a good measure for assessing a larger population; the median of three observations is simply the middle value. In this example it could present a misleading result for the Girls, since the Girls' median is almost 5 points larger than the Boys' median, but the Girls also had a larger variance in their values than the Boys.
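A quick check in R on the popularity values alone (again assuming the first three rows of the merged table are the Girls) bears this out:

```r
girls_popular <- c(24, 36, 38)
boys_popular  <- c(26.9, 31.6, 39.5)

mean(girls_popular)    # ~32.67, identical to the boys' mean
mean(boys_popular)     # ~32.67
median(girls_popular)  # 36, vs. 31.6 for the boys
median(boys_popular)   # 31.6
var(girls_popular)     # larger spread for the girls
var(boys_popular)
```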



Wednesday, October 5, 2016

plot.ly

In this post I explore some of the functionality and options of the plot.ly service as applied to a small set of celebrity social media follower data.

The pie chart view is not particularly useful or effective. Plotting the Facebook numbers for each celebrity generates the display below. It is difficult to visually compare the numbers across celebrities, and evaluating each segment takes much more time and mental effort than other graph options.



Switching to a bar chart we get the display below, which makes it easier to gauge the differences between the measures for each celebrity account. This view would be even quicker to digest if the numeric values were sorted before plotting.




The prior two graphs were created using the plot.ly 2.0 service, but some of its analysis and statistical functions are either not implemented or harder to access. Reverting to the original plot.ly, we can more easily sort the values before plotting, and it was more obvious how to plot the Facebook and Twitter values on the y-axis to generate the graph below.




The disparity between the Twitter and Facebook statistics for Shakira and Rihanna is interesting and raises the question of why fans of those two artists are not following them on Twitter at a rate similar to Facebook.

From within plot.ly it is also possible to generate descriptive statistics. An example from the Twitter data is below, showing a mean of 36,042,208.5 Twitter followers, a median of 37,133,201, and a standard deviation of 7,783,594.3.




Let’s take a look at these summary statistics in a bar plot, minus the variance, because that value is so large it would ruin our scale.





Sunday, October 2, 2016

Social Media


Using the NodeXL package was not possible because connecting to Twitter or other social media platforms requires purchasing the professional version. Wolfram Alpha was apparently having problems on its server backend: its web-based application at first seemed to connect successfully with my Facebook profile but then failed, indicating the system was having trouble and to try again later. When I visited the site later, it did not connect to my Facebook profile but instead presented the page-not-found message below.



This led me to check out the Gephi application for Windows. After installing Gephi and then the latest version of Java, I was finally able to connect to social media APIs to pull in data. Returning to my Twitter account, I added this application to my authorized apps and entered the necessary API keys into Gephi so it could retrieve live streaming Twitter data.



For the first test I set the “words to follow” as “#debatenight,” applied that to the Full Twitter Network, and then connected to Twitter. Tweets began to populate the data table and after allowing several minutes of data to build up I could inspect the graph showing the interconnections of the users and the contents of their tweets.





From the graph view it is possible to grab specific nodes, move them around or rearrange the positioning, and to select a node in order to then have it highlighted in the Data Table to inspect its values or content. To clean up the graph Gephi provides the ability to collapse or remove duplicate data. With that accomplished it becomes a little easier to zero in on centers of activity. 

The node with the most edges is, not surprisingly, the term used to perform the query. This node has been pulled to the far left of the graph display in the image above, making it easier to explore other relationships present within the data. However, since the data was extracted using a popular hashtag trending on Twitter, it contains posts from a wide variety of users and media sites who otherwise do not have strong social relationships.