Data Science and Visualization Explorations: September 2016

Sunday, September 25, 2016

Google Fusion Tables and Medicare Fine Data

The federal government makes datasets on Medicare facilities, survey results, and fines available at https://data.medicare.gov/. I downloaded the dataset on identified deficiencies at nationwide facilities, filtered the data for the state of Florida, and added a column containing the concatenated full address of each facility.

The Florida subset was saved to a local Excel document and then uploaded to Google Fusion Tables. With the creation of a new data column containing the concatenated full address for each facility, Fusion Tables successfully geolocated all 347 records.
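The filter-and-concatenate step can be sketched in a few lines of Python. This is only an illustration, not the Excel workflow actually used: the column names (address, city, state, zip) and the two sample rows are hypothetical, and the real Medicare export may label its fields differently.

```python
import csv
import io

# Hypothetical sample rows and column names, for illustration only.
SAMPLE = """provider_name,address,city,state,zip
Example Care Center,100 Main St,Tampa,FL,33602
Sample Nursing Home,5 Oak Ave,Atlanta,GA,30303
"""


def rows_with_full_address(csv_text, state="FL"):
    """Keep rows for one state and add a single geocodable address column."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["state"].strip().upper() != state:
            continue  # skip facilities outside the requested state
        row["full_address"] = "{}, {}, {} {}".format(
            row["address"].strip(),
            row["city"].strip(),
            row["state"].strip(),
            row["zip"].strip(),
        )
        kept.append(row)
    return kept


florida = rows_with_full_address(SAMPLE)
print(florida[0]["full_address"])
```

Putting street, city, state, and zipcode into one field gives a geocoder a single unambiguous string per facility rather than several partial columns.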

With the records imported, Fusion Tables rendered the point map below from the 347 records of Florida facilities assessed fines.

After zooming in, we can see a more detailed view and select individual facility points for details.

Switching to Fusion Tables' Heatmap view offers the visualization of the Florida data below.

Sunday, September 18, 2016

For the three images below, we determine and describe why each one is Descriptive, Inferential, or Predictive.

The graphs in this first set are all Descriptive. The pie chart depicts the distribution of the categories Yellow, Green, and Red in a given population. The bar chart depicts the count in each color category, the histogram shows the number of elements at each age from 20 through 27, and the line chart shows the number of new freshmen each year between 1994 and 2006. All four are different ways to describe and visually depict measured quantities or events.

This figure is Predictive: it shows scores for three projects over time from left to right and then, at the far right of the chart, projects the scores into the future. The chart uses some type of analytics to predict future scores based on unspecified inputs. Linear regression is one of the most common forecasting techniques; it fits a straight line to the known measurements and extends that line to estimate future values.
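To make the idea of fitting a line and extending it concrete, here is a minimal least-squares sketch in Python. The periods and scores are made-up numbers, not values read from the figure, and the chart itself may well have used a different forecasting method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical project scores for five past periods.
periods = [1, 2, 3, 4, 5]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

slope, intercept = fit_line(periods, scores)
forecast = slope * 6 + intercept  # projected score for the next period
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The forecast is only as good as the assumption that the trend stays linear, which is why such projections are usually drawn with a visibly different style from the measured data.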

This final visualization is Inferential because it draws on results sampled from a population in order to assess the opinions of the population as a whole. It is often not possible to measure the opinions of an entire population. Even where it is possible, taking a smaller sample is often more efficient and still suitably accurate.
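As a rough illustration of why a sample can be "suitably accurate", the sketch below computes an approximate 95% interval for a sample proportion using the normal approximation. The survey numbers (520 of 1,000 respondents) are hypothetical and are not taken from the visualization.

```python
import math


def proportion_interval(successes, sample_size, z=1.96):
    """Normal-approximation interval for a population proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p = successes / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical poll: 520 of 1,000 sampled people agree with a statement.
estimate, low, high = proportion_interval(520, 1000)
print(f"{estimate:.3f} ({low:.3f} to {high:.3f})")
```

With a sample of this size the margin is about three percentage points either way. It shrinks only with the square root of the sample size, so measuring everyone buys relatively little extra precision.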

Monday, September 12, 2016

In his blog post on the Interworks site, Robert Rouse explored the evolution of political divides in the United States over the past 20 years based on Pew Research Center data. Mr. Rouse created the visualization using Tableau. In the post he discussed some of the design and implementation decisions he made along the way and noted sources of inspiration from other visualizations he had seen.

Mr. Rouse spends a couple of paragraphs on his use of color, the special issues that color raises when presenting political party data, and his search for neutral colors that would not introduce bias on political topics or party affiliations. The choice of colors, the design of titles and annotations, and the layout of the visualization all required careful thought to avoid bias pitfalls, distractions, and misleading or ambiguous presentation of the data.

Source: https://www.interworks.com/blog/rrouse/2016/06/24/politics-viz-contest-plotting-political-polarization

In the entry “Data Visualization or Data Interaction?” on his blog “Fell in Love with Data”, Enrico Bertini discusses the value of visualizing data in an interactive, exploratory mode. The computer becomes a means to spot features in the data and then “ask” further questions through an iterative process of discovery and visualization.

Source: http://fellinlovewithdata.com/uncategorized/data-interaction

In his post, Raffael Marty made available the slide deck he used at a recent security conference. Marty has worked on the visualization of network security data for many years and in 2008 wrote a book, Applied Security Visualization, in which he presents a variety of visualizations for different types of information security logs and metadata. In his presentation he talks about the different uses of visualization, such as presenting versus discovering, and about using data mining techniques to process large amounts of data and pull out statistical outliers. He then covers visualization techniques that let a human analyst interact with the data to find potentially malicious activity.

Source: http://raffy.ch/blog/2016/02/09/creating-your-own-threat-intel-through-hunting-visualization/

Sunday, September 4, 2016

Working with CSV Data

The “FL_ORGANIZATION_FILE.csv” dataset is a large 37 MB CSV file from an Apple operating system with 257,567 rows of data plus one header row and 24 columns. Microsoft Excel 2013 can load it despite its size. Per Microsoft’s specifications, Excel 2016 and Excel 2013 can handle worksheets of up to 1,048,576 rows by 16,384 columns (source).

Google Sheets was unable to import the file as a CSV into a blank worksheet. Sheets can handle up to 2 million cells per spreadsheet, and this file contains more than 6 million cells (257,568 x 24 = 6,181,632), so it is beyond what Google Sheets can handle. To use Google Sheets we would need to divide the data into files that each fit below the 2 million cell limit, or identify columns we could exclude from the work. If we could limit ourselves to 7 columns, for example, we could fit within the limit in one file (257,568 x 7 = 1,802,976).
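The cell arithmetic is easy to check in a short Python sketch. The row and column counts and the 2 million cell limit are the figures quoted in this post; the variable names are only illustrative, and the file count is a rough estimate that ignores repeating the header row in each file.

```python
import math

ROWS = 257_567 + 1             # data rows plus the header row
COLUMNS = 24
SHEETS_CELL_LIMIT = 2_000_000  # per-spreadsheet limit quoted in this post

total_cells = ROWS * COLUMNS

# Option 1: keep every row but drop columns until the file fits.
max_columns = SHEETS_CELL_LIMIT // ROWS

# Option 2: keep every column but split the rows across several files.
rows_per_file = SHEETS_CELL_LIMIT // COLUMNS
files_needed = math.ceil(ROWS / rows_per_file)

print(total_cells, max_columns, rows_per_file, files_needed)
```

Either the file is trimmed to at most 7 columns, or all 24 columns are kept and the rows are split across 4 files of at most 83,333 rows each.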

R was able to read in the file:

Finding ways to visualize this data from a spreadsheet or CSV form can be a challenge. The first step is figuring out what to start with: what do we want to find out about this data? Geographic density based on the city, state, or zipcode? Time-based statistics on start or end dates for organizations? Interest types? Multiple registrations identified by mailing address, phone number, email, etc.?

For one exploration I plotted the distribution of zipcodes present in the data in R:

This is a rough visualization, but it provides some starting points for understanding the data in the CSV file. We can see a concentration of zipcodes in the 31000-36000 range, which makes sense for Florida-registered organizations, but the plot also shows a number of Florida-registered organizations with mailing address zipcodes outside of the "expected" range.
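A similar check can be sketched in Python without plotting, by counting three-digit zipcode prefixes and listing the values that fall outside Florida. The sample values are hypothetical, and the test assumes that Florida zipcodes run from 32000 through 34999. This is not the R code used for the plot.

```python
from collections import Counter


def zip_prefix(zipcode):
    """Return the first three digits of a ZIP or ZIP+4 string, or None if malformed."""
    digits = zipcode.strip().split("-")[0]
    if len(digits) != 5 or not digits.isdigit():
        return None
    return digits[:3]


def is_florida_zip(zipcode):
    """Assumption: Florida zipcodes have prefixes 320 through 349."""
    prefix = zip_prefix(zipcode)
    return prefix is not None and 320 <= int(prefix) <= 349


# Hypothetical mailing zipcodes, including a ZIP+4 value and a malformed one.
sample = ["33602", "32801-1234", "34102", "10001", "90210", "3360"]

prefix_counts = Counter(zip_prefix(z) for z in sample)
outside = [z for z in sample if not is_florida_zip(z)]
print(prefix_counts.most_common(3))
print(outside)
```

Keeping zipcodes as strings rather than numbers preserves leading zeros and ZIP+4 suffixes, both of which are lost or mangled when the column is read as numeric.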