Sunday, September 25, 2016
Google Fusion Tables and Medicare Fine Data
The federal government makes datasets on Medicare facilities, survey results, and fines available at https://data.medicare.gov/. I downloaded the dataset on identified deficiencies at nationwide facilities, filtered it for the state of Florida, and added a column containing the concatenated full address of each facility.
I saved the Florida subset to a local Excel workbook and then uploaded it to Google Fusion Tables. With the new concatenated address column in place, Fusion Tables successfully geolocated all 347 records.
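The filter-and-concatenate step described above can be sketched in a few lines of Python. This is only an illustration: the field names ("address", "city", "state", "zip") and the sample record are assumptions, not the actual column names in the Medicare download.

```python
def full_address(record, fields=("address", "city", "state", "zip")):
    """Join the address parts of one record into a single geocodable string.

    The field names are hypothetical placeholders for the real columns.
    """
    parts = [str(record.get(field, "")).strip() for field in fields]
    # Skip empty parts so we never emit ", ," in the result.
    return ", ".join(part for part in parts if part)


def florida_with_addresses(records):
    """Keep only Florida records and add a 'full_address' column to each."""
    result = []
    for record in records:
        if record.get("state") == "FL":
            enriched = dict(record)
            enriched["full_address"] = full_address(record)
            result.append(enriched)
    return result


# Made-up sample rows, only to show the shape of the output.
sample = [
    {"address": "123 Main St", "city": "Tampa", "state": "FL", "zip": "33602"},
    {"address": "9 Elm Ave", "city": "Atlanta", "state": "GA", "zip": "30303"},
]
for row in florida_with_addresses(sample):
    print(row["full_address"])
```

A single combined address string gives a geocoder one unambiguous field to work with, which is why the extra column helps.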
With the records imported, Fusion Tables rendered the point map below from the 347 records of Florida facilities assessed fines.
After zooming in, we can see a more detailed view and select individual facility points for details.
Switching to Fusion Tables' Heatmap view offers the visualization of the Florida data shown below.
Sunday, September 18, 2016
For the three images below we determine and describe why each one
is Descriptive, Inferential, or Predictive.
The graphs in this first set are all Descriptive. The pie chart
depicts the distribution of the categories Yellow, Green, and Red in a given
population. The bar chart depicts the number in each color category, the
histogram shows the number of elements broken out by age from 20 through 27, and
the line chart shows the number of new freshmen each year between 1994 and
2006. All four are different ways to describe and visually depict measured quantities
or events.
This figure is Predictive, as it shows scores for three
projects over time from left to right and then, at the far right of the chart,
projects the scores into the future. The chart uses some type of
analytics to predict future scores based on unspecified inputs. Linear regression
is one of the most common forecasting techniques: it fits a straight line
to the known measured data and extends that line in
order to predict future values.
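The line-fitting idea described above can be sketched with an ordinary least-squares fit. The scores below are made-up illustrative numbers, not values read from the figure, and nothing here is meant to suggest how the chart's author actually produced the projection.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of co-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Extend the fitted line to a new x value."""
    return slope * x + intercept


# Hypothetical scores for one project over four measured periods.
periods = [1, 2, 3, 4]
scores = [60, 64, 68, 72]

slope, intercept = fit_line(periods, scores)
print("projected score for period 5:", predict(slope, intercept, 5))
```

A straight line is the simplest possible model; real forecasts often need to account for curvature, seasonality, or uncertainty around the projection.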
This final visualization is Inferential, as it draws
on results sampled from a population in order to assess the opinions of the
population as a whole. It is often not possible to measure the opinions of an
entire population, and even where it is possible,
it is often more efficient and suitably accurate to measure a smaller
sample of the population instead of the entirety.
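To make the "suitably accurate" point concrete, here is a minimal sketch of the standard normal-approximation confidence interval for a sample proportion. The survey numbers (520 of 1,000 respondents) are invented for illustration and do not come from the visualization.

```python
import math


def proportion_interval(successes, sample_size, z=1.96):
    """Approximate 95% confidence interval for a population proportion.

    Uses the normal approximation, which is reasonable for large samples
    with proportions that are not too close to 0 or 1.
    """
    p = successes / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    margin = z * standard_error
    return p - margin, p + margin


# Hypothetical survey: 520 of 1,000 sampled people hold a given opinion.
low, high = proportion_interval(520, 1000)
print(f"estimated share: between {low:.3f} and {high:.3f}")
```

Under these assumptions, a sample of only 1,000 people pins down the population share to within roughly three percentage points, regardless of how large the population is.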
Monday, September 12, 2016
In his blog post on
the InterWorks site, Robert Rouse explored the evolution of political divides in
the United States over the past 20 years based on Pew Research Center data. Mr.
Rouse created the visualization using Tableau, and in his post he discussed some of
the design and implementation decisions he made along the way and noted
sources of inspiration from other visualizations he had seen.
Mr. Rouse spends a couple of
paragraphs on his use of color, the special issues of color that come up when
discussing political parties, and finding neutral colors that would not
introduce any bias on political topics or party affiliations. Selecting colors,
designing titles and annotations, and laying out the visualization all
required careful thought to avoid bias pitfalls, distractions, and the misleading
or ambiguous presentation of data.
In the “Data Visualization or Data Interaction?” entry on his blog “Fell
in Love with Data,” Enrico Bertini
discusses the value of visualizing data in an interactive,
exploratory mode: using the computer as a means to spot features in the data
and then “ask” further questions through an iterative process of discovery and
visualization.
In his post, Raffael Marty
made available the slide deck he used at a recent security conference. Marty
has worked on visualization of network security data for many years and wrote a
book, Applied Security Visualization, in 2008, in which he presents a variety of
visualizations for different types of information security logs and
metadata. In his presentation he talks about the different uses of
visualization, such as presenting versus discovering; the use of data mining techniques
to process large amounts of data and pull out statistical outliers; and
the use of different visualization techniques to let a human analyst interact with the
data to find potentially malicious activity of concern.
Source: http://raffy.ch/blog/2016/02/09/creating-your-own-threat-intel-through-hunting-visualization/
Sunday, September 4, 2016
Working with CSV Data
The “FL_ORGANIZATION_FILE.csv” dataset file is a large 37MB CSV
formatted file from an Apple O/S with 257,567 rows of data plus one header row
and 24 columns. Microsoft’s Excel 2013 can load it despite its size. Per
Microsoft’s specifications, Excel 2016 and Excel 2013 can handle worksheets of up
to 1,048,576 rows by 16,384 columns (source).
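Before opening a large CSV in a spreadsheet, its dimensions can be checked against the Excel limits quoted above. The following is a minimal sketch using Python's standard csv module; the tiny in-memory sample stands in for the real file, and the helper names are my own.

```python
import csv
import io

# Worksheet limits quoted above for Excel 2013 and Excel 2016.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def csv_shape(lines):
    """Return (row_count, widest_column_count), header row included."""
    rows = 0
    cols = 0
    for record in csv.reader(lines):
        rows += 1
        cols = max(cols, len(record))
    return rows, cols


def fits_in_excel(rows, cols):
    """True if a table of this size fits in a single worksheet."""
    return rows <= EXCEL_MAX_ROWS and cols <= EXCEL_MAX_COLS


# In-memory stand-in for a real file.
sample = io.StringIO("name,city\nAcme,Tampa\nBeta,Miami\n")
shape = csv_shape(sample)
print(shape, fits_in_excel(*shape))
```

For the real file, the same function could be given a handle from `open("FL_ORGANIZATION_FILE.csv", newline="")`; reading row by row keeps memory use low even for a 37MB file.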
Google Sheets was unable to import the file as a CSV into a
blank worksheet. Sheets can handle up to 2 million cells per spreadsheet, and
this file contains more than 6 million cells (257,568 x 24 = 6,181,632), so it
is beyond what Google Sheets can handle. If we wanted to use Google Sheets, we
would need to divide the data into files of fewer than 2 million cells each or
identify columns we could exclude from the work. If we could limit ourselves to
7 columns, for example, we could fit within the 2 million cell limit in one
file (257,568 x 7 = 1,802,976).
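The arithmetic behind these limits is easy to check. This short sketch uses the row and column counts from the post and the 2 million cell limit quoted above; the variable names are my own.

```python
import math

SHEETS_CELL_LIMIT = 2_000_000  # per-spreadsheet limit quoted in the post

data_rows = 257_567
total_rows = data_rows + 1     # plus one header row
columns = 24

total_cells = total_rows * columns
print("total cells:", total_cells)

# Option 1: keep every row but drop columns until the file fits.
max_columns = SHEETS_CELL_LIMIT // total_rows
print("columns that fit in one file:", max_columns)

# Option 2: keep all 24 columns and split the rows across several files,
# repeating the header row at the top of each file.
rows_per_file = SHEETS_CELL_LIMIT // columns
data_rows_per_file = rows_per_file - 1
files_needed = math.ceil(data_rows / data_rows_per_file)
print("files needed with all columns:", files_needed)
```

Either option works; which one is preferable depends on whether the analysis needs all 24 columns at once.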
Finding ways to visualize this data from a spreadsheet or
CSV form can be a challenge. The first step is deciding what to start with: what
do we want to find out about this data? Geographic density based on the city,
state, or zipcode? Time-based statistics on start or end dates for
organizations? Interest types? Multiple registrations identified by
mailing address, phone number, email, etc.?
For one exploration I plotted the distribution of zipcodes
present in the data in R:
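A similar tally can also be sketched in Python. This is not the R code used for the plot; the column name "ZIP" and the sample rows are assumptions made for illustration, and the output is a crude text histogram rather than a chart.

```python
from collections import Counter


def zip_counts(rows, column="ZIP"):
    """Count records per 5-digit zipcode, ignoring blank or malformed values.

    The column name is a hypothetical placeholder for the real header.
    """
    counts = Counter()
    for row in rows:
        # Trim ZIP+4 values such as "33602-1234" down to their first 5 digits.
        digits = (row.get(column) or "").strip()[:5]
        if len(digits) == 5 and digits.isdigit():
            counts[digits] += 1
    return counts


# Made-up rows; real rows could come from csv.DictReader over the file.
sample_rows = [
    {"ZIP": "33602"},
    {"ZIP": "33602-1234"},
    {"ZIP": "32801"},
    {"ZIP": ""},
]

counts = zip_counts(sample_rows)
for code, n in sorted(counts.items()):
    print(code, "#" * n)
```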