Sunday, September 4, 2016

Working with CSV Data

The “FL_ORGANIZATION_FILE.csv” dataset file is a large 37MB CSV formatted file from an Apple O/S with 257,567 rows of data plus one header row and 24 columns. Microsoft’s Excel 2013 can load it despite this large size. Per Microsoft’s specifications Excel 2016 and Excel 2013 can handle worksheets up to 1,048,576 rows by 16,384 columns ( source ).





Google Sheets was unable to import the file as a CSV into a blank worksheet. Sheets can handle up to 2 million cells for spreadsheets and this file contains more than 6 million cells (257,568 x 24 = 6,181,632) so it is beyond what Google Sheets can handle. If we wanted to use Google Sheets we could need to divide up the data to fit below the 2 million cells per file or identify columns we could exclude from the work. If we could limit ourselves to 7 columns we could fit within the 2 million cell limit in one file, for example.

R was able to read in the file:





Finding ways to visualize this data from a spreadsheet or CSV form can be a challenge. Trying to figure out what the start with; what is it we want to find out about this data? Geographic density based on the city, state, or zipcode? Time-based statistics on start or end dates for organizations? Interest types? Identifying multiple registrations based on mailing address, phone number, email, etc?

For one exploration I plotted the distribution of zipcodes present in the data in R:






This is a rough visualization but provides some starting points for understanding some of the data in the CSV file. We can see there is a concentration of zipcodes in the 3100-3600 range, which may make sense for Florida registered organizations, but it also demonstrates that there are a number of Florida registered organizations with mailing address zipcodes outside of the "expected" range.




No comments:

Post a Comment