Learning Objectives

Following this assignment students should be able to:

  • understand the basic rules of tidy data
  • implement quality control for data entry in spreadsheets
  • create an SQL database by importing data

Reading

Lecture Notes

  1. Course Introduction
  2. Demo Code for Where Students Can Get in the Course
  3. Tidy Data
  4. Data Entry
  5. Introduction to Databases

Exercises

  1. -- Improving Messy Data --

    A lot of real data isn’t very tidy, mostly because most scientists aren’t taught about how to structure their data in a way that is easy to analyze.

    Download a messy version of some of the Portal Project data. Note that there are multiple tabs in this spreadsheet.

    Think about what could be improved about this data. In a text file (to be turned in as part of the assignment):

    1-5. Describe five things about this data that are not tidy and how you could fix each of those issues.

    6. Could this data easily be imported into a database in its current form?

    7. Do you think it’s a good idea to enter the data like this and clean it up later, or to have a good data structure for analysis by the time data is being entered? Why?

  2. -- Data entry validation in Excel --

    You’re starting a new study of small mammals at the NEON site at Ordway-Swisher. Create a spreadsheet in Excel for data entry. It should have four columns: Year, Site, Species, and Mass.

    Set the following data validation criteria to prevent any obviously wrong data from getting entered:

    1. Year must be an integer between 2015 and 2025.
    2. Site should be one of the following A1, A2, B1, B2.
    3. Species should be one of the following Dipodomys spectabilis, Dipodomys ordii, Dipodomys merriami.
    4. Mass should be a decimal greater than or equal to zero but less than or equal to 500 since mass is measured in grams in this study and nothing bigger than half a kilogram will possibly fit into your Sherman traps. Change the error message on this validation criteria to explain why data is invalid and what the valid values are.

    Save this file as yourname_ordway_mammal_data.xlsx.

  3. -- Importing Data --

    This example will walk you through how to get data that already exists into SQLite.

    1. Download the main table for the Portal Project mammal database. It’s kind of large so it might take a few seconds. The full database is available on GitHub and updated monthly. We’ll use a simplified version in class.
    2. Create a new database by clicking on New Database in the Database drop down menu. Type portal_mammals for the name, click OK and choose where to save the file.
    3. Click on the Import icon.
    4. Click on Select File and navigate to where you saved the data file and select it.
    5. Select CSV. You can also import from SQL databases or modify the Fields separated or enclosed by. Make sure to select First row contains column names.
    6. Click OK when it asks if you want to modify the data.
    7. Name the table that you are importing into surveys.
    8. Identify the type for each field, using the Data Type drop-down menus. If it is not obvious if the data type is an INTEGER or VARCHAR for each variable, check the metadata. Important: if you specify the wrong data type it can cause some data to not be imported and/or prevent you from doing some kinds of data manipulations.
    9. Select recordID as the Primary Key and click OK.
    10. Click OK when asked if you are sure you want to import the data.
    11. Now import the plots, and species tables.
    [click here for output] [click here for output] [click here for output]