In-class Activities - Week 6

QA/QC 2: Using OpenRefine to clean data

How many ways can you spell the word...

OpenRefine is a powerful, free, and open-source tool for working with and cleaning messy data. We will work through some of OpenRefine’s basic features, after which you will try them on your own with a new data set.

Note: If a new tab or window doesn’t open when you start OpenRefine, you can access it at this URL: http://127.0.0.1:3333/

1. Intro to OR

2. Working with OR

  1. To download the data file for the tutorial to your computer, click this link.

3. Filtering and Sorting

Break

5. Examining Numbers

6. Using Scripts, Exporting, and Saving

7. Wrap-up, Questions

Assignment

Now it’s your turn. Download this CSV file and use OpenRefine to clean it up. After you create a project, edit the data as follows:

  1. Correct and standardize the names of the countries in which the rodents were captured.

  2. The column scientificName contains two pieces of information (the Genus and species of each animal). Split this into two columns, rename them as genus and species, and then correct and standardize the data in each column as needed. NB: You may run into an obstacle when you try to rename the columns. How can you get around it?

  3. Save the clean data as a CSV file on your desktop.

  4. Extract your steps (i.e., the ‘operation history’) as JSON and save them as a text file.

  5. Bonus Brainteaser: Many of the cells in the column for the Latin binomial are blank. How might you go about filling them in based on the column with the abbreviation?

  6. Submission: Submit your clean .csv and the JSON text file as week6_hw on Canvas.
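If you want to check your OpenRefine results against an independent implementation, the logic behind items 2 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the expected solution: the column name `species_id` for the abbreviation column and the three sample rows are made up for the example, and the assignment itself should still be completed in OpenRefine so that you can export the operation history.

```python
def split_scientific_name(name):
    """Split 'Genus species' into (genus, species).

    Collapses stray whitespace, capitalises the genus, and lowercases the species.
    """
    parts = name.strip().split(None, 1)  # split once, on any run of whitespace
    genus = parts[0].capitalize() if parts else ""
    species = parts[1].strip().lower() if len(parts) > 1 else ""
    return genus, species


def fill_from_abbreviation(rows, abbrev_col, name_col):
    """Fill blank name cells using the first non-blank name seen for each abbreviation."""
    lookup = {}
    for row in rows:
        if row[name_col].strip():
            lookup.setdefault(row[abbrev_col], row[name_col].strip())
    return [
        {**row, name_col: row[name_col].strip() or lookup.get(row[abbrev_col], "")}
        for row in rows
    ]


# Hypothetical sample rows; "species_id" is an assumed name for the abbreviation column.
rows = [
    {"species_id": "DM", "scientificName": "Dipodomys merriami"},
    {"species_id": "DM", "scientificName": ""},
    {"species_id": "PF", "scientificName": "  perognathus  Flavus "},
]

filled = fill_from_abbreviation(rows, "species_id", "scientificName")
for row in filled:
    row["genus"], row["species"] = split_scientific_name(row["scientificName"])
```

The fill step runs before the split so that rows with a blank binomial also receive a genus and species. The same order is worth considering in OpenRefine.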

Grading Rubric:

Data corrected and JSON file can be used on another data set: 50
Most data correction properly programmed; some require instructor follow-up: 40
Many of the corrections missing; JSON file unable to process new data: 30
Instructor follow-up required to implement most changes: 20

Sources for this lesson

OpenRefine Home

Tutorials

GREL Cheatsheets

R Tools

  • The rrefine package allows you to do some OpenRefine tasks from within R, such as import, export, apply data cleaning operations, or delete a project in OpenRefine directly from R. In other words, it’s for repeating operations in R after you’ve worked with OpenRefine.