In-class Activities - Week 14

Automated Data Extraction

Efficiency Calculator, via xkcd

Part 1: Scraping Data from Tables on Websites

  1. Import Wikipedia tables into Google Sheets.
    a. Open the Wikipedia page "List of countries by population (United Nations)".
    b. Open a Google Sheet, click on cell A1, and then enter this in the formula bar:

    =ImportHtml("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)", "table", 1)

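    c. Explanation: the first argument is the URL of the page, the second ("table") tells Sheets to look for tables rather than lists, and the 1 at the end is the index of the table on that page that you want to import, counting from 1. Be sure to include straight quotation marks around both the URL and "table".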

  2. Scrape tables from websites into R with rvest (Live Coding).

    a. Scraping a table from a generic webpage: link to code used in class.

    b. A tutorial on importing a Wikipedia table into R can be found here.
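
As a rough illustration of the rvest workflow (not the code used in class), the sketch below parses a small inline HTML page so that it runs without a network connection. The table contents and country names are made up for the example; for a live site you would replace `minimal_html(...)` with `read_html()` and the URL of the page.

```r
# Minimal sketch: parse an HTML table into a data frame with rvest.
library(rvest)

# Stand-in for a downloaded page (made-up data). For a real site use:
#   page <- read_html("<URL of the page>")
page <- minimal_html('
  <table>
    <tr><th>country</th><th>population</th></tr>
    <tr><td>Aland</td><td>1,200</td></tr>
    <tr><td>Borduria</td><td>3,400</td></tr>
  </table>
')

# html_elements() returns every <table> on the page; pick one by position,
# much like the index argument of ImportHtml in Google Sheets.
tables <- html_elements(page, "table")
pop <- html_table(tables[[1]])

# Numbers with thousands separators are read as text, so strip the
# commas before converting to numeric.
pop$population <- as.numeric(gsub(",", "", pop$population))

print(pop)
```

On a real page the hard part is usually finding which element of `tables` holds the data you want; printing `length(tables)` and inspecting a few candidates is a reasonable first step.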

Part 2: Downloading Books & Text Mining (Live Coding)

  1. Download from Project Gutenberg, Text Extraction, Sentiment Analysis, and Visualizations: link to code used in class
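
The text extraction and sentiment steps can be sketched in base R as follows. This is not the class code: the two sentences stand in for a downloaded book and the four-word lexicon is hypothetical, chosen only to show the mechanics of lexicon-based sentiment counting. Packages such as gutenbergr and tidytext are a common choice for doing this on whole books with published lexicons.

```r
# Toy text standing in for a downloaded book (made up for this sketch)
text <- c("It was a happy and wonderful morning.",
          "Then the terrible storm came, and everyone was sad.")

# Text extraction: lower-case, split on anything that is not a letter,
# and drop the empty strings left by the split
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[words != ""]

# Hypothetical mini-lexicon; a real analysis would use a published one
lexicon <- data.frame(
  word      = c("happy", "wonderful", "terrible", "sad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Keep only the words that appear in the lexicon, then count by sentiment
matched <- merge(data.frame(word = words), lexicon, by = "word")
counts  <- table(matched$sentiment)
print(counts)

# Net sentiment: number of positive words minus number of negative words
net <- unname(counts["positive"] - counts["negative"])
print(net)
```

A call such as `barplot(counts)` gives a quick first visualization of the result.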

Part 3: Extract data from published figures

  1. Use WebPlotDigitizer (WPD) to extract data from several figures that differ in quality and content. Download the following three figures; we will extract the data from them using WPD:

Part 4: Extract data from pdfs/images with OCR

For submission to Canvas: SmallPDF, OCR2Edit, and pdf2go are three widely used sites that allow you to use Optical Character Recognition (OCR) to extract text and numbers from .pdf or image files (e.g., .tiff, .jpg). Use these websites to:

  1. Extract text from these pdfs and images

  2. Import the data from these images and PDFs into an Excel sheet

  3. Compare the results: are some sites better than others at certain kinds of images? How much does image quality matter?

  4. Upload the OCR scans and your (brief) conclusions to Canvas.

  5. Bonus (if time permits)

OPTIONAL: Learning R Markdown

What is R Markdown? Short version: it’s a way to convert your R Markdown file (with text and code) into any one of several output formats including: HTML, PDF, MS Word, slides for presentations (Beamer, HTML5 slides), books, dashboards, scientific articles, and websites. You can find the complete overview here. A gallery of the different things you can do with R Markdown can be found here.

How does it work?

“When you knit the document, R Markdown sends the .Rmd file to knitr, http://yihui.name/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, http://pandoc.org/, which is responsible for creating the finished file. The advantage of this two step workflow is that you can create a very wide range of output formats.”
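
The two-step workflow described in the quote can be reproduced by hand. The sketch below writes a tiny .Rmd file to a temporary location (the file contents are made up for the example), runs the knitr step explicitly, and then, only if the rmarkdown package and pandoc are available, renders the finished HTML. In normal use, `rmarkdown::render()` or the Knit button performs both steps for you.

```r
library(knitr)

# Build the chunk delimiter programmatically so this example
# does not contain a literal code fence
fence <- strrep("`", 3)

# A tiny R Markdown document: YAML header, text, and one code chunk
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Demo\"",
  "---",
  "",
  "Some text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd)

# Step 1: knitr executes the chunks and writes a markdown (.md) file
# that contains both the code and its output
md <- sub("\\.Rmd$", ".md", rmd)
knit(rmd, output = md, quiet = TRUE)
md_lines <- readLines(md)
print(md_lines)

# Step 2: pandoc turns the markdown into the finished file.
# render() reruns step 1 internally, then calls pandoc.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd, output_format = "html_document", quiet = TRUE)
  print(out)
}
```

Looking at the intermediate .md file is a useful debugging habit: if the output looks wrong there, the problem is in a code chunk; if it looks right there but wrong in the final file, the problem is in the pandoc or LaTeX step.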

1. Tutorials & Resources

2. Install the following packages:

3. If you want to render to pdf, you will also need to install some additional tools:

  • Pandoc. After installation, you should add the path of the Pandoc executable to the system PATH.
  • A TeX distribution. On a PC you can use TeX Live or MiKTeX; on a Mac, use MacTeX.
  • Just in case, here is TeXShop
  • Note: TeX Live is designed to be cross-platform (running on Unix, macOS, and Windows); MacTeX includes Mac-specific utilities and front-ends (such as TeXShop and BibDesk).
  • Suggestions for the YAML (including latex packages and commands) to start customizing your pdf: link to sample YAML
  • If you are writing a scientific paper, the rticles package provides a suite of custom R Markdown templates for different journals. It is really convenient and easy to use.
  • Even more useful is the papaja package. This provides a template for preparing journal articles in the APA style, and is very customizable (it’s the default I use for most of my papers).
  • You can easily insert citations in your R Markdown document using the citr package. You can insert citations directly with the RStudio add-in and link to Zotero libraries.
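
Before rendering to PDF, it can help to confirm that the tools listed above are actually visible on the system PATH. The sketch below is one way to check from R; it assumes the LaTeX engine is `pdflatex` (the R Markdown default), so substitute another engine name if your YAML specifies one.

```r
# Look up the executables on the system PATH;
# Sys.which() returns "" for anything it cannot find
tools <- Sys.which(c("pandoc", "pdflatex"))

for (name in names(tools)) {
  if (nzchar(tools[[name]])) {
    message(name, " found at: ", tools[[name]])
  } else {
    message(name, " not found on PATH")
  }
}
```

RStudio ships with its own copy of Pandoc, so a "not found" message for pandoc does not necessarily mean knitting will fail inside RStudio; `rmarkdown::pandoc_available()` reports whether rmarkdown itself can locate a copy.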