In-class Activities - Week 14

Automated Data Extraction

Efficiency Calculator, via xkcd

Part 1: Scraping Data from Tables on Websites

Import Wikipedia tables into Google Sheets.
a. Open Wikipedia page of List of Countries by Population (UN)
b. Open a Google Sheet, click on cell A1, and then enter this in the formula bar:

=ImportHtml("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)", "table", 1)

c. Explanation: the first argument is the URL of the page, “table” tells Sheets to look for tables (rather than lists), and the 1 at the end is the position of the table on that page that you want to import (counting from 1). Be sure to include the quotation marks around the URL and “table”.

Scrape tables from websites into R with rvest (Live Coding).

a. Scraping a table from a generic webpage: link to code used in class.

b. A tutorial on Importing a Wikipedia table to R can be found here.
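
To make the idea behind both ImportHtml and rvest concrete, here is a minimal sketch of table scraping: find every table in a page's HTML, then pick one by its position. It is written in Python with only the standard library, purely as an illustration; it is not the code used in class. The names `TableExtractor`, `import_html_table`, and `SAMPLE_PAGE` are made up for this example, and the sample page is invented data. The sketch does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows (lists of cell text)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table (counting from 1, like ImportHtml)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Invented page with two tables, used only for illustration.
SAMPLE_PAGE = """
<table><tr><th>Note</th></tr><tr><td>Sidebar</td></tr></table>
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Freedonia</td><td>1,200</td></tr>
  <tr><td>Sylvania</td><td>950</td></tr>
</table>
"""

rows = import_html_table(SAMPLE_PAGE, 2)
for row in rows:
    print(row)
```

The table you want is often not the first one on the page, which is why the position argument matters: here, position 1 is a sidebar and position 2 is the data table.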

Part 2: Downloading Books & Text Mining (Live Coding)

Download from Project Gutenberg, Text Extraction, Sentiment Analysis, and Visualizations: link to code used in class
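
One common way to do sentiment analysis is to count how many words in a text appear in a list of positive or negative words. The sketch below illustrates that idea in Python with only the standard library; the class code may use a different method. The word lists, the sample sentence, and the function names are all invented for illustration; real analyses rely on published sentiment lexicons.

```python
import re
from collections import Counter

# Tiny invented lexicon, for illustration only.
POSITIVE = {"happy", "joy", "love", "delight"}
NEGATIVE = {"sad", "fear", "hate", "misery"}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_counts(text):
    """Count positive and negative words and report the net score."""
    words = Counter(tokenize(text))
    pos = sum(n for w, n in words.items() if w in POSITIVE)
    neg = sum(n for w, n in words.items() if w in NEGATIVE)
    return {"positive": pos, "negative": neg, "net": pos - neg}


# Invented sentence standing in for a passage from a downloaded book.
SAMPLE_TEXT = "It was a happy day full of joy, though fear and misery and hate followed."
result = sentiment_counts(SAMPLE_TEXT)
print(result)
```

Running the same count on each chapter of a book gives a series of net scores that can then be plotted.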

Part 3: Extract data from published figures

We will use WebPlotDigitizer (WPD) to extract data from several figures that differ in quality and content. Download the following three figures:
- Bar graph from Jiang et al. (2021)
- Simple scatter plot from Pereira (2018)
- Complex scatter plot from Pereira (2018)
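
Plot digitizers work by calibration: you click two points on each axis whose values you know, and every other pixel position is converted to a data value by interpolation. The sketch below shows that step in Python as a simplified illustration, not WPD's actual implementation. It assumes the axes are parallel to the image edges, and the function name and all pixel positions are invented.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Build a function that maps a pixel coordinate to a data value on one axis."""
    if log_scale:
        # On a log axis, interpolate in log10 space.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        value = value_a + slope * (pixel - pixel_a)
        return 10 ** value if log_scale else value

    return to_value


# Invented calibration: x pixel 100 is 0 and x pixel 500 is 10;
# y pixel 400 is 0 and y pixel 50 is 70 (image y grows downward).
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
y_of = make_axis_mapper(400, 0.0, 50, 70.0)

# A data point clicked at pixel (300, 225).
point = (x_of(300), y_of(225))
print(point)
```

Because everything depends on the two calibration clicks per axis, low-resolution or skewed figures give noisier results.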

Part 4: Extract data from PDFs/images with OCR

For submission to Canvas: SmallPDF, OCR2Edit, and pdf2go are three widely used sites that allow you to use Optical Character Recognition (OCR) to extract text and numbers from .pdf or image files (e.g., .tiff, .jpg). Use these websites to:

Extract text from these PDFs and images
Import the data from these images and PDFs into an Excel sheet
Compare the results: are some sites better than others at certain kinds of images? How much does image quality matter?
Upload the OCR scans and your (brief) conclusions to Canvas.
Bonus (if time permits)
- Try using Google’s OCR to import into Google Docs (tutorial). Note also Google’s AI OCR.
- Try with your iPhone or Android phone.
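
For the comparison step, one simple way to put a number on OCR quality is to type out a short passage by hand and measure how closely each tool's output matches it. The sketch below does this in Python with the standard library's `difflib`. The tool names and text strings are invented placeholders, and the similarity ratio is only one of several possible measures.

```python
from difflib import SequenceMatcher


def ocr_accuracy(reference, ocr_output):
    """Return a 0-1 similarity between hand-checked text and OCR output."""
    return SequenceMatcher(None, reference, ocr_output).ratio()


# Hand-typed "ground truth" and two invented OCR results.
reference = "Total rainfall: 1204 mm"
outputs = {
    "tool_a": "Total rainfall: 1204 mm",
    "tool_b": "Tota1 rainfa11: l2O4 mm",
}

scores = {name: ocr_accuracy(reference, text) for name, text in outputs.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("closest match:", best)
```

Digits and look-alike characters (1/l, 0/O) are where OCR errors tend to hide, so it is worth checking numbers separately from running text.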

OPTIONAL: Learning R Markdown

What is R Markdown? Short version: it’s a way to convert your R Markdown file (with text and code) into any one of several output formats including: HTML, PDF, MS Word, slides for presentations (Beamer, HTML5 slides), books, dashboards, scientific articles, and websites. You can find the complete overview here. A gallery of the different things you can do with R Markdown can be found here.

How does it work?

“When you knit the document, R Markdown sends the .Rmd file to knitr, http://yihui.name/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, http://pandoc.org/, which is responsible for creating the finished file. The advantage of this two step workflow is that you can create a very wide range of output formats”

1. Tutorials & Resources

HW’s tutorial is here
a PC-specific tutorial
a macOS-specific tutorial
This one is really comprehensive, probably the best of the group.
Actually, I like this one even better
R Markdown Cheat Sheet: Download here
LaTeX Cheat Sheet: Download here from the “Contributed” tab

2. Install the following packages:

bookdown
knitr (an overview is here)

3. If you want to render to PDF, you will also need to install some additional tools:

Pandoc. After installation, you should add the path of the Pandoc executable to the system PATH.
A TeX distribution. On a PC you can use TeX Live or MiKTeX; on a Mac, use MacTeX.
Just in case, here is TeXShop
Note: TeX Live is designed to be cross-platform (running on Unix, macOS, and Windows); MacTeX includes Mac-specific utilities and front-ends (such as TeXShop and BibDesk).
Suggestions for the YAML (including LaTeX packages and commands) to start customizing your PDF: link to sample YAML
If you are writing a scientific paper, the rticles package provides a suite of custom R Markdown templates for different journals. It is really convenient and easy to use.
Even more useful is the papaja package. This provides a template for preparing journal articles in the APA style, and is very customizable (it’s the default I use for most of my papers).
You can easily insert citations in your R Markdown document using the citr package. You can insert citations directly with the RStudio add-in and link to Zotero libraries.
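
If rendering to PDF fails, a frequent cause is that Pandoc or the TeX engine is installed but not on the system PATH (see step 3). A quick way to check is to ask the system where each executable lives. The sketch below does this in Python as an illustration; the function name is invented, and `pdflatex` and `xelatex` are assumed here as typical TeX engine names, so adjust the list to match your installation.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# pandoc converts the knitted markdown; a TeX engine builds the PDF.
status = find_tools(["pandoc", "pdflatex", "xelatex"])
for name, path in status.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Any tool reported as not found needs to be installed or have its folder added to PATH before knitting to PDF. Note that RStudio may ship its own copy of Pandoc, so this check matters most when rendering outside RStudio.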

Last updated on Apr 12, 2024