A project from scratch

Last updated on 20/Apr/2022 26 min read 0 Commentsdata science

When you first learn R, you rarely start by acquiring Best Practices: tips to keep your code organized, reproducible and at least not hopelessly messy. You wouldn’t, because you want to get results… and because Best Practices are solutions for problems that you haven’t encountered yet. It is hard enough to learn how to filter a dataframe in R when you have never encountered a dataframe that matters to you, let alone wanted to filter it. Learning the most appropriate folder structure to store the dataframe and your analysis does not really stand out as an interesting goal until you have so many dataframes you care about, so many questions to answer with code, that it overflows all logic and you can’t just keep track of it in a post-it any more. But by then you are not taking courses on R any more, and unless you happen to run into a colleague and the topic comes up of how much time you spent this morning trying to find the function that created the best plot you ever made, or unless you follow the right people on Twitter and they mention cool stuff out of the blue… how will you learn those things? Even if you find out about amazing solutions to these problems, their implementation is not obvious enough. You have a cup of tea with a lump of sugar at the bottom, you find a spoon, but it doesn’t immediately occur to you that the spoon is the answer to your problem, because the documentation of the spoon incorrectly assumes you already know what stirring is. Or you find a video of somebody stirring their tea with the spoon, but you can’t figure out how they got the spoon inside the cup in the first place. What is worse: those who know how to use spoons have forgotten what it means to be unaware of them.

I would like to address those feelings with this post. I recently found out about two R-project tools that blew my mind, and I want to start implementing them in my workflow. I also thought it could be practical to show how you can start up a project like this from scratch, maybe learning some Best Practices along the way. You don’t need to know all the tips at the beginning: you learn them best by messing up, and as long as things work, it’s fine. If I write this properly, after this post you will be able to create an R project and GitHub repository using the packages here, targets and renv, write a brief R Markdown report with some analysis of a (clean) dataset, and leave with the confidence that:

you know what your project contains;
the report matches your data and analysis;
anyone with access to your project can reproduce it;
you can rerun everything you need with one line of code.

The output of this workflow can be found in its GitHub repository, but the idea is to give you the instructions to start it from scratch. It is a looong post with a lot of detail, so sit back and relax, take your time and enjoy the ride. Let’s go!

Initialize a project

Some time ago I learned about RStudio projects. I don’t remember when, and I don’t remember the foggy, chaotic times before. Now and then I imagine what my life would be like if I worked on R without projects and chills go up my spine.

RStudio projects support a project-oriented workflow, where you have one folder for each different project you work on and the paths called within the project are always relative to the root of the project. Suppose, for example, that your project folder looks like this:

|-- data/
    |-- dataset.csv
|-- scripts/
    |-- code.R
|-- report.Rmd
|-- report.html

If you run code from scripts/code.R, you might want to find your dataset as ../data/dataset.csv, whereas from report.Rmd you would need to read data/dataset.csv. But if you call scripts/code.R from report.Rmd, the path within scripts/code.R has to become data/dataset.csv! And if one day you decide to move scripts/code.R to a different folder, you will need to update the path to the dataset, even if the dataset itself didn’t move! You might be tempted to use absolute paths, like C:/Users/username/projects/thisparticularproject/data/dataset.csv, but then when you change computers it doesn’t work any more. A reproducible project should be able to change computers, at least.

A cartoon showing two paths side-by-side. On the left is a scary spooky forest, with spiderwebs and gnarled trees, with file paths written on the branches like “~/mmm/nope.csv” and “setwd(“/haha/good/luck/”), with a scared looking cute fuzzy monster running out of it. On the right is a bright, colorful path with flowers, rainbow and sunshine, with signs saying “here!” and “it’s all right here!” A monster facing away from us in a backpack and walking stick is looking toward the right path. Stylized text reads “here: find your path.” — Artwork by @allison_horst

With the here package, the here::here() function returns the absolute path of the project (in this case C:/Users/username/projects/thisparticularproject); if you move the project somewhere else, to another directory or computer, it will return whatever the absolute path turns out to be. Then, here::here('data') will return the path to the data folder wherever you call it from within the project. Therefore, either from code.R, report.Rmd or wherever else in the project, here::here('data', 'dataset.csv') will always point to data/dataset.csv. Neat, right?

Organizing your work in projects also means that different studies, analyses, explorations are kept separate, and can even be run simultaneously without intervening with each other. For example, I can write this blog post on my blogdown project and create and work on a new, different project at the same time, each with their own RStudio window, their own log and data on memory, etc. While writing this I can go to the top right corner of my RStudio window, click on the triangle next to the name of my current project, select “New Project”, choose to create a normal project¹ and set up the main points: the name of the folder and its location (Figure 1). For about an year now I started making my projects as git repositories as well and storing them like all other git repositories, in a folder called repos which has a subfolder per username and then their repositories inside. For example, repos/montesmariana has all my git repositories, be they R Projects or not (e.g. python code); repos/another-user/repo-I-cloned stores some git repository by another-user that I cloned for their code or data, to collaborate with them, etc.² The project creation wizard has the options to set up the project immediately as a git repository and to use renv; both can be applied post-hoc³, but let’s go ahead and select them. I’m showing how to do this from scratch, after all.

Three screenshots of the steps to create an RStudio project, slightly overlapping. The first shows a dropdown menu with the option to create an RStudio project. The second shows a window with options of the kinds of project to initialize, and the basic project indicated with a green rectangle. The third one shows a dialog window that lets you name your project and select an initialization configuration.

Figure 1: Screenshots of the first steps creating an R Studio project.

I’ve already talked about using git elsewhere, but what about renv? I actually learned about it recently and have not used it yet, but it appears to solve an issue I have had to work around before: package versioning for my projects. R 4.1 came out while I was writing my dissertation, and I didn’t dare install it until it was finished, lest it make anything crash in my fragile bookdown project. If I had known what renv was, I could have kept my project isolated while using R 4.1 and the updated packages in other projects. In other contexts, renv will help me share projects with the precise information of the R and R package versions used to run everything and avoid conflicts that could arise between different versions of a function. To learn more about renv you may go to the documentation.

We have now initialized a project. The file structure can be found in the Files panel - it only has configuration files that characterize an R project (.Rprofile, .Rproj.user/ and projectname.Rproj) with renv (renv/ and renv.lock) and a git repository (.git/ and .gitignore). Note that .gitignore files tell git what not to keep track of; an R Project with git will already write a list of file extensions that should be ignored, but renv also creates a renv/.gitignore with a list of directories within renv that git should ignore.

The dataset

Since this project is about data analysis, we first need to acquire some data. I wanted to give a linguistics example, ideally not in English (there is enough of that already), so I chose the doenLaten dataset of Natalia Levshina’s Rling package, companion of her wonderful How to do linguistics with R (Levshina 2015). To install the package you can follow the instructions on the README file, and then the dataset is available with data():

install.packages('path/to/file/Rling_1.0.tar.gz', # use your own path
                 type = 'source', repos = NULL)
data(doenLaten, package = 'Rling')

Since we are still learning, we’ll open an empty R script, write the code in it and run it. Later, the install.packages() call should be removed or at least kept separate from analysis scripts. As shown below, head(doenLaten) shows us the first rows of the dataframe.

head(doenLaten)

##     Aux Country  Causation EPTrans EPTrans1
## 1 laten      NL   Inducive    Intr     Intr
## 2 laten      NL   Physical    Intr     Intr
## 3 laten      NL   Inducive      Tr       Tr
## 4  doen      BE  Affective    Intr     Intr
## 5 laten      NL   Inducive      Tr       Tr
## 6 laten      NL Volitional    Intr     Intr

Levshina introduces this dataset in chapter 12 of her book, to illustrate logistic regression; you can look at the documentation with help(doenLaten). Each of the 455 rows of this dataframe represents an observation of the Dutch Causative Construction (in English it would be “I made you do this”) with 5 variables:

Aux, specifying whether the constructions uses the verb doen or laten where English would use make;
Country, with level NL for data from the Netherlands and BE for Belgian data;
Causation, with four different kinds of causation depending on whether the Causee and/or the Causer are mental entities or not (I won’t get into that);
EPTrans and EPTrans1, which specify whether the Effected Predicate (e.g. do in I made you do this) is transitive (i.e. it carries a direct object) or not.

The idea is to predict whether the observation contains doen on laten based on the other predictors; see Levshina (2015, 255–56) and her own PhD dissertation (Levshina 2011) for an explanation.

To get an idea of the data, we could run a few exploratory functions. For now, we can still type them in our temporary R script where we loaded the data: head(doenLaten), str(doenLaten), summary(doenLaten$Aux) to get an idea of the frequencies of the response variable. A fancy way is with skmir::skim() (but note that if you do this in your new project you will need to install skimr):

skimr::skim(doenLaten)

Table 1: Data summary
Name	doenLaten
Number of rows	455
Number of columns	5
_______________________
Column type frequency:
factor	5
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Aux	1	FALSE	2	lat: 277, doe: 178
Country	1	FALSE	2	NL: 233, BE: 222
Causation	1	FALSE	4	Ind: 185, Vol: 117, Aff: 90, Phy: 63
EPTrans	1	FALSE	2	Int: 281, Tr: 174
EPTrans1	1	FALSE	2	Int: 276, Tr: 179

In order to make this more comparable to a typical analysis, where you would have your own copy of the data, we’ll have to save the dataset as a file in our project folder. This invites us to start loading the tidyverse package (or at least readr) but… it does not exist! If you try library(tidyverse) like you have many times before in your computer you might be confused by its sudden absence. But remember that we set up renv in this project, which means that it only uses the project-specific library: you have to install all the packages that you want to use again (technically, make a local copy if you have them installed already). Our next steps are then the following:

Install tidyverse (which we’ll use later as well) with install.packages('tidyverse');
Save the dataset as a csv file with readr::write_csv(doenLaten, 'data.csv')

Our current temporary R script will look like this:

# Untitled.R

# install.packages('../../levshina/Rling_1.0.tar.gz', 
#                  type = 'source', repos = NULL)
data(doenLaten, package = 'Rling')
head(doenLaten)

install.packages('tidyverse')
readr::write_csv(doenLaten, 'data.csv')

After installing everything, we should restart R. To inspire you to even better practices, I will pretend I always save my temporary R scripts and save the file as temp.R (hopefully we don’t have to run it again).

The analysis

In Chapter 12 of How to do linguistics with R, Levshina uses this dataset to illustrate logistic regression. She shows how to use rms::lrm() and stats::glm() with a focus on the former because of its detailed output. Here I will only use glm() because, while it’s harder to obtain the $C$ value and similar statistics, I know how to get the data for a prettier table that we’ll report at the end of the post. I will also add some rough explanation of how to interpret the output, strongly informed by Levshina (2015), but I won’t delve into the linguistic consequences —this is just so that if you, reader, are not familiar with logistic regression, your head doesn’t blow up in so many pieces.

First we will create a new R script to explore the output of the analysis. Writing code in an R script rather than on the console is a good practice, because you WILL want to run things again at some point, and it’s easier to find that code in a file than in the console. (I often start testing ideas in the console and it’s the equivalent of whisking egg whites with a fork because the whisk is in a different room: wrongly placed laziness.)

# analysis.R
library(tidyverse)

doenLaten <- read_csv('data.csv', show_col_types = F)

If we use doenLaten from the Rling package directly, it is a dataframe where each column is a factor, sometimes with a predefine reference level. However, if we use the dataset as we stored it, readr::read_csv() will automatically read the columns as character vectors. This brings us to another step that we will often have in data analysis between opening our file and actually modelling: cleaning the data. In this case, we will turn all character columns into factors and manually change the reference values of the response and the Country predictor to match the original format of Levshina’s dataset:

doenLaten <- doenLaten %>% 
  mutate(across(where(is.character), factor), # change all character vectors into factors
         Aux = fct_relevel(Aux, 'laten'), # set up 'laten' as the first level of Aux
         Country = fct_relevel(Country, 'NL') # set up 'NL' as the first level of Country
         )

For the modelling we’ll use stats::glm(), which takes a formula Response ~ Predictor1 + Predictor2... and a dataset. Saving the output in a variable allows us to explore it in different ways without having to rerun the modelling every time.

m <- glm(Aux ~ Causation + EPTrans + Country, data = doenLaten, family = binomial)
summary(m)

## 
## Call:
## glm(formula = Aux ~ Causation + EPTrans + Country, family = binomial, 
##     data = doenLaten)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4837  -0.5344  -0.3428   0.3838   2.5340  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.8631     0.3771   4.941 7.79e-07 ***
## CausationInducive    -3.3725     0.3741  -9.015  < 2e-16 ***
## CausationPhysical     0.4661     0.6275   0.743 0.457575    
## CausationVolitional  -3.7373     0.4278  -8.735  < 2e-16 ***
## EPTransTr            -1.2952     0.3394  -3.816 0.000136 ***
## CountryBE             0.7085     0.2841   2.494 0.012633 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 609.05  on 454  degrees of freedom
## Residual deviance: 337.70  on 449  degrees of freedom
## AIC: 349.7
## 
## Number of Fisher Scoring iterations: 5

The algorithm compares the second level of the response variable (in this case doen) to the first, aka reference level (here laten); the values under the Estimate column correspond to the log odds of finding doen rather than laten. In particular, the value of the Estimate for the Intercept corresponds to the log odds of finding doen rather than laten when all predictors are at their reference level (the first level of the factor), that is, when Causation is affective, the effective predicate is intransitive and the language variety is Netherlandic Dutch. Positive log odds indicate that doen is more likely, whereas negative log odds indicate that laten is more likely.

doenLaten %>% 
  filter(Causation == 'Affective', EPTrans == 'Intr', Country == 'NL') %>% 
  count(Aux)

##     Aux  n
## 1 laten  3
## 2  doen 20

exp(m$coefficients[['(Intercept)']])

## [1] 6.443584

When it comes to the predictors, the Estimate refers to the log odds ratio of doen for a given value of the variable (e.g. Tr for EPTrans) compared to the reference value (Intr for EPTrans). Negative values indicate that doen is less likely than the reference response value laten; the simple odds ratio of doen against laten when the Effected Predicate is Transitive, all other variables being controlled for, is then 0.274.

Other important information is, of course, the p-value, indicating whether the Estimate is significant (i.e. that it wouldn’t be due to chance), and measures of precision, explained variation, etc. For glm() output, we can compute the $C$ value with Hmisc::somers2(), which takes the values predicted by the model and the actual output as 1s and 0s and returns the $C$ value along with other metrics.

Before running this, however, we have to go back to temp.R and run install.packages('Hmisc'). And even before that, we might want to use renv to take a snapshot of the package versions that we are using right now, since we know they work. That way, if we install Hmisc or other packages and something breaks, we can go back to this point with renv::restore(). So:

In temp.R, run renv::snapshot(), accept to save the data.
In temp.R, run install.packages('Hmisc').
Add library(Hmisc) to the top of analysis.R.
Get your $C$ !

# analysis.R
somers2(fitted(m), as.numeric(doenLaten$Aux)-1)[['C']]

## [1] 0.8936539

For a proper logistic regression analysis I strongly recommend reading Levshina’s chapter and other appropriate sources; for the purposes of this excessively detailed post, I will stop here. Right now, our analysis.R file looks like this:

# analysis.R
library(tidyverse)
library(Hmisc)

doenLaten <- read_csv('data.csv', show_col_types = F)
doenLaten <- doenLaten %>% 
  mutate(across(where(is.character), factor),
         Aux = fct_relevel(Aux, 'laten'),
         Country = fct_relevel(Country, 'NL'))

m <- glm(Aux ~ Causation + EPTrans + Country, data = doenLaten,
         family = binomial)
summary(m)
somers2(fitted(m), as.numeric(doenLaten$Aux)-1)[['C']]

Basic report

A script like this is… meh. You got your exploration, you started, more or less. But you wouldn’t send this as a report. How will you share this? How will this become part of your reporting? (Also: will your future self, a month from now, quickly understand what is going on?)

We will therefore create a brief and simple R Markdown report. Before, however, we will package everything we did in simple functions that can be easily called; I wrote elsewhere about my approach to functions, but the dedicated chapter in the targets manual is also interesting in relation to the logic of targets, which we’ll look into shortly.

Everything into functions

Each step of your analysis.R file will be surrounded by a function call, focusing on having a specific input and returning a specific output without side effects (such as storing a file). I must admit that I am normally way messier than this, but targets has inspired me:

# analysis.R
library(tidyverse)
library(Hmisc)

data_file <- 'data.csv'
read_data <- function(data_file) {
  read_csv(data_file, show_col_types = F)
}
refactor_data <- function(raw_data) {
  raw_data %>% mutate(across(where(is.character), factor),
         Aux = fct_relevel(Aux, 'laten'),
         Country = fct_relevel(Country, 'NL'))
}
model <- function(mydata) {
  glm(Aux ~ Causation + EPTrans + Country, data = mydata,
         family = binomial)
}

get_c <- function(model, mydata) {
  somers2(fitted(model), as.numeric(mydata$Aux)-1)[['C']]  
}

Now if you run analysis.R it’s not going to return anything, do anything that counts, but just load the libraries and declare the functions that you need. But you could still run everything with the following code:

data <- read_data(data_file) %>% refactor_data()
m <- model(data)
summary(m)
get_c(m, data)

When we write our R Markdown, we can now call the functions where we want the output.

Reporting regression results

We’ll start creating an R Markdown document by selecting the appropriate option in the top left corner and setting up the basic data. In the meantime you will also have to accept installing rmarkdown, of course. I titled my file “Report” and set it to update the date every time I knit it (Figure 2).

Two screenshots of the steps to create an R Markdown file, slightly overlapping. The first shows a dropdown menu with the option to create the file. The second shows a dialog window with that lets you name your file and select an initialization configuration.

Figure 2: Screenshots of the first steps creating an R Markdown file.

I also removed all the content of the resulting file except for the metadata and the first chunk, resulting in the following content:

	---
	title: "Report"
	author: "Mariana Montes"
	date: "`r Sys.Date()`"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	```

view raw basic-rmd.Rmd hosted with ❤ by GitHub

I saved the file as report.Rmd. In it I wrote a very brief report loading the data and show the output of the regression model and the C value. The R Markdown raw file is shown below; the HTML output is shown in Figure 3.

	---
	title: "Report"
	author: "Mariana Montes"
	date: "`r Sys.Date()`"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = FALSE) # do not show code
	source('analysis.R') # obtain code from analysis.R
	raw_data <- read_data(data_file) %>% refactor_data()
	```

	In this report we will analyse Levshina's doen-laten dataset,
	which as `r nrow(raw_data)` rows and `r length(raw_data)` columns.
	We've run logistic regression to predict the odds of doen against laten based on the different predictors.

	```{r, echo=FALSE}
	m <- model(raw_data)
	summary(m)
	```

	The $C$ value of the logistic regression model is `r sprintf("%.3f", get_c(m, raw_data))`.

view raw minimal-report.Rmd hosted with ❤ by GitHub

HTML rendering of the first version of the R Markdown report.

Figure 3: Screenshot of the first HTML report.

It’s a very ugly report, but it is a report and it shows what we have done. Before we move on to prettifying it or adding complexity, let’s take a look at targets!

Oh, wait, before that, we should take advantage of our version control⁴: we are using git! We should save what we have done and commit it! In the Terminal (next to the console) just type git commit -am "first report"!

Creating a workflow with `targets`

The targets package allows you to keep track of your workflow and of whether your output is up to date with each step. Crucially, it caches your results so that, if you modify part of your workflow, you only need to rerun that part and whatever depends on it. If you change the predictors in your model, you don’t need to change the variables from character vectors to factors again!

To set up targets, we need to install the targets package (as before: do it in temp.R, and if you want, run renv::snapshot() again), which will create a file called _targets.R:

	library(targets)
	# This is an example _targets.R file. Every
	# {targets} pipeline needs one.
	# Use tar_script() to create _targets.R and tar_edit()
	# to open it again for editing.
	# Then, run tar_make() to run the pipeline
	# and tar_read(summary) to view the results.

	# Define custom functions and other global objects.
	# This is where you write source(\"R/functions.R\")
	# if you keep your functions in external scripts.
	summ <- function(dataset) {
	summarize(dataset, mean_x = mean(x))
	}

	# Set target-specific options such as packages.
	tar_option_set(packages = "dplyr")

	# End this file with a list of target objects.
	list(
	tar_target(data, data.frame(x = sample.int(100), y = sample.int(100))),
	tar_target(summary, summ(data)) # Call your custom functions as needed.
	)

view raw _targets.R hosted with ❤ by GitHub

The template has basic instructions and an example to get you going. You can read more in the manual. How do we adapt it? Our personal _targets.R needs to have the following elements:

A call to library(targets), as in the template.
A source() call to our file with code: source('analysis.R') in this case, but we might want to create an R folder to add such a script. You may not see the need now, but it will be useful for growing projects with more code.
A call to tar_option_set() indicating the packages we need to load. This replaces the calls to library() at the beginning of analysis.R, so we’ll need to remove them from there.
A list of target objects, matching files and functions to variables and setting up a pipeline. In our case:
1. We can match the file path to data_file. This means that targets will watch that file and rerun everything depending on it if it changes.
2. We can match the function that reads the file to raw_data, and then the function that changes the column types to data.
3. We can match m to the function that runs the model and C to the function that obtains the c value.

Notice that I don’t use the same names, and that I try to use verbs for the functions (defined in analysis.R) and nouns or letters for my objects. These changes result in the following code for _targets.R:

	library(targets)

	source('analysis.R')
	tar_option_set(packages = c('tidyverse','Hmisc'))

	list(
	tar_target(
	data_file,
	'data.csv',
	format = 'file'
	),
	tar_target(
	raw_data,
	read_data(data_file)
	),
	tar_target(
	data,
	refactor_data(raw_data)
	),
	tar_target(
	m,
	model(data)
	),
	tar_target(
	C,
	get_c(m, data)
	)
	)

view raw _targets.R hosted with ❤ by GitHub

And we have changed analysis.R like so:

	read_data <- function(data_file) {
	read_csv(data_file, show_col_types = F)
	}
	refactor_data <- function(raw_data) {
	raw_data %>% mutate(across(where(is.character), factor),
	Aux = fct_relevel(Aux, 'laten'),
	Country = fct_relevel(Country, 'NL'))
	}
	model <- function(mydata) {
	glm(Aux ~ Causation + EPTrans + Country, data = mydata,
	family = binomial)
	}

	get_c <- function(model, mydata) {
	somers2(fitted(model), as.numeric(mydata$Aux)-1)[['C']]
	}

view raw analysis.R hosted with ❤ by GitHub

We can now visualize the pipeline with targets::tar_visnetwork() (which, btw, requires us to install the visNetwork package, but it’s worth it!). The different objects are shown in Figure 4 as circles, functions are shown as triangles, the edges of the network show the dependency and the colors indicate whether the objects and the output are up to date or not. Right now, since we haven’t run anything from the workflow, everything is outdated.

Linear network of blue circles and triangles representing the current pipeline.

Figure 4: Network representation of the initial pipeline.

If we run targets::tar_make(), each step of the pipeline in the _targets.R file is run and the output is stored in the _targets directory. The output shows that everything was run and how long it took.

targets::tar_make()
#> * start target data_file
#> * built target data_file
#> * start target raw_data
#> * built target raw_data
#> * start target data
#> * built target data
#> * start target m
#> * built target m
#> * start target C
#> * built target C
#> * end pipeline: 2.55 seconds

We can then access the different objects with targets::tar_read() or load them into the current environment with targets::tar_load(). Running targets::tar_visnetwork() again shows that everything is now up to date.

Figure 5: Network representation of the initial pipeline after running targets::tar_make().

Reports with `targets`

While we have a targets pipeline set up, our report is still isolated from it. We can connect our R markdown report with the pipeline by adding it to the list in _targets.R with tarchetypes::tar_render(), and then calling the different objects and functions with targets::tar_load() and targets::tar_read():

In temp.R, run install.packages('tarchetypes').
In _targets.R, add to the top library(tarchetypes).
In _targets.R, add to the bottom of the list of target objects tar_render(report, 'report.Rmd') (as long as your report file is also called ‘report.Rmd’).
In report.Rmd, remove source('analysis.R'), add library(targets) instead and replace the calls to objects and functions with the appropriate targets functions:

	---
	title: "Report"
	author: "Mariana Montes"
	date: "`r Sys.Date()`"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = FALSE) # do not show code
	library(targets)
	tar_load(data)
	```

	In this report we will analyse Levshina's doen-laten dataset,
	which as `r nrow(data)` rows and `r length(data)` columns.
	We've run logistic regression to predict the odds of doen against laten based on the different predictors.
	We are now running everything with `targets`!.

	```{r, echo=FALSE}
	summary(tar_read(m))
	```

	The $C$ value of the logistic regression model is `r sprintf("%.3f", tar_read(C))`.

view raw report.R hosted with ❤ by GitHub

HTML rendering of the first version of the R Markdown report, with a new line indicating it's linked to the targets package.

Figure 6: Screenshot of the HTML report after linking it to targets.

The network visualization in Figure 7 now includes the R Markdown file.

Linear network of blue and black circles and triangles representing the current pipeline.

Figure 7: Network representation of the initial pipeline after linking the R Markdown report.

Note that you can knit it independently (and, as shown in Figure 6, it looks just like before!) but it won’t show as up to date unless you run targets::tar_make(). At this point, we should do it, take a renv::snapshot() and, in the Terminal, git commit -am "add targets". When you run targets::tar_make(), note that it will skip all the steps that are up to date, and only run the report! This is particularly useful if what you ran before requires some intensive computation.

What if you change something?

About a month ago I saw a presentation at the German Cognitive Linguistics Conference in which a researcher⁵ showed the output from a regression model in a very beautiful, intuitive table. As someone who does not normally run regressions, I found it incredibly helpful to figure out how to interpret the data. Below you can see a function, prettify_model(), that I wrote to turn the output of glm() into a table strongly inspired by that presentation. I won’t explain the code here, but you can contact me if you are interested — the gist of it is that it takes the output of glm() and your dataframe as input, and returns an Rmarkdown-ready table.

	prettify_model <- function(model, mydata) {
	variables <- map_df(colnames(mydata), function(v) {
	if (class(mydata[[v]]) == 'factor') {
	values <- names(table(mydata[[v]]))
	reference <- levels(mydata[[v]])[[1]]
	} else {
	values <- ''
	reference <- ''
	}
	tibble(varname = v, varvalues = values, reference = reference, class = class(mydata[[v]]))
	}) %>%
	filter(class == 'numeric' \| varvalues != reference) %>%
	mutate(varcode = paste0(varname, varvalues)) %>%
	filter(varcode %in% colnames(model.matrix(model))) %>%
	mutate(varparse = if_else(class == 'factor', sprintf("<strong>%s</strong> vs. %s", varvalues, reference), ''))

	response <- all.vars(formula(model))[[1]]
	predicted_response <- levels(mydata[[response]])[[2]]

	as_tibble(data.frame(summary(model)$coefficients), rownames = "Variables") %>%
	left_join(variables, by = c('Variables' = 'varcode')) %>%
	rename(Significance = `Pr...z..`) %>%
	mutate(varparse = if_else(Variables == '(Intercept)', 'Intercept', varparse),
	varname = if_else(Variables == '(Intercept)', '', varname),
	isSignificant = Significance < 0.05,
	Odds = exp(Estimate),
	across(c('Significance', 'Estimate', 'Odds'), ~cell_spec(sprintf("%.3f", .x), bold = isSignificant)),
	varname = text_spec(varname, extra_css = "font-variant:small-caps;")
	) %>%
	select(
	` ` = varname,
	`Independent variables` = varparse,
	Estimate,
	Significance,
	`Odds Ratio` = Odds,
	) %>%
	kable(escape = F, booktabs = T, caption = sprintf("Logistic regression model predicting %s.", predicted_response)) %>%
	kable_paper() %>%
	collapse_rows(columns = 1, valign = 'top')

	}

view raw prettify_model.R hosted with ❤ by GitHub

This is an excellent opportunity to learn how to integrate any new addition — modifying existing code, adding functions or data… — to your existing targets pipeline:

In temp.R, install missing packages if needed. In our case, we need kableExtra. Because we actually need the development version of kableExtra⁶, we will install it with devtools::install_github(), which means we need to install devtools first. So:
1. Run install.packages('devtools')
2. Run devtools::install_github('haozhu233/kableExtra')
3. If you’re not interested in devtools any more, you can finish off with renv::remove('devtools') to remove it from your project-specific library.
In analysis.R, add the code for prettify_model(). Of course, I would suggest testing it, running the function definition and typing in the console:

library(tidyverse)
library(kableExtra)
library(targets)
tar_load(m)
tar_load(data)
prettify_model(m, data)

Instead of adding library(kableExtra) to the top of analysis.R, we add ‘kableExtra’ to the vector of packages in the tar_option_set() call in _targets.R. That line will then become tar_option_set(packages = c('tidyverse', 'Hmisc', 'kableExtra')).
In _targets.R, we add the following targets object before the tar_render() call:

# tar_target(C, get_c(m, data)),
tar_target(
    pretty_table,
    prettify_model(m, data)
  )
#, tar_render(report, 'report.Rmd')

In report.Rmd, we replace summary(tar_read(m)) with tar_read(pretty_table).

In essence, we have added the output of prettify_model(m, data) to the pipeline, naming it pretty_table, and then called it from the report. After saving all the files, targets::tar_visnetwork() shows us a nice increase in complexity (see Figure 8). We have a new triangle representing prettify_model(), which is only linked to the new circle pretty_table; this circle depends on m and data while report depends on it. Only these new elements and report are now outdated, so a call to targets::tar_make() will skip refining the data and modelling and only focus on the table and report.

Figure 8: Network representation of the initial pipeline after adding prettify_table().

targets::tar_make()
#> v skip target data_file
#> v skip target raw_data
#> v skip target data
#> v skip target m
#> v skip target C
#> * start target pretty_table
#> * built target pretty_table
#> * start target report
#> * built target report
#> * end pipeline: 4.48 seconds

Our report is now a bit prettier than it used to be (see Figure 9). The output of the regression model is now rendered as an elegant table highlighting the Estimate (log odds ratio), the Significance (p-value) and the Odds Ratio, which was computed within prettify_table(). The rows with a p-value lower than 0.05 are in bold. The caption specifies which value is being predicted and the predictors also state clearly what the reference value is in each case, collapsing the rows that correspond to the same predictor.

HTML rendering of the first version of the R Markdown report, with a pretty rendering of the regreesion results.

Figure 9: Screenshot of the HTML report with a prettier table for the regression results.

Conclusion

Now we can take another renv::snapshot() and, in the Terminal, git commit -am "prettify table". We could even connect the repository to GitHub and push it online. We already have a basic workflow to keep adding functions and objects to. We could add more analyses and reporting, plots. We could do whatever we want!

This post does not cover everything that would constitute a “good practice”. It does not tell you how to name files, to have an R folder for your code, a data folder for your data, a reports folder for your reports… I do think these are useful tips, but it’s hard to integrate them into your workflow until your projects are not complex enough to require them. And while they are good habits to incorporate, I felt it would be interesting to show how to start a project, rather messily, as if you didn’t really know what you are going to do (even which packages you will need), because that is often the case. Actual data science includes a LOT of data cleaning and exploration, testing out different ways of looking at the data, finding the best way to report it (see this post for example). I tried to strike a balance between honest messiness and useful neatness. Hopefully you can follow this and feel confident of your tools to set up a self-contained, reproducible data analysis project, a neat scaffolding for the messes to come.

References

Levshina, Natalia. 2011. “Doe Wat Je Niet Laten Kan: A Usage-Based Analysis of Dutch Causative Constructions.” PhD thesis, Leuven: KU Leuven. https://lirias.kuleuven.be/retrieve/172344.

———. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.

It could also be a blogdown project, like this blog; a bookdown project, an R package…↩︎
I must give credit to Miguel Montes for the idea; I didn’t really see the point the first time he told me about it but I followed his advice and it has really paid off.↩︎
If you want to learn how to set up git, especially if you have never used git and don’t have a GitHub account (I won’t cover that here), I recommend checking out Jenny Bryan et al.’s Happy Git with R. renv can be initialized in an open project with renv::init().↩︎
This could have been done multiple times before: when first setting up the project, when saving the data, with the first analysis… But I honestly forgot about it, and maybe by admitting it you can feel good about forgetting as well, and take it easy in the long road of incorporating git to your workflow.↩︎
Alicja Piotrowska presenting Genitive variation in Danish and Swedish – the iconic motivation.↩︎
For some reason kableExtra::collapse_rows() is not working in the CRAN version at the moment.↩︎

R R Markdown

A project from scratch

Initialize a project

The dataset

The analysis

Basic report

Everything into functions

Reporting regression results

Creating a workflow with `targets`

Reports with `targets`

What if you change something?

Conclusion

References

Mariana Montes

Doctor in Linguistics

Related

A project from scratch

Initialize a project

The dataset

The analysis

Basic report

Everything into functions

Reporting regression results

Creating a workflow with targets

Reports with targets

What if you change something?

Conclusion

References

Mariana Montes

Doctor in Linguistics

Related

Creating a workflow with `targets`

Reports with `targets`