Functions Explained (self study)

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I write a new function in R?

How do we write functions that work with dplyr?

Objectives

Explain why we should divide programs into small, single-purpose functions.

Define a function that takes arguments.

Return a value from a function.

Understand the benefits and drawbacks of non standard evaluation.

This episode will form part of the forthcoming “Programming in R” course. It is included in this course for self study. If you are interested in taking part in a pilot of the course (provisionally scheduled for March/April 2018), please contact david.mawdsley@manchester.ac.uk.

If we only had one data set to analyse, it would probably be faster to load the file into a spreadsheet and use that to plot simple statistics. However, the gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis again. We may also obtain similar data from a different source in the future.

In this episode, we’ll learn how to write a function so that we can repeat several operations with a single command.

What is a function?

Functions gather a sequence of operations into a whole, preserving it for ongoing use. Functions provide:

a name we can remember and invoke it by

relief from the need to remember the individual operations

a defined set of inputs and expected outputs

rich connections to the larger programming environment

As the basic building block of most programming languages, user-defined functions constitute “programming” as much as any single abstraction can. If you have written a function, you are a computer programmer.

Defining a function

Let’s create a new R script file in the src/ directory and call it functions-lesson.R:

my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}

We can use the function in the same way as any other functions we’ve used so far in this course:

my_sum(2,3)

[1] 5

my_sum(10,10)

[1] 20

Now let’s define a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

We define fahr_to_kelvin by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

When we call the function, the values we pass to it as arguments are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Tip

In R, the return statement is not required. R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will explicitly define the return statement.
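The two styles can be compared side by side. This is a minimal sketch; the names fahr_to_kelvin_explicit and fahr_to_kelvin_implicit are made up for the comparison and are not used elsewhere in this episode.

```r
# Explicit return(), the style used throughout this episode
fahr_to_kelvin_explicit <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# No return(): the value of the last evaluated expression is returned
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Both calls give the same result
fahr_to_kelvin_explicit(32)
fahr_to_kelvin_implicit(32)
```

Both versions behave identically; the explicit form simply makes it easier to spot what the function hands back.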

Let’s try running our function. Calling our own function is no different from calling any other function:

# freezing point of water
fahr_to_kelvin(32)

[1] 273.15

# boiling point of water
fahr_to_kelvin(212)

[1] 373.15

Challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that temperature in Celsius

Hint: To convert from Kelvin to Celsius you subtract 273.15
Solution to challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that temperature in Celsius
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

Combining functions

The real power of functions comes from mixing, matching and combining them into ever-larger chunks to get the effect we want.

Challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by using the fahr_to_kelvin function defined earlier, and the kelvin_to_celsius function you wrote in challenge 1.
Solution to challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions above
fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}
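The same combination can also be written by nesting one call inside the other, without the intermediate variable. The sketch below repeats the two earlier definitions so that it runs on its own; fahr_to_celsius_nested is a made-up name used only for this illustration.

```r
# Definitions repeated from earlier in the episode
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

# The inner call runs first; its result is passed to the outer call
fahr_to_celsius_nested <- function(temp) {
  return(kelvin_to_celsius(fahr_to_kelvin(temp)))
}

# freezing point of water
fahr_to_celsius_nested(32)
```

Whether to nest calls or use a named intermediate value is largely a matter of readability; the step-by-step version is often easier to debug.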

Tip: Function scope

Another important concept is scoping: any variables (or functions!) you create or modify inside the body of a function only exist for the lifetime of the function’s execution. When we call fahr_to_kelvin, the variables temp and kelvin only exist inside the body of the function. Even if we have variables of the same name in our interactive R session, they are not modified in any way when executing a function.
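A short sketch of scoping, using the fahr_to_kelvin function from earlier in the episode. The global variable kelvin and its value are made up purely to show that the function does not touch it.

```r
# A variable in our session that happens to share a name with one
# used inside the function
kelvin <- "a value in the global environment"

fahr_to_kelvin <- function(temp) {
  # This kelvin is local: it exists only while the function runs
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

result <- fahr_to_kelvin(32)

# The global kelvin is unchanged by the call
print(kelvin)
print(result)
```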

Using functions

If you’ve been writing these functions down in a separate R script, you can load the functions into your R session by using the source function:

source("src/functions-lesson.R")

It is a good idea to separate your analysis code and the functions it calls into separate files. You can then source() your functions file at the start of your analysis script.
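As a rough sketch of that layout, the example below writes a small functions file and then sources it, as an analysis script would do at its start. A temporary file stands in for src/functions-lesson.R so that the example runs anywhere; in a real project you would source the file by its path instead.

```r
# Stand-in for src/functions-lesson.R: a file containing only function definitions
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "my_sum <- function(a, b) {",
  "  the_sum <- a + b",
  "  return(the_sum)",
  "}"
), functions_file)

# At the top of the analysis script: load the function definitions
source(functions_file)

# The rest of the analysis script can now call the functions
total <- my_sum(2, 3)
print(total)
```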

Programming with dplyr

A lot of the “cleverness” of the tidyverse comes from how it handles the values we pass to its functions. This lets us write code in a very expressive way (for example, the idea of treating the dplyr pipelines as a sentence consisting of a series of verbs, with the %>% operator being read as “and then”).

Unfortunately this comes at a price when we come to use dplyr in our own functions. The dplyr functions use what is known as Non Standard Evaluation; this means that they don’t evaluate their parameters in the standard way that the examples in this episode have used so far. This is probably the most complicated concept we will cover today, so please don’t worry if you don’t “get” it at first attempt.

We’ll illustrate this by way of an example.

Let’s write a function that will filter our gapminder data by year and calculate the GDP of all the countries that are returned. First, let’s think about how we’d do this without using a function. We’ll then put our code in a function; this will let us easily repeat the calculation for other years, without re-writing the code.

This saves us effort, but more importantly it means that the code for calculating the GDP only exists in a single place. If we find we’ve used the wrong formula to calculate GDP we would only need to change it in a single place, rather than hunting down all the places we’ve calculated it. This important idea is known as “Don’t Repeat Yourself”.

We can filter and calculate the GDP as follows:

gdpgapfiltered <- gapminder %>%
    filter(year == 2007) %>% 
    mutate(gdp = gdpPercap * pop)

Note that we filter by year first, and then calculate the GDP. Although we could reverse the order of the filter() and mutate() functions, it is more efficient to filter the data and then calculate the GDP on the remaining data.

Let’s put this code in a function; note that we change the input data to be the parameter we pass when we call the function, and our hard-coded year to be the year parameter (it’s really easy to forget to make these changes when you start making functions from existing code):

calc_GDP_and_filter <- function(dat, year){
  
  gdpgapfiltered <- dat %>%
      filter(year == year) %>% 
      mutate(gdp = gdpPercap * pop)
  
  return(gdpgapfiltered)
  
}

calc_GDP_and_filter(gapminder, 1997)

# A tibble: 1,704 x 7
   country      year      pop continent lifeExp gdpPercap          gdp
   <chr>       <int>    <dbl> <chr>       <dbl>     <dbl>        <dbl>
Afghanistan  1952  8425333 Asia         28.8      779.  6567086330.
Afghanistan  1957  9240934 Asia         30.3      821.  7585448670.
Afghanistan  1962 10267083 Asia         32.0      853.  8758855797.
Afghanistan  1967 11537966 Asia         34.0      836.  9648014150.
Afghanistan  1972 13079460 Asia         36.1      740.  9678553274.
Afghanistan  1977 14880372 Asia         38.4      786. 11697659231.
Afghanistan  1982 12881816 Asia         39.9      978. 12598563401.
Afghanistan  1987 13867957 Asia         40.8      852. 11820990309.
Afghanistan  1992 16317921 Asia         41.7      649. 10595901589.
Afghanistan  1997 22227415 Asia         41.8      635. 14121995875.
# ... with 1,694 more rows

The function hasn’t done what we’d expect; we’ve successfully passed in the gapminder data, via the dat parameter, but it looks like the year parameter has been ignored.

Look at the line:

    filter(year == year) %>% 

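We passed a value of year into the function when we called it, but the tibble also has a variable called year. Inside filter(), names are looked up in the tibble first, so both occurrences of year refer to the tibble’s year variable, and the comparison is TRUE for every row. R has no way of knowing that we meant the second year to refer to the function’s parameter. We run into this problem because of dplyr’s “Non Standard Evaluation” (NSE); this means that the functions don’t follow R’s usual rules about how parameters are evaluated.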

NSE is usually really useful; when we’ve written things like gapminder %>% select(year, country) we’ve made use of non standard evaluation. This is much more intuitive than the base R equivalent, where we’d have to write something like gapminder[, names(gapminder) %in% c("year", "country")]. Non standard evaluation lets the select() function work out that we’re referring to variable names in the tibble. Unfortunately this simplicity comes at a price when we come to write functions. It means we need a way of telling R whether we’re referring to a variable in the tibble, or a parameter we’ve passed via the function.
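The same ambiguity can be reproduced without dplyr. The sketch below uses base R’s subset(), which also evaluates its condition inside the data frame, as a stand-in for filter(). The data frame toy and both function names are made up for this illustration; it is not the gapminder data.

```r
# A small made-up data frame (not the real gapminder data)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1997, 2007, 1997, 2007)
)

# subset() looks up names in the data frame first, so both
# occurrences of year refer to the column and every row is kept
filter_year_broken <- function(dat, year) {
  return(subset(dat, year == year))
}

# Giving the parameter a name that is not a column removes the clash
filter_year_renamed <- function(dat, target_year) {
  return(subset(dat, year == target_year))
}

nrow(filter_year_broken(toy, 1997))   # all 4 rows are returned
nrow(filter_year_renamed(toy, 1997))  # only the 2 rows for 1997
```

Renaming the parameter only works as long as the data never contains a column with the new name, which is why a way of stating explicitly where each name comes from is more robust.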

If we’re using standard evaluation, then we see the value of the year parameter. For example, let’s put a print() statement in our function:

calc_GDP_and_filter <- function(dat, year){
  # For debugging
  print(year)
  
  gdpgapfiltered <- dat %>%
      filter(year == year) %>% 
      mutate(gdp = gdpPercap * pop)
  
  return(gdpgapfiltered)
  
}

calc_GDP_and_filter(gapminder, 1997)

[1] 1997

# A tibble: 1,704 x 7
   country      year      pop continent lifeExp gdpPercap          gdp
   <chr>       <int>    <dbl> <chr>       <dbl>     <dbl>        <dbl>
Afghanistan  1952  8425333 Asia         28.8      779.  6567086330.
Afghanistan  1957  9240934 Asia         30.3      821.  7585448670.
Afghanistan  1962 10267083 Asia         32.0      853.  8758855797.
Afghanistan  1967 11537966 Asia         34.0      836.  9648014150.
Afghanistan  1972 13079460 Asia         36.1      740.  9678553274.
Afghanistan  1977 14880372 Asia         38.4      786. 11697659231.
Afghanistan  1982 12881816 Asia         39.9      978. 12598563401.
Afghanistan  1987 13867957 Asia         40.8      852. 11820990309.
Afghanistan  1992 16317921 Asia         41.7      649. 10595901589.
Afghanistan  1997 22227415 Asia         41.8      635. 14121995875.
# ... with 1,694 more rows

We can use calc_GDP_and_filter()’s year parameter like a normal variable in our function, except when we’re using it as part of a parameter to a dplyr verb (e.g. filter), or another function that uses non standard evaluation. We need to unquote the year parameter so that the dplyr function can see its contents (i.e. 1997 in this example). We do this using the UQ() (“unquote”) function:

    filter(year == UQ(year) ) %>% 

When the filter function is evaluated it will see:

    filter(year == 1997) %>% 

Modifying our function, we have:

calc_GDP_and_filter <- function(dat, year){
  
  gdpgapfiltered <- dat %>%
      filter(year == UQ(year) ) %>% 
      mutate(gdp = gdpPercap * pop)
  
  return(gdpgapfiltered)
  
}

calc_GDP_and_filter(gapminder, 1997)

# A tibble: 142 x 7
   country      year       pop continent lifeExp gdpPercap           gdp
   <chr>       <int>     <dbl> <chr>       <dbl>     <dbl>         <dbl>
Afghanistan  1997  22227415 Asia         41.8      635.  14121995875.
Albania      1997   3428038 Europe       73.0     3193.  10945912519.
Algeria      1997  29072015 Africa       69.2     4797. 139467033682.
Angola       1997   9875024 Africa       41.0     2277.  22486820881.
Argentina    1997  36203463 Americas     73.3    10967. 397053586287.
Australia    1997  18565243 Oceania      78.8    26998. 501223252921.
Austria      1997   8069876 Europe       77.5    29096. 234800471832.
Bahrain      1997    598561 Asia         73.9    20292.  12146009862.
Bangladesh   1997 123315288 Asia         59.4      973. 119957417048.
Belgium      1997  10199787 Europe       77.5    27561. 281118335091.
# ... with 132 more rows

Our function now works as expected.

Another way of thinking about quoting

(This section is based on the “programming with dplyr” vignette.)

Consider the following code:

greet <- function(person){
  print("Hello person")
}

greet("David")

[1] "Hello person"

The print function has no way of knowing that person refers to the variable person, rather than being literal text within the string. To make the contents of the person variable visible to the function, we need to construct the string using the paste function:

greet <- function(person){
  print(paste("Hello", person))
}

greet("David")

[1] "Hello David"

Here person appears outside the quotation marks (it is unquoted), so R evaluates it and uses its contents.

There is one small issue that remains; how does filter know that the first year in filter(year == UQ(year) ) %>% refers to the year variable in the tibble? What happens if we delete the year variable? Surely this should give an error?

gap_noyear <- gapminder %>% select(-year)
calc_GDP_and_filter(gap_noyear, 1997)

# A tibble: 1,704 x 6
   country          pop continent lifeExp gdpPercap          gdp
   <chr>          <dbl> <chr>       <dbl>     <dbl>        <dbl>
Afghanistan  8425333 Asia         28.8      779.  6567086330.
Afghanistan  9240934 Asia         30.3      821.  7585448670.
Afghanistan 10267083 Asia         32.0      853.  8758855797.
Afghanistan 11537966 Asia         34.0      836.  9648014150.
Afghanistan 13079460 Asia         36.1      740.  9678553274.
Afghanistan 14880372 Asia         38.4      786. 11697659231.
Afghanistan 12881816 Asia         39.9      978. 12598563401.
Afghanistan 13867957 Asia         40.8      852. 11820990309.
Afghanistan 16317921 Asia         41.7      649. 10595901589.
Afghanistan 22227415 Asia         41.8      635. 14121995875.
# ... with 1,694 more rows

As you can see, it doesn’t; instead the filter() function will “fall through” to look for the year variable in filter()’s calling environment. This is the calc_GDP_and_filter() environment, which does have a year variable. Since this is a standard R variable, it will be implicitly unquoted, so filter() will see:

    filter(1997 == 1997) %>% 

which is always TRUE, so the filter won’t do anything!

We need a way of telling the function that the first year “belongs” to the data. We can do this with the .data pronoun:

calc_GDP_and_filter <- function(dat, year){
  
  gdpgapfiltered <- dat %>%
      filter(.data$year == UQ(year) ) %>% 
      mutate(gdp = .data$gdpPercap * .data$pop)
  
  return(gdpgapfiltered)
  
}

calc_GDP_and_filter(gapminder, 1997)

# A tibble: 142 x 7
   country      year       pop continent lifeExp gdpPercap           gdp
   <chr>       <int>     <dbl> <chr>       <dbl>     <dbl>         <dbl>
Afghanistan  1997  22227415 Asia         41.8      635.  14121995875.
Albania      1997   3428038 Europe       73.0     3193.  10945912519.
Algeria      1997  29072015 Africa       69.2     4797. 139467033682.
Angola       1997   9875024 Africa       41.0     2277.  22486820881.
Argentina    1997  36203463 Americas     73.3    10967. 397053586287.
Australia    1997  18565243 Oceania      78.8    26998. 501223252921.
Austria      1997   8069876 Europe       77.5    29096. 234800471832.
Bahrain      1997    598561 Asia         73.9    20292.  12146009862.
Bangladesh   1997 123315288 Asia         59.4      973. 119957417048.
Belgium      1997  10199787 Europe       77.5    27561. 281118335091.
# ... with 132 more rows

calc_GDP_and_filter(gap_noyear, 1997)

Error in filter_impl(.data, quo): Evaluation error: Column `year` not found in `.data`.

As you can see, we’ve also used the .data pronoun when calculating the GDP. If our tibble were missing either the gdpPercap or pop variables, R would search for them in the calc_GDP_and_filter() function’s environment. As the variables aren’t found there, it would look in the environment in which calc_GDP_and_filter() was defined (usually the global environment), and so on. If it found variables matching these names, they would be used instead, giving an incorrect result; if they could not be found we would get an error. Using the .data pronoun makes the source of the data clear, and prevents this risk.
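Newer releases of rlang and dplyr recommend the !! operator in place of UQ(), and also provide an .env pronoun as the counterpart of .data. The sketch below shows both spellings on a small made-up tibble; toy, calc_gdp_bang and calc_gdp_env are hypothetical names for this illustration, and it assumes a dplyr version recent enough to support .env.

```r
library(dplyr)

# A small made-up tibble (not the real gapminder data)
toy <- tibble(
  country   = c("A", "A", "B", "B"),
  year      = c(1997, 2007, 1997, 2007),
  pop       = c(10, 12, 20, 25),
  gdpPercap = c(100, 110, 200, 210)
)

# Unquote the parameter with !! (the operator form of UQ())
calc_gdp_bang <- function(dat, year) {
  result <- dat %>%
    filter(.data$year == !!year) %>%
    mutate(gdp = .data$gdpPercap * .data$pop)
  return(result)
}

# State explicitly that the second year comes from the function's environment
calc_gdp_env <- function(dat, year) {
  result <- dat %>%
    filter(.data$year == .env$year) %>%
    mutate(gdp = .data$gdpPercap * .data$pop)
  return(result)
}

res_bang <- calc_gdp_bang(toy, 1997)
res_env <- calc_gdp_env(toy, 1997)
```

Both functions keep only the 1997 rows. The .data and .env pair reads symmetrically: one name comes from the tibble, the other from the function.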

Challenge: Filtering by country name and year

Create a new function that will filter by country name and by year, and then calculate the GDP.

Solution

calcGDPCountryYearFilter <- function(dat, year, country){
  dat <- dat %>%
    filter(.data$year == UQ(year) ) %>%
    filter(.data$country == UQ(country) ) %>%
    mutate(gdp = .data$pop * .data$gdpPercap)

  return(dat)
}
calcGDPCountryYearFilter(gapminder, year=2007, country="United Kingdom")

# A tibble: 1 x 7
  country         year      pop continent lifeExp gdpPercap     gdp
  <chr>          <int>    <dbl> <chr>       <dbl>     <dbl>   <dbl>
1 United Kingdom  2007 60776238 Europe       79.4    33203. 2.02e12

Tip

The “programming with dplyr” vignette is highly recommended if you are writing functions for dplyr. It can be accessed by typing vignette("programming", package="dplyr").

Tip: Testing and documenting

It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think it does.

When you first start out, your workflow will probably look a lot like this:

Write a function

Comment parts of the function to document its behaviour

Load in the source file

Experiment with it in the console to make sure it behaves as you expect

Make any necessary bug fixes

Rinse and repeat.

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects.

Formal automated tests can be written using the testthat package.
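As a taste of what this looks like, here is a minimal sketch of fahr_to_kelvin documented with roxygen2-style comments and checked with base R’s stopifnot(). The wording of the documentation is our own, and stopifnot() is used here only as a simple stand-in for a proper testthat test.

```r
#' Convert a temperature from Fahrenheit to Kelvin
#'
#' @param temp Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Informal checks: all.equal() compares numbers with a small tolerance,
# which avoids spurious failures from floating point rounding
stopifnot(isTRUE(all.equal(fahr_to_kelvin(32), 273.15)))
stopifnot(isTRUE(all.equal(fahr_to_kelvin(212), 373.15)))
```

The #' comments have no effect when the script is run; they only become help pages once roxygen2 processes them as part of a package.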

In this episode we’ve introduced the idea of writing your own functions, and talked about the complications caused by non standard evaluation. The forthcoming Research IT course “Programming in R” will cover writing, testing and documenting functions in much more detail. We will notify participants of this course when it is available to book.

Key Points

Use function to define a new function in R.

Use parameters to pass values into functions.

Load functions into programs using source.
