Tutorial: Running R in Galileo

Written and developed by Matthew Gasperetti

Getting started with R in Galileo

To get started with Galileo, log into your account using Firefox or Chrome, and download our R example folder from GitHub. As you read through this tutorial, we’ll take a look at the files in the example folder, and I’ll explain how everything works.

The downloaded folder contains an .R file, a .csv file, and a Dockerfile. We’ll try running this folder in Galileo first, and then take a look at what’s happening behind the scenes.

Now for the fun part: let’s look at our files and drag and drop them to Galileo

Our rExample folder contains three files named rMonteCarlo.R, mtcars.csv, and Dockerfile. Dragging and dropping the rExample folder to Galileo will run the code in rMonteCarlo.R, which conducts a linear regression, makes a simple plot, and then runs two Monte Carlo simulations.
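To make the regression and plotting steps concrete, here is a minimal sketch of what such a script could look like. It is not the code in rMonteCarlo.R: the model mpg ~ wt, the file name mpg_vs_wt.pdf, and the use of R’s built-in copy of mtcars are all assumptions chosen for illustration.

```r
# Assumption: we use R's built-in mtcars data and regress mpg on wt.
# The real rMonteCarlo.R may use different variables.
cars <- datasets::mtcars

# Fit a simple linear regression and print the summary.
fit <- lm(mpg ~ wt, data = cars)
print(summary(fit))

# Save a scatter plot with the fitted line to a file.
# A temporary folder is used here so the demo leaves no files behind;
# the file name is hypothetical.
plot_file <- file.path(tempdir(), "mpg_vs_wt.pdf")
pdf(plot_file)
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
dev.off()
```

Printing the summary and writing the plot to a file, rather than relying on an interactive plot window, is what lets a script like this run unattended.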

The first Monte Carlo simulates tossing two dice and calculates the number of rolls that are 7 or less. The second Monte Carlo increases the number of iterations and runs the simulation in parallel.
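The dice experiment described above can be sketched in a few lines of base R. This is an illustrative version, not the code from rMonteCarlo.R; the function name roll_share and the number of rolls are assumptions.

```r
# Simulate n tosses of two dice and return the share of tosses
# whose sum is 7 or less. roll_share is a hypothetical helper name.
roll_share <- function(n) {
  die1 <- sample(1:6, n, replace = TRUE)
  die2 <- sample(1:6, n, replace = TRUE)
  mean(die1 + die2 <= 7)
}

set.seed(42)                  # make the run reproducible
share <- roll_share(100000)
print(share)

# 21 of the 36 equally likely outcomes sum to 7 or less.
exact <- 21 / 36
print(exact)
```

With enough rolls, the simulated share should settle close to the exact value of 21/36, which is a handy sanity check for the simulation.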

When you log into Galileo, the first thing you’ll see is your Dashboard:

View of the Galileo Dashboard

To run the example, drag and drop the rExample folder you downloaded from our GitHub to the Galilei station at the top of the Dashboard:

Drag and drop the rExample folder to the Galilei station

After you drag and drop the rExample folder to Galileo, you’ll be able to see the job running in the Your Recent Jobs panel:

The rExample job takes about 13 seconds to complete

When the example job completes, hit the download button under Action to download the results:

Download button

The results folder will be downloaded as a .zip that contains an output.log file returning the results of the analysis and a folder called filesys where any plots or other files that were created by the analysis are stored.

The downloaded .zip file contains a folder called filesys and a file called output.log

Let’s take a look at the output.log file first, which returns the results of the regression we ran and the results of our Monte Carlo simulations:

The results of our regression analysis and Monte Carlo experiments

Next, if we look in the filesys folder, we can see the plot that we made:

Running your own R files in Galileo—A closer look at how it works

To run your own jobs in Galileo, it is important to understand what is happening behind the scenes. Galileo creates a Docker container on a powerful cloud instance, sends your code to that container via HTTPS, executes it, and then sends the results back via HTTPS.

A closer study of the files in our rExample folder will help illustrate how to modify them so we can run other jobs. After that, we’ll have a look at the Galileo Docker Wizard, which helps automate the process.

How to code the Dockerfile

Let’s quickly review the example Dockerfile, which you can open with a text editor like Atom.

The first thing to notice is that the file is called Dockerfile with no extension. It cannot be called anything else—Dockerfile2, Dockerfile copy, or Dockerfile.txt won’t work.

Looking at the Dockerfile with our text editor, the first Docker command we see is:

FROM rocker/r-apt:bionic

This tells Docker how to set up the R environment that we’ll be using. We want to leave it as is.

The second Docker command we see is:

RUN R -e 'options(Ncpus = 32)'

This command tells R how many CPUs to use—32 in this case. R is very good at parallelizing code, but if you max out the number of CPUs, there won’t be enough processing power to manage the parallelization. As a result, it is best to leave at least two CPUs free.
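Inside an R script, the same "leave two CPUs free" rule of thumb can be applied when choosing how many workers to use. The sketch below is an assumption about how one might do this with the parallel package; the variable names and chunk sizes are illustrative and not taken from rMonteCarlo.R.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave two cores free, but always use at least one worker.
workers <- max(1L, available - 2L)

# mclapply relies on forking, which is not available on Windows,
# so use a single worker there.
if (.Platform$OS.type == "windows") workers <- 1L

# Split one million dice tosses into ten chunks and simulate them
# across the workers. Each chunk returns the share of sums <= 7.
chunks <- rep(100000, 10)
shares <- mclapply(chunks, function(n) {
  mean(sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE) <= 7)
}, mc.cores = workers)

overall <- mean(unlist(shares))
print(overall)
```

Computing the worker count from the machine, instead of hard-coding it, means the same script behaves sensibly on a laptop and on a large cloud instance.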

Next, we see the R command install.packages("parallel") wrapped in a Docker command:

RUN R -e 'install.packages("parallel")'

This command installs your R packages from source to the Docker container we are creating. This can take a while with packages that are large, but once the package is installed and the Docker container is built, it will run immediately the next time.

To install multiple packages from source, copy and paste the command above and modify it by adding the name of the package you want, say cluster, on a new line like this:

RUN R -e 'install.packages("parallel")'
RUN R -e 'install.packages("cluster")'

The next line of code is commented out, but it shows how to install a package from a binary, which is much faster for large packages like dplyr, caret, or ggplot2. To run it, all I’d have to do is remove the hash mark (#), just as in R.

#RUN apt-get update && apt-get install -y -qq r-cran-parallel

The problem is that there is no binary for the parallel package on CRAN, so if I uncomment this command and drag and drop the rExample folder to Galileo again, we’ll get an error.

The other thing to be aware of is that the binary package name must be written entirely in lowercase letters, or it will produce an error. CausalImpact, for example, must be written as causalimpact. If uncommented, this code will cause an error:

#RUN apt-get update && apt-get install -y -qq r-cran-CausalImpact

Whereas, this code will not:

RUN apt-get update && apt-get install -y -qq r-cran-causalimpact
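If you keep a list of CRAN package names in R, the lowercase rule can be applied mechanically. The helper below is hypothetical (binary_name is not part of Galileo or R), and it only builds the name; it does not check whether a binary actually exists for that package.

```r
# Build the binary package name used with apt-get from a CRAN
# package name. binary_name is a hypothetical helper.
binary_name <- function(pkg) {
  paste0("r-cran-", tolower(pkg))
}

print(binary_name("CausalImpact"))          # "r-cran-causalimpact"
print(binary_name(c("dplyr", "ggplot2")))   # "r-cran-dplyr" "r-cran-ggplot2"
```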

To add multiple binary packages to your Dockerfile, copy and paste the code above to a new line and change the package name at the end, like this:

RUN apt-get update && apt-get install -y -qq r-cran-dplyr
RUN apt-get update && apt-get install -y -qq r-cran-ggplot2

Now that we understand installing packages, let’s look at the next line of code we see in our Dockerfile:

COPY . .

This tells Docker where to look for, and where to save, our files and should be left as is.

The final command is:

ENTRYPOINT ["Rscript","rMonteCarlo.R"]

This tells Docker that we are running an R script called rMonteCarlo.R. To run an R script called myProject.R, we’d use the following code:

ENTRYPOINT ["Rscript","myProject.R"]

Here is the Dockerfile for rExample in its entirety with comments:

#The line below determines the build image to use
FROM rocker/r-apt:bionic
#The next block determines what dependencies to load
RUN R -e 'options(Ncpus = 32)'
#There are two ways to install packages; this method installs from source
RUN R -e 'install.packages("parallel")'
#This method installs from binaries and is faster but not every package has a binary
#RUN apt-get update && apt-get install -y -qq r-cran-parallel
#This line determines where to copy project files from, and where to copy them to
COPY . .
#The entrypoint is the command used to start your project
ENTRYPOINT ["Rscript","rMonteCarlo.R"]
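To see how these pieces fit together for a different project, here is a rough R sketch that assembles the same Dockerfile lines from a script name and package lists. The function make_dockerfile and its arguments are hypothetical; this is only an illustration of the file’s structure, not how Galileo generates Dockerfiles.

```r
# Assemble Dockerfile lines following the structure shown above.
# make_dockerfile and its arguments are hypothetical names.
make_dockerfile <- function(script, ncpus = 32,
                            source_pkgs = character(0),
                            binary_pkgs = character(0)) {
  lines <- c(
    "FROM rocker/r-apt:bionic",
    sprintf("RUN R -e 'options(Ncpus = %d)'", as.integer(ncpus))
  )
  # One RUN line per package installed from source.
  for (pkg in source_pkgs) {
    lines <- c(lines, sprintf("RUN R -e 'install.packages(\"%s\")'", pkg))
  }
  # One RUN line per binary package; names must be lowercase.
  for (pkg in binary_pkgs) {
    lines <- c(lines, sprintf(
      "RUN apt-get update && apt-get install -y -qq r-cran-%s", tolower(pkg)))
  }
  c(lines, "COPY . .", sprintf("ENTRYPOINT [\"Rscript\",\"%s\"]", script))
}

dockerfile <- make_dockerfile("myProject.R",
                              source_pkgs = "cluster",
                              binary_pkgs = c("ggplot2", "dplyr"))

# Write to a temporary folder for the demo; in a real project the file
# must be named Dockerfile and sit next to your script and data.
writeLines(dockerfile, file.path(tempdir(), "Dockerfile"))
cat(dockerfile, sep = "\n")
```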

Now, let’s have a look at our .R file

After opening the file rMonteCarlo.R, the first thing to notice is that we DO NOT use the install.packages() command in the R file. Instead, we used Docker to install our packages. 

However, in the R file we do have to load the packages using the library() command like so:

library(parallel)
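Because the packages are installed by the Dockerfile rather than by the script, it can help to fail early with a clear message when one is missing. The check below is an optional sketch, not something rMonteCarlo.R is known to do; the variable names are illustrative.

```r
# Packages the script expects the Dockerfile to have installed.
required <- c("parallel")

# requireNamespace() returns TRUE when a package can be loaded.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

# Stop with a readable message instead of a cryptic library() error.
if (length(not_installed) > 0) {
  stop("Missing packages (add them to the Dockerfile): ",
       paste(not_installed, collapse = ", "))
}

library(parallel)
```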

The other important thing to note is that we call the dataset we are using, mtcars.csv, like it is in our working directory with the following command:

mtcars <- read.csv("mtcars.csv")

Notice the path is relative, not absolute. The code below will NOT work and will cause an error:

mtcars <- read.csv("/Users/Matthew/mtcars.csv")
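The following self-contained sketch shows why the relative path works: the script and the data sit in the same folder, and that folder is the working directory when the script runs. To keep the demo runnable anywhere, it creates a stand-in project folder in a temporary location and fills it with R’s built-in mtcars data, which is assumed to match the mtcars.csv in rExample.

```r
# Create a stand-in project folder containing mtcars.csv.
# Assumption: the built-in dataset matches the example's CSV file.
project_dir <- tempfile("rExample")
dir.create(project_dir)
write.csv(datasets::mtcars, file.path(project_dir, "mtcars.csv"))

# Make the project folder the working directory, as it would be
# when the script runs next to its data.
old_wd <- setwd(project_dir)

# The relative path resolves against the working directory.
cars <- read.csv("mtcars.csv")

# Restore the previous working directory.
setwd(old_wd)

print(dim(cars))
```

An absolute path such as /Users/Matthew/mtcars.csv only exists on the machine where it was written, which is why it fails once the code runs somewhere else.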

Let’s look at the dataset next

This is just the standard mtcars.csv dataset that is used in a lot of applied examples and tutorials. There’s nothing special here. I just want to show how to call data correctly from your Rscript.

Using the Docker Wizard to create your own project

If you drag and drop a folder to Galileo that only contains an .R file and a .csv file, but no Dockerfile, you will see a Docker Wizard prompt:

The Docker Wizard helps automate creating a Dockerfile

To create a Dockerfile for an R script called myProject.R that sets the number of CPUs available for multithreading to 32 and loads the cluster, ggplot2, and dplyr packages, enter the following settings into the Docker Wizard:

An example showing how to use Galileo’s Docker Wizard

Once you complete your custom Dockerfile, make sure to add it to the project folder containing your myProject.R script and your data. Your folder should look like this:

Your project folder should contain your R script, your data, and your Dockerfile

Now that your folder looks right, drag and drop the folder onto Galilei in your Dashboard at https://app.galileoapp.io

I hope this tutorial was helpful. Please let me know if you have any questions or any problems using Galileo. Your feedback is extremely important to us. Contact me anytime at matthew@hypernetlabs.io.