This lecture is on building R data packages. This is a relatively advanced topic, but I think it is important and can be learned quickly. This lecture assumes you have a good working knowledge of R and your computer file system. You do not need to be an expert in R to do this.
Why would you want to build a data package? Here are the stages your data will pass through:
The data package is meant to hold processed data and the code you used to generate it from the raw data. Since these two things are related, it is natural to keep them together in the same digital file. Data processing is often computationally intensive or time-consuming, so you don’t want to repeat it every time you interact with your work. Some analyses (like single cell RNAseq dimension reduction) are not exactly reproducible from run to run, so saving a processed data object means you always start your subsequent analysis from the same place.
What should and shouldn’t be in a data package? You will have to decide this for yourself. Making a data package takes some effort, so you won’t want to save things that can be calculated very rapidly or derived from other data objects. Generally you want a data package to be no more than a few GB in size, so you may need to be selective about what you save. If it takes more than 3-4 seconds to calculate the data for a table or figure, I will usually put it in the data package. Here are the types of things I do and don’t include:
Do include:
Do not include:
In summary, the benefits of using a data package are:
Once you have identified a data object you would like to put into a package, you first need to save it to disk as a .rda file.
Usually you are going to be starting from within an analysis project.
Here are some rules/best practices:
# some demo data to save
demo_iris_data <- iris
# make a data directory
dir.create("data")
# save the data as a .rda file
save(demo_iris_data,
file = "data/demo_iris_data.rda",
compress = "bzip2")
At this point you could just stop and read in the data later when needed using load("data/demo_iris_data.rda"). But it is better to put this into a package and then install the package in your project for the reasons mentioned above.
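As a quick check of the save-and-load round trip, here is a minimal sketch using only base R. It writes to a temporary directory instead of your project's data directory, and the names rda_path and loaded_env are just for illustration.

```r
# Write a small object to a temporary .rda file
# (a stand-in for data/demo_iris_data.rda in your project)
demo_iris_data <- iris
rda_path <- file.path(tempdir(), "demo_iris_data.rda")
save(demo_iris_data, file = rda_path, compress = "bzip2")

# Load into a fresh environment so nothing in your workspace is overwritten.
# load() returns the names of the objects it restored.
loaded_env <- new.env()
loaded_names <- load(rda_path, envir = loaded_env)

print(loaded_names)
# Confirm the restored object matches the original
print(identical(loaded_env$demo_iris_data, iris))
```

Note that load() restores objects under the names they were saved with, which is one reason to give each object a clear, unique name before saving.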
First you might want to update your ~/.Rprofile. This will add your contact info to the package.
options(
usethis.full_name = "Jane Doe",
usethis.protocol = "ssh",
usethis.description = list(
"Authors@R" = utils::person(
"Jane", "Doe",
email = "[email protected]",
role = c("aut", "cre"),
comment = c(ORCID = "JANE'S-ORCID-ID")
),
Version = "0.0.0.9000"
),
usethis.destdir = "~/the/place/where/I/keep/my/R/projects",
usethis.overwrite = TRUE
)
Then use the dedicated function from blaseRtemplates to set up the package project.
As with a regular project, you want to make a readme and set it up to use git using the initialization script:
# make a software license
usethis::use_mit_license("<your name here>")
# generate a readme file to explain your work
usethis::use_readme_md(open = FALSE)
# *** Only if developing a package ***
# run to generate a news file to document updates.
usethis::use_news_md()
# set your default branch to "main" for git init
system("git config --global init.defaultBranch main")
# initialize git
usethis::use_git()
# initialize github
usethis::use_github(private = TRUE)
### Delete this file after initializing the project! ###
A package is like a project but with a few additional requirements, like a DESCRIPTION file. The initialize_package function will set that up for you. Here is how it will look when you are done:
.
|-- blaser.park.datapkg.Rproj
|-- data
| |-- cds_heme_combined.rda
| |-- cds_heme_combined_tm.csv
| `-- human_hsc_pseudobulk_res.rda
|-- DESCRIPTION
|-- inst
| |-- data-raw
| | |-- git_commands.R
| | |-- hsc_pseudobulk.R
| | `-- reprocess_cds.R
| `-- extdata
| |-- cds_marrow_heme_models
| | |-- file_index.rds
| | |-- rdd_pca_transform_model.rds
| | |-- rdd_umap_transform_model_annoy.idx
| | |-- rdd_umap_transform_model.rds
| | `-- rdd_umap_transform_model_umap.idx
| `-- gene_targets.csv
|-- library_catalogs
| `-- blas02_blaser.park.datapkg.tsv
|-- LICENSE
|-- LICENSE.md
|-- man
| |-- cds_heme_combined.Rd
| `-- human_hsc_pseudobulk_res.Rd
|-- NAMESPACE
|-- NEWS.md
|-- R
| `-- data.R
`-- README.md
8 directories, 23 files
Some important things to note
This is very simple. Just move the data directory from the analysis project into the root directory of the new data package. If you are editing an existing package, just move the new .rda files. You can only have 1 data directory.
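If you prefer to do this from R rather than your file browser, the following is a rough sketch using base R file functions. The two directories are hypothetical stand-ins created under tempdir() so the example can run anywhere. Substitute the real paths to your analysis project and data package.

```r
# Hypothetical locations, created in a temporary directory for demonstration
analysis_dir <- file.path(tempdir(), "analysis_project")
package_dir  <- file.path(tempdir(), "demo.datapkg")
dir.create(file.path(analysis_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(package_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Put one .rda file in the analysis project so there is something to copy
demo_iris_data <- iris
save(demo_iris_data,
     file = file.path(analysis_dir, "data", "demo_iris_data.rda"),
     compress = "bzip2")

# Copy every .rda file into the package's single data directory
rda_files <- list.files(file.path(analysis_dir, "data"),
                        pattern = "\\.rda$", full.names = TRUE)
copied <- file.copy(rda_files, file.path(package_dir, "data"), overwrite = TRUE)

print(list.files(file.path(package_dir, "data")))
```

This sketch copies rather than moves, so the originals stay in the analysis project until you are sure the package builds.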
It is good to keep the processed data with the code you used to process it. That way if there is a problem you can track it down easily. This code should go in a new directory called “data-raw”. This is an unfortunate name because there is no data in there, only code.
You need to enclose this in a directory called “inst”. Everything in the inst directory gets installed with your package.
If you want to actually include the raw data or other files with your package, make a directory called “inst/extdata” and put them in there.
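Once the package is installed, the contents of inst are copied to the top level of the installed package, so the files are found under "extdata" rather than "inst/extdata". Here is a small sketch of how you might locate one of those files with base R's system.file(). The package and file names are taken from the example directory tree above and will only resolve if that package is installed on your machine.

```r
# Package and file names from the example tree; replace with your own
pkg_name <- "blaser.park.datapkg"

# system.file() returns "" if the package or file cannot be found
targets_path <- system.file("extdata", "gene_targets.csv", package = pkg_name)

if (nzchar(targets_path)) {
  gene_targets <- read.csv(targets_path)
  print(head(gene_targets))
} else {
  message("File not found; is ", pkg_name, " installed?")
}
```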
Documenting your data is useful to yourself and others.
All objects in your data folder need to have a documentation entry. Documentation is done in a structured text language called Roxygen which is very easy to work with.
The easiest way to start is with annotating a simple data frame. The sinew package will help you get the format correct.
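A minimal sketch of that step, assuming the sinew package is installed and that its makeOxygen() function is called directly on the data frame. The guard skips the call if sinew is not available, so nothing is printed in that case.

```r
demo_iris_data <- iris

# Print a Roxygen skeleton for the data frame to the console
# (assumes sinew is installed; skipped otherwise)
if (requireNamespace("sinew", quietly = TRUE)) {
  sinew::makeOxygen(demo_iris_data)
}
```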
Copy this from the console then just fill in the blanks. Usually the most important things are to have a clear, descriptive, unique title and then in the description line, to describe how the data was generated and where to find the code for it. For example: This is a standard data set distributed with R, with modifications. See data-raw/file_with_code_you_used.R.
Sinew only works for data frames and functions. If you want to document other types of objects, you can just copy/paste the format.
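For reference, here is a sketch of what a filled-in entry in R/data.R might look like for the demo data saved earlier. The title and description wording are only an illustration, and the script name is the placeholder used above. The quoted object name on the last line tells Roxygen which data object the entry documents.

```r
#' Demo iris data
#'
#' A copy of the iris data set distributed with R.
#' See data-raw/file_with_code_you_used.R.
#'
#' @format A data frame with 150 rows and 5 variables:
#' \describe{
#'   \item{Sepal.Length}{sepal length in cm}
#'   \item{Sepal.Width}{sepal width in cm}
#'   \item{Petal.Length}{petal length in cm}
#'   \item{Petal.Width}{petal width in cm}
#'   \item{Species}{iris species, a factor with 3 levels}
#' }
"demo_iris_data"
```

For objects that are not data frames, the same layout works if you replace the format section with a sentence describing the object's class and contents.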
Make sure you remember to increment the version of the data in the DESCRIPTION file. Then jot a few notes about what changed in NEWS.md.
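To make the version bump concrete, here is a base R sketch that reads the Version field, increments it, and writes it back. It works on a throwaway DESCRIPTION file in tempdir(), and the package name is hypothetical. On a real package I would edit the file by hand or use usethis::use_version(), because rewriting a full DESCRIPTION with write.dcf() can change its formatting.

```r
# A throwaway DESCRIPTION file for demonstration (hypothetical package name)
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(c("Package: demo.datapkg", "Version: 0.0.0.9000"), desc_path)

# Read the current version and split it into its numeric parts
desc <- read.dcf(desc_path)
parts <- unclass(package_version(desc[1, "Version"]))[[1]]

# Drop the development suffix and bump the patch number
new_version <- paste(c(parts[1:2], parts[3] + 1L), collapse = ".")

# Write the new version back to the file
desc[1, "Version"] <- new_version
write.dcf(desc, desc_path)

print(new_version)
```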
Finally you need to tell R to generate the documentation files and build the data package.
# generate the formatted documentation manuals
devtools::document()
# optionally you can now commit and push to github using the terminal
# build the data package (by default a source bundle, not a platform-specific binary)
devtools::build()
That’s it. When you run the last command, a .tar.gz file will be generated in the directory enclosing your data package code. With the default settings this is a source bundle, which can be installed like any other package file. You can move that file wherever you like. Better yet, you can use the path argument of devtools::build() to tell R where to save the built package. I usually save to network storage. Be sure to increment your version number each time you make changes so you don’t overwrite old versions of your data. It is always good to be able to go back in time if needed.
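Here is a sketch of building to a specific location. The build_dir below is a hypothetical stand-in for your network storage, and the guard means the build only runs when devtools is installed and the working directory is a package (it contains a DESCRIPTION file).

```r
# Hypothetical output location; replace with your network storage path
build_dir <- file.path(tempdir(), "datapkg_builds")
dir.create(build_dir, recursive = TRUE, showWarnings = FALSE)

# Only build when run from the root of a package with devtools available
if (requireNamespace("devtools", quietly = TRUE) && file.exists("DESCRIPTION")) {
  # path controls where the built .tar.gz is written
  built_file <- devtools::build(pkg = ".", path = build_dir)
  print(built_file)
}
```

Because the version number is part of the file name, each new version lands next to the old ones in build_dir instead of replacing them.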
The package can easily be shared inside your firewall with local collaborators. For external collaborators or reviewers, you can use repositories such as Mendeley Data, Zenodo, and Dryad to share large data packages. Each has a straightforward web interface and the option to keep data private until the manuscript is published.
The packages we make may be a little different from typical R packages. They have no functions, which is somewhat unusual, but the biggest difference is size. If you are working with single cell data, it is likely that the size of your data package will exceed what R can handle with its normal mechanisms for loading data.
Instead we use functions from a couple of packages to load the data into your R session as a digital “pointer”. Until you ask R to reference a particular data item, it will reside in memory as a tiny digital address to an area on your hard drive. When you reference those data in your code, they are loaded into memory to be used. This has the additional advantage of reducing memory requirements for your work.
The function will go to the directory where your data lives, check for the latest version, and install it if necessary. This is the recommended way to go because you almost always want the latest version and you don’t want to waste time reinstalling it when you already have it.
Then the function loads the digital pointers into your R session. Check out the environment dropdown menu to see for yourself.
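To get a feel for how a “pointer” behaves, here is a small base R analogy using delayedAssign(), which creates a promise that is only evaluated the first time you touch it. This is an illustration of the idea and not necessarily the mechanism the loading function uses internally. The names state and big_object are made up for the demo.

```r
# A small environment used to record whether the data has been loaded yet
state <- new.env()
state$loaded <- FALSE

# Create a lazy binding: the expression is not evaluated at this point
delayedAssign("big_object", {
  state$loaded <- TRUE   # runs only when big_object is first used
  iris                   # stand-in for reading a large object from disk
})

loaded_before <- state$loaded   # nothing has been evaluated yet

# The first real use forces the promise and brings the data into memory
n_rows <- nrow(big_object)

loaded_after <- state$loaded

print(c(before = loaded_before, after = loaded_after))
```

The object costs almost nothing until the line that calls nrow(), which is why a session full of unused pointers stays light on memory.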
By now you should be comfortable setting up an R project, analyzing your data and making figures for a manuscript or presentation.
The loading function described above is blaseRtemplates::project_data().