Input data preparation library

NOTE: This page is outdated.

With the moinput ("Model Operations INPUT") data preparation library we aim to harmonize the entire preparation process for regional input data for models like REMIND and MAgPIE. This harmonization makes it possible to aggregate the data automatically to any chosen regional aggregation scheme (instead of sticking to the regions that have been hard-coded so far), to update input data easily, and to understand more easily the conversion routines that others have written. The moinput library is an R package.

Because the different data sources require very specific read and conversion functions, and the different model inputs require specific calculations, only the meta-structure could be standardized. Most of the code therefore still has to be written by the experts for each data set. To this end, the moinput library provides various helper functions that make developing this code easier. This is explained in detail below.

Preparing a test environment on your computer

There are two parts involved here: the moinput R library (where you will create new functions) and the data folders (where the raw data to be processed is stored). Update (or install) the newest versions of the moinput and madrat libraries. Replicate the data folder structure locally by copying the folder /p/projects/rd3mod/inputdata (6.59 GB, found on the PIK network) [or by copying and unzipping /scratch/01/baumstark/MO/input_preparation.zip (2.2 GB, also found on the PIK network); current location??] to any location on your computer. Load the moinput project (moinput.Rproj) in RStudio and set the working directory to the folder with the data sources. Then load the moinput library.

library(moinput)

It will automatically recognize the data source folders and configure itself accordingly. Use getConfig() to show the current configuration and setConfig(...) to modify it.
If you do not use RStudio, you can work in any other interactive R session by following the same steps.

> getConfig()

Initialize moinput config with default settings..
    regionmapping = regionmappingREMIND.csv
    verbosity = 1
    enablecache = TRUE
    mainfolder = C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation
    sourcefolder = NA
    cachefolder = NA
    mappingfolder = NA
    outputfolder = NA
    pop_threshold = 1e+06
    forcecache = FALSE
..done!

$regionmapping
[1] "regionmappingREMIND.csv" 

$verbosity
[1] 1

$enablecache
[1] TRUE

$mainfolder
[1] "C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation" 

$sourcefolder
[1] "C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation/sources" 

$cachefolder
[1] "C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation/cache" 

$mappingfolder
[1] "C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation/mappings" 

$outputfolder
[1] "C:/Users/dklein/Documents/1_Projekte/MO/MO-Input/input_preparation/output" 

$pop_threshold
[1] 1e+06

$forcecache
[1] FALSE

When working on PIK's cluster (i.e. the login nodes), it should be sufficient to load the library with library(moinput). The library is then loaded with default settings, which should be fine for testing.

If you want to change settings, e.g. the location of the input data archive or the region mapping that should be used for aggregation, you can use the function setConfig():

# change path to data archive and regionmapping
setConfig(mainfolder="my/own/folder", regionmapping="regionmappingCustom.csv")

Main steps and structure of the code

The conversion from input sources to model inputs is split into two parts:
  1. Read-in and conversion of source files to ISO countries and
  2. Calculation of model inputs in ISO country resolution and aggregation to the model regions.

structure

The R functions required for input data preparation and their hierarchy are shown in the figure and explained step by step below. There are two types of functions. The already existing wrapper functions (depicted in blue) contain generic code that is common to the processing steps of all input data; you do not need to modify them. The new functions you provide (indicated by a leading dot in their name) contain code that differs across input sources and cannot be generalized. All functions are stored in the moinput library. The arrows indicate which function calls which. Example code for the relevant functions is shown next to them in the figure. Please note: !!! Never call your new functions directly! Use only the wrapper functions to call your new functions !!! (see the examples in the figure).

Reading a source (readSource.R)

Reading a source is done with the function readSource(type, subtype=NULL, convert=TRUE), where type is the name of the source (e.g. "FAO") and subtype is a subcategory of the source, if the source has any (e.g. "Crop" for source "FAO"). If the argument convert is set to FALSE, the data is only read into R and converted to a magclass object. If convert is set to TRUE, the data is additionally aggregated or disaggregated spatially to a standardized ISO country level.

readSource itself is only a wrapper for source-specific functions that do the actual calculations. If, for instance, the source "FAO" with subtype "Crop" is to be read, readSource("FAO","Crop") executes .readFAO("Crop"), which is written specifically for the FAO database. After executing .readFAO("Crop"), readSource performs some consistency checks, e.g. it makes sure that the data is returned as a magclass object.
For the conversion to the ISO country level another source-specific function is executed: .convertFAO(x). Again, consistency checks make sure that the source-specific function provides the expected output.
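
The wrapper pattern described above can be illustrated with a small base R sketch. It is not the moinput implementation: readSourceSketch, the source name "Demo" and its two functions are hypothetical, and plain data frames stand in for magclass objects so that the example runs without any additional package.

```r
# Toy source-specific functions for a hypothetical source "Demo".
# Data frames are used instead of magclass objects (assumption for this sketch).
.readDemo <- function(subtype = NULL) {
  data.frame(region = c("EU", "USA"), value = c(10, 20), stringsAsFactors = FALSE)
}

.convertDemo <- function(x) {
  # Disaggregate the toy region "EU" equally to two ISO countries
  eu <- x[x$region == "EU", ]
  rbind(
    data.frame(region = c("DEU", "FRA"), value = eu$value / 2, stringsAsFactors = FALSE),
    x[x$region != "EU", ]
  )
}

# Simplified wrapper: find the source-specific functions by name,
# call them and check what they return.
readSourceSketch <- function(type, subtype = NULL, convert = TRUE) {
  readFun <- get(paste0(".read", type), mode = "function")
  x <- if (is.null(subtype)) readFun() else readFun(subtype)
  stopifnot(is.data.frame(x))    # consistency check after reading
  if (convert) {
    convertFun <- get(paste0(".convert", type), mode = "function")
    x <- convertFun(x)
    stopifnot(is.data.frame(x))  # consistency check after converting
  }
  x
}

raw <- readSourceSketch("Demo", convert = FALSE)  # data as read, no conversion
iso <- readSourceSketch("Demo")                   # data on toy ISO country level
```

Because the wrapper builds the function name from type, the caller only ever uses the wrapper, which is why the source-specific functions should not be called directly.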

If you want to introduce a new source to the library you have to do the following steps:

  1. Create a subfolder in the data source folder with the name of the source and put all input files into this folder
  2. Write a source specific read-function (e.g. ".readBlablub" if your source has the name "Blablub")
  3. Write a source specific convert-function (e.g. ".convertBlablub")

To access the required input files within the functions, just use the file names, since both functions (read... and convert...) are executed from within the source data folder.

After these three steps you get the converted source data by the following commands:

library(moinput)
a <- readSource("Blablub")

Calculation of model inputs (calcOutput.R)

Calculating the data that is input to the model works in the same way as reading the source files. Again there is a wrapper function, calcOutput(type), which calls specific sub-functions provided by the user (e.g. .calcBlabla) and performs consistency checks. In addition, calcOutput aggregates the data to the world regions defined in the config file. If you do not want to aggregate the data, set the aggregate argument to FALSE: calcOutput("Blabla",aggregate=FALSE).

Please be aware that there is no 1:1 relation between source files and output files: one source file might result in more than one output file, and some outputs might be based on several source files!

Adding a new model input is done by writing your own calc function, e.g. for the data set "Blabla" write the function .calcBlabla, which should then look something like this:

.calcBlabla <- function() {
  a <- readSource("Blablub")
  # do some fancy calculations here that create the data (x)
  # and the aggregation weight (weight) from a
  return(list(x=x,weight=weight))
}

In contrast to the read and convert functions, the calc functions have to return a list of two magclass objects: the data itself (x) and a weight (weight) for aggregating the ISO countries to model regions. Typically, the weight has the same dimensionality as x, but there are some exceptions (e.g. if the data is summed up instead of taking a weighted average, or if the same weight is applied to more than one column of x). A full overview of the allowed weight formats can be found in the help file of the speed_aggregate function in the luscale library (the function that does the regional aggregation).

library(luscale)
?speed_aggregate
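
To make the role of the weight more concrete, here is a minimal base R sketch of the two aggregation modes mentioned above (weighted average and plain sum). It does not use speed_aggregate and makes no claim about its interface: the function aggregateSketch, the toy mapping and all numbers are made up for illustration, and using weight = NULL to request a sum is an assumption of this sketch only.

```r
# Toy mapping from ISO countries to two regions (hypothetical values)
mapping <- data.frame(
  country = c("DEU", "FRA", "USA", "CAN"),
  region  = c("EUR", "EUR", "NAM", "NAM"),
  stringsAsFactors = FALSE
)

x      <- c(DEU = 2, FRA = 4, USA = 10, CAN = 20)  # country data, e.g. a price
weight <- c(DEU = 3, FRA = 1, USA = 1,  CAN = 1)   # country weights, e.g. population

aggregateSketch <- function(x, mapping, weight = NULL) {
  # region of each country in x
  region <- mapping$region[match(names(x), mapping$country)]
  if (is.null(weight)) {
    # no weight given: regional value is the sum over the countries
    return(tapply(x, region, sum))
  }
  # weight given: regional value is the weighted average over the countries
  w <- weight[names(x)]
  tapply(x * w, region, sum) / tapply(w, region, sum)
}

weighted <- aggregateSketch(x, mapping, weight)  # weighted averages per region
summed   <- aggregateSketch(x, mapping)          # plain sums per region
```

With these toy numbers the weighted average for EUR is (2*3 + 4*1) / (3 + 1) = 2.5, while the plain sum is 6, which shows why the choice of weight matters for quantities that must not simply be added up.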

As shown in the example above, you can access a source directly with readSource in the calculation function. Since many calculation functions will probably access the same source later on, executing the whole read process every time would be time consuming. Therefore, the cache folder defined in the config (cachefolder) stores intermediate results, so that the source data has to be read only once.
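
The caching idea can be sketched in a few lines of base R. This is only an illustration of the principle, not the moinput code: slowRead, cachedRead, the file naming and the use of a temporary folder are assumptions made for this example.

```r
# Stand-in for the configured cachefolder (a fresh temporary folder here)
cacheFolder <- file.path(tempdir(), "cache")
unlink(cacheFolder, recursive = TRUE)
dir.create(cacheFolder, showWarnings = FALSE, recursive = TRUE)

readCount <- 0  # counts how often the expensive read is actually executed

# Hypothetical expensive read function
slowRead <- function(type) {
  readCount <<- readCount + 1
  data.frame(source = type, value = 42)
}

# Read via the cache: reuse a stored result if one exists, otherwise read and store
cachedRead <- function(type, enablecache = TRUE) {
  cacheFile <- file.path(cacheFolder, paste0("read", type, ".rds"))
  if (enablecache && file.exists(cacheFile)) {
    return(readRDS(cacheFile))
  }
  x <- slowRead(type)
  saveRDS(x, cacheFile)
  x
}

first  <- cachedRead("Blablub")  # executes slowRead and writes the cache file
second <- cachedRead("Blablub")  # served from the cache file
```

After both calls readCount is 1, because the second call finds the stored result and skips the read.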

Unlike the read and convert functions, the calculation functions can have custom arguments if the calculations require them. .calcPopulation, for instance, has the custom arguments PopulationCalib, PopulationPast and PopulationFuture.

Full preprocessing for REMIND and MAgPIE

Use the fullMAgPIE() or fullREMIND() function to preprocess all inputs needed for a specific model:

library(moinput)
fullMAgPIE()

The fullMAgPIE and fullREMIND functions call the calcOutput functions, which call the readSource functions.

After the model input has been generated, it is stored in the output subfolder of your source data folder. The function copyInput copies the input data into the right folders of the model. The full coordination between the models and moinput is handled by retrieveInput. With this function you specify the model, the regionmapping to be used and the location of the model to be modified, and the function does the rest (it makes sure that the required data is calculated and copies the data to the right locations).

The current database is stored at /data/rd3/inputdata

Step-by-step ToDo list for adding new input data routines to moinput library:

  1. Check whether or not the necessary source data is already included in the moinput library. If yes, skip next step.
  2. Store new data sources for "NewSource" in /p/projects/rd3mod/inputdata on the cluster (for testing purposes, it might also be helpful to reproduce this structure locally).
  3. In the moinput library, create the corresponding readNewSource.R and convertNewSource.R files as well as the corresponding .Rd help files (it is often useful to start from a copy of an existing example, but then make sure to replace everything, especially all instances of OldSource with NewSource)
  4. Create a calcNewInputParameter.R script that calculates the new model input from the source data, plus the corresponding .Rd help file.
  5. Add a new calcOutput("NewInputParameter") line to the fullREMIND.R function.
  6. Check the library
  7. Re-build the library via command line (Rscript buildLibrary.R moinput)
  8. Submit to R-library

Note: it is very useful to open the moinput.Rproj project in RStudio while working on the new functions. After "build and reload" you can test whether the new functions work as desired. Note the naming convention for calling the functions:
readSource("NewSource") calls .readNewSource, and calcOutput("NewInputParameter") calls .calcNewInputParameter.