Skip to contents

Introduction

IEATools is an R package that provides functions to analyze extended energy balance data from the International Energy Agency (IEA). The IEA’s data are not free. But if you have access to the data, you can follow the instructions in this vignette to analyze it.

Getting started

To get started, data must be obtained and pulled into an R data frame.

Obtain IEA data

Purchase the IEA extended energy balances product from (https://data.iea.org). Download the complete extended energy balance data for at least one country in a .csv file format as shown in the following figure.

IEA extended energy balance data format.

Example data from two countries, Ghana (GHA) and South Africa (ZAF), for two years, 1971 and 2000, are provided in the IEATools package.

library(dplyr)
library(IEATools)
# Define the file location
IEA_path <- file.path("extdata", "GH-ZA-ktoe-Extended-Energy-Balances-sample.csv") %>% 
  system.file(package = "IEATools")
# Or use the function
IEA_path <- sample_iea_data_path()
readChar(IEA_path, nchars = 256)
#> [1] "COUNTRY,FLOW,PRODUCT,1971,2000\nGhana,Production,Hard coal (if no detail),0,x\nGhana,Production,Brown coal (if no detail),0,x\nGhana,Production,Anthracite,..,0\nGhana,Production,Coking coal,..,0\nGhana,Production,Other bituminous coal,..,0\nGhana,Production,Sub-"

Check the integrity of the IEA data file (iea_file_OK())

When starting to work with an IEA data file, it is important to verify its integrity. iea_file_OK() performs This function performs some validation tests on its argument. Specifically, this function confirms that every country has the same FLOW and PRODUCT rows in the same order. Note that the IEA data file is read internally with data.table::fread() without stripping white space. Also, if the IEA data file passes the integrity checks, a data frame representing the file (without modification) is returned.

IEA_path %>% 
  iea_file_OK() %>% 
  glimpse()
#>  logi TRUE

Convert the IEA data into an R data frame (iea_df())

Extended energy balance data can be pulled directly into an R data frame by the iea_df() function. iea_df() fixes the header shown in the figure above, converting it to a single row. Furthermore, iea_df() deals with the IEA’s indicators for not applicable values (“x”), for unavailable values (“..”), and for confidential values (“c”). (See “World Energy Balances: Database Documentation (2018 edition)” at (http://wds.iea.org/wds/pdf/worldbal_documentation.pdf).)

R has three concepts that could be used for “x” and “..”: 0 would indicate value known to be zero. NULL would indicate an undefined value. NA would indicate a value that is unavailable.

In theory, mapping from the IEA’s indicators to R should be done as follows: “..” (unavailable) and “c” (confidential) would be converted to NA in R. “x” (not applicable) would be converted to 0 in R. “NULL” would not be used. However, the IEA are not consistent with their coding. In some places “..” (unavailable) is used for not applicable values, e.g. World Anthracite supply in 1971. (World Anthracite supply in 1971 is actually not applicable (“x”), because Anthracite was classified under “Hard coal (if no detail)” in 1971.) On the other hand, “..” is used correctly for data in the most recent year when those data have not yet been incorporated into the database.

In the face of IEA’s inconsistencies, the only rational way to proceed is to convert “x”, “..”, and “c” to “0”. iea_df() performs that task.

IEA_data <- IEA_path %>% 
  iea_df()
IEA_data %>% 
  glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ COUNTRY <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana",…
#> $ FLOW    <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ PRODUCT <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Next steps

There are several reasons why the IEA extended energy balance data are not convenient for immediate use.

  1. Column titles are SHOUTING.
  2. The COUNTRY column contains full (long) country names. It would be nicer if countries were identified by their 2- or 3-letter ISO codes.
  3. The demarcation between the supply and consumption sides of the ledger is unclear to new users.
  4. How data should be aggregated is unclear to new users.
  5. The data are not in tidy format.

The IEATools package has functions to address each of these issues.

Fix column titles (rename_iea_df_cols())

To unshout the column titles, use the rename_iea_df_cols() function.

IEA_data %>% 
  rename_iea_df_cols() %>% 
  glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ Country <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana",…
#> $ Flow    <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ Product <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Note that the example above uses the rename_iea_df_cols() function without arguments, because the default arguments are correct for the extended energy balance data as it arrives from the IEA. However, both the old and new names can be specified as arguments. If you despise both capitals letters and vowels, you could do the following.

IEA_data %>% 
  rename_iea_df_cols(new_country = "cntry", new_flow = "flw", new_product = "prdct")  %>% 
  glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ cntry  <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", …
#> $ flw    <chr> "Production", "Production", "Production", "Production", "Produc…
#> $ prdct  <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anthr…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The default arguments are consistent throughout the IEATools package. Thus, it is recommended that you use the default argument values, wherever possible.

Change to 3-letter ISO country abbreviations (use_iso_countries())

To reduce the length of character strings representing countries, the use_iso_countries() function replaces country character strings with each country’s 3-letter abbreviation.

IEA_data %>% 
  rename_iea_df_cols() %>% 
  use_iso_countries() %>% 
  glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ Country <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA",…
#> $ Flow    <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ Product <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Remove aggregation and memo data (remove_agg_memo_flows())

The IEA’s extended energy balances contain many memos and aggregations in the data itself. For the purposes of manipulations and calculations, it is often advisable to remove all memos and aggregations. The remove_agg_memo_flows() function performs that task.

IEA_data %>% 
  rename_iea_df_cols() %>% 
  filter(Flow == "Total primary energy supply") %>% 
  glimpse()
#> Rows: 0
#> Columns: 5
#> $ Country <chr> 
#> $ Flow    <chr> 
#> $ Product <chr> 
#> $ `1971`  <dbl> 
#> $ `2000`  <dbl>
# Total primary energy supply is an aggregation row,
# so its rows should be absent after calling remove_agg_memo_flows().
IEA_data %>% 
  rename_iea_df_cols() %>% 
  remove_agg_memo_flows() %>% 
  filter(Flow == "Total primary energy supply")
#> [1] Country Flow    Product 1971    2000   
#> <0 rows> (or 0-length row.names)

Augment with ledger side and aggregation point information (augment_iea_df())

The IEA’s extended energy balance data can be confusing for first-time users to understand. In particular, both

  1. the demarcation between the supply and consumption sides of the ledger and
  2. how data should be aggregated

are unclear.

To clarify these issues and make later aggregation calculations possible, call the augment_iea_data() function. augment_iea_df() has many default arguments that work fine for as-delivered IEA extended energy balance data in kilotons of oil equivalent units. augment_iea_df() adds new columns Ledger.side, Flow.aggregation.point, Energy.type, and Unit.

IEA_data %>% 
  rename_iea_df_cols() %>% 
  augment_iea_df() %>% 
  glimpse()
#> Rows: 14,688
#> Columns: 11
#> $ Country              <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Gha…
#> $ Method               <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", …
#> $ EnergyType           <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",…
#> $ LastStage            <chr> "Final", "Final", "Final", "Final", "Final", "Fin…
#> $ LedgerSide           <chr> "Supply", "Supply", "Supply", "Supply", "Supply",…
#> $ FlowAggregationPoint <chr> "Total primary energy supply", "Total primary ene…
#> $ Flow                 <chr> "Production", "Production", "Production", "Produc…
#> $ Product              <chr> "Hard coal (if no detail)", "Brown coal (if no de…
#> $ Unit                 <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "…
#> $ `1971`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ `2000`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Convert to a tidy data frame (tidy_iea_df())

The tidy format is a data frame where each datum is located on its own row and columns provide metadata for each datum. As delivered, the IEA’s data are not tidy: years are spread to the right rather than being a single column. The tidy_iea_df() function converts to a tidy format. By default, tidy_iea_df() removes zeroes from the data frame, thereby reducing memory footprint.

Tidy_IEA_df <- IEA_data %>% 
  rename_iea_df_cols() %>% 
  use_iso_countries() %>% 
  remove_agg_memo_flows() %>% 
  augment_iea_df() %>% 
  tidy_iea_df()
Tidy_IEA_df %>% 
  glimpse()
#> Rows: 403
#> Columns: 11
#> $ Country              <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", …
#> $ Method               <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", …
#> $ EnergyType           <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",…
#> $ LastStage            <chr> "Final", "Final", "Final", "Final", "Final", "Fin…
#> $ Year                 <dbl> 1971, 2000, 1971, 2000, 1971, 2000, 2000, 2000, 1…
#> $ LedgerSide           <chr> "Supply", "Supply", "Supply", "Supply", "Supply",…
#> $ FlowAggregationPoint <chr> "Total primary energy supply", "Total primary ene…
#> $ Flow                 <chr> "Production", "Production", "Production", "Produc…
#> $ Product              <chr> "Primary solid biofuels", "Primary solid biofuels…
#> $ Unit                 <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "…
#> $ Edot                 <dbl> 87399.9985, 162908.9993, 10472.4010, 23792.3995, …

The load_tidy_iea_df() function

For simplicity, all steps above are rolled into load_tidy_iea_df(). By default, load_tidy_iea_df() loads the sample data bundled with the package and converts it to a tidy IEA data frame, assuming default arguments for all functions. But you can supply the path to any IEA data file in the .iea_file argument to load_tidy_iea_df().

At this point, the integrity of the IEA data can be checked.

Check energy balances

The IEA extended energy balance data are usually close to balanced, but not quite. To check the integrity of the energy balances, use the calc_tidy_iea_df_balances() function. calc_tidy_iea_df_balances() adds several new columns to the tidy IEA data frame, including:

  • supply_sum (the sum of all flows on the supply side of the ledger),
  • consumption_sum (the sum of all flows on the consumption side of the ledger),
  • supply_minus_consumption (the difference between supply and consumption),
  • balance_OK (a logical telling whether the energy balance is acceptable), and
  • err (supply sum when consumption_sum is NA or supply_minus_consumption when consumption_sum is not NA).
Balances <- Tidy_IEA_df %>% 
  calc_tidy_iea_df_balances()
Balances %>% 
  glimpse()
#> Rows: 78
#> Columns: 12
#> $ Country                  <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GH…
#> $ Method                   <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PC…
#> $ EnergyType               <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", …
#> $ LastStage                <chr> "Final", "Final", "Final", "Final", "Final", …
#> $ Year                     <dbl> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 197…
#> $ Product                  <chr> "Aviation gasoline", "Charcoal", "Crude oil",…
#> $ Unit                     <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ…
#> $ supply_sum               <dbl> 0.0000, 4989.5980, -0.0001, 9845.9971, 3979.8…
#> $ consumption_sum          <dbl> NA, 4989.5980, NA, 9846.0055, 3979.8004, 9136…
#> $ supply_minus_consumption <dbl> NA, 0.000000e+00, NA, -8.400000e-03, -1.00000…
#> $ balance_OK               <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,…
#> $ err                      <dbl> 0.000000e+00, 0.000000e+00, -1.000000e-04, -8…
Balances %>% 
  tidy_iea_df_balanced()
#> [1] FALSE

Notice that the IEA data are not quite energy balanced. To fix the energy balance, use the fix_tidy_iea_df_balances() function. fix_tidy_iea_df_balances() adjusts the Statistical differences flow to achieve perfect energy balance.

# Fix product-level balances within each country
Tidy_IEA_df %>% 
  fix_tidy_iea_df_balances() %>% 
  calc_tidy_iea_df_balances() %>% 
  glimpse()
#> Rows: 78
#> Columns: 12
#> $ Country                  <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GH…
#> $ Method                   <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PC…
#> $ EnergyType               <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", …
#> $ LastStage                <chr> "Final", "Final", "Final", "Final", "Final", …
#> $ Year                     <dbl> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 197…
#> $ Product                  <chr> "Aviation gasoline", "Charcoal", "Crude oil",…
#> $ Unit                     <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ…
#> $ supply_sum               <dbl> 0.0000, 4989.5980, 0.0000, 9846.0055, 3979.80…
#> $ consumption_sum          <dbl> NA, 4989.5980, NA, 9846.0055, 3979.8004, 9136…
#> $ supply_minus_consumption <dbl> NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, 0, 0, 0, NA…
#> $ balance_OK               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ err                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Sorting

When manipulating IEA data frames, rows may attain different sorting relative to the standard IEA order. The sort_iea_df() function can return a data frame to IEA sort order.

  Tidy_IEA_df <- load_tidy_iea_df()
  Tidy_IEA_df %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#>     Year Flow       Product                             Edot
#>    <dbl> <chr>      <chr>                              <dbl>
#>  1  1971 Production Primary solid biofuels           87400. 
#>  2  2000 Production Primary solid biofuels          162909. 
#>  3  1971 Production Hydro                            10472. 
#>  4  2000 Production Hydro                            23792. 
#>  5  1971 Imports    Crude oil                        38360. 
#>  6  2000 Imports    Crude oil                        54769. 
#>  7  2000 Imports    Liquefied petroleum gases (LPG)   1655. 
#>  8  2000 Imports    Motor gasoline excl. biofuels    18010. 
#>  9  1971 Imports    Aviation gasoline                   44.8
#> 10  2000 Imports    Aviation gasoline                  224. 
#> # ℹ 393 more rows
  # Move the first row to the bottom to put everything out of order
  Unsorted <- Tidy_IEA_df[-1, ] %>% dplyr::bind_rows(Tidy_IEA_df[1, ])
  Unsorted %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#>     Year Flow       Product                                   Edot
#>    <dbl> <chr>      <chr>                                    <dbl>
#>  1  2000 Production Primary solid biofuels                162909. 
#>  2  1971 Production Hydro                                  10472. 
#>  3  2000 Production Hydro                                  23792. 
#>  4  1971 Imports    Crude oil                              38360. 
#>  5  2000 Imports    Crude oil                              54769. 
#>  6  2000 Imports    Liquefied petroleum gases (LPG)         1655. 
#>  7  2000 Imports    Motor gasoline excl. biofuels          18010. 
#>  8  1971 Imports    Aviation gasoline                         44.8
#>  9  2000 Imports    Aviation gasoline                        224. 
#> 10  1971 Imports    Kerosene type jet fuel excl. biofuels    892. 
#> # ℹ 393 more rows
  # Now sort it to put everything right again
  Sorted <- sort_iea_df(Unsorted)
  Sorted %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#>     Year Flow       Product                                  Edot
#>    <dbl> <chr>      <chr>                                   <dbl>
#>  1  1971 Production Primary solid biofuels                87400. 
#>  2  1971 Production Hydro                                 10472. 
#>  3  1971 Imports    Crude oil                             38360. 
#>  4  1971 Imports    Aviation gasoline                        44.8
#>  5  1971 Imports    Kerosene type jet fuel excl. biofuels   892. 
#>  6  1971 Imports    Other kerosene                           43.8
#>  7  1971 Imports    Lubricants                              756. 
#>  8  1971 Exports    Motor gasoline excl. biofuels          -179. 
#>  9  1971 Exports    Other kerosene                         -175. 
#> 10  1971 Exports    Fuel oil                              -7437. 
#> # ℹ 393 more rows

sort_iea_df() also works with wide data frames in which years are spread to the right.

  # Pivoting the data results in unconventional ordering
  Unsorted_wide <- Tidy_IEA_df %>% tidyr::pivot_wider(names_from = Year, values_from = Edot)
  Unsorted_wide %>% dplyr::select(Flow, Product, `1971`, `2000`)
#> # A tibble: 294 × 4
#>    Flow       Product                                `1971`  `2000`
#>    <chr>      <chr>                                   <dbl>   <dbl>
#>  1 Production Primary solid biofuels                87400.  162909.
#>  2 Production Hydro                                 10472.   23792.
#>  3 Imports    Crude oil                             38360.   54769.
#>  4 Imports    Liquefied petroleum gases (LPG)          NA     1655.
#>  5 Imports    Motor gasoline excl. biofuels            NA    18010.
#>  6 Imports    Aviation gasoline                        44.8    224.
#>  7 Imports    Kerosene type jet fuel excl. biofuels   892.    1338.
#>  8 Imports    Other kerosene                           43.8     NA 
#>  9 Imports    Gas/diesel oil excl. biofuels            NA    17104.
#> 10 Imports    Lubricants                              756.    1218.
#> # ℹ 284 more rows
  # The wide data frame is not sorted correctly. Sort it.
  Sorted_wide <- sort_iea_df(Unsorted_wide)
  Sorted_wide %>% dplyr::select(Flow, Product, `1971`, `2000`)
#> # A tibble: 294 × 4
#>    Flow       Product                                `1971`  `2000`
#>    <chr>      <chr>                                   <dbl>   <dbl>
#>  1 Production Primary solid biofuels                87400.  162909.
#>  2 Production Hydro                                 10472.   23792.
#>  3 Imports    Crude oil                             38360.   54769.
#>  4 Imports    Liquefied petroleum gases (LPG)          NA     1655.
#>  5 Imports    Motor gasoline excl. biofuels            NA    18010.
#>  6 Imports    Aviation gasoline                        44.8    224.
#>  7 Imports    Kerosene type jet fuel excl. biofuels   892.    1338.
#>  8 Imports    Other kerosene                           43.8     NA 
#>  9 Imports    Gas/diesel oil excl. biofuels            NA    17104.
#> 10 Imports    Lubricants                              756.    1218.
#> # ℹ 284 more rows

Conclusion

At this point, IEA extended energy balance data have been imported to R and are ready to be used. The next step could be to specify some of the IEA energy flows. See the specify vignette for details.