IEATools
Matthew Kuperus Heun
2024-12-09
IEATools.Rmd
Introduction
IEATools
is an R
package that provides
functions to analyze extended energy balance data from the International Energy Agency (IEA). The
IEA’s data are not free. But if you have access to the data, you can
follow the instructions in this vignette to analyze it.
Getting started
To get started, data must be obtained and pulled into an R data frame.
Obtain IEA data
Purchase the IEA extended energy balances product from (https://data.iea.org). Download the complete extended energy balance data for at least one country in a .csv file format as shown in the following figure.
Example data from two countries, Ghana (GHA) and South Africa (ZAF),
for two years, 1971 and 2000, are provided in the IEATools
package.
library(dplyr)
library(IEATools)
# Define the file location
IEA_path <- file.path("extdata", "GH-ZA-ktoe-Extended-Energy-Balances-sample.csv") %>%
system.file(package = "IEATools")
# Or use the function
IEA_path <- sample_iea_data_path()
readChar(IEA_path, nchars = 256)
#> [1] "COUNTRY,FLOW,PRODUCT,1971,2000\nGhana,Production,Hard coal (if no detail),0,x\nGhana,Production,Brown coal (if no detail),0,x\nGhana,Production,Anthracite,..,0\nGhana,Production,Coking coal,..,0\nGhana,Production,Other bituminous coal,..,0\nGhana,Production,Sub-"
Check the integrity of the IEA data file
(iea_file_OK()
)
When starting to work with an IEA data file, it is important to
verify its integrity. iea_file_OK()
performs This function
performs some validation tests on its argument. Specifically, this
function confirms that every country has the same FLOW
and
PRODUCT
rows in the same order. Note that the IEA data file
is read internally with data.table::fread()
without stripping white space. Also, if the IEA data file
passes the integrity checks, a data frame representing the file (without
modification) is returned.
IEA_path %>%
iea_file_OK() %>%
glimpse()
#> logi TRUE
Convert the IEA data into an R data frame
(iea_df()
)
Extended energy balance data can be pulled directly into an
R
data frame by the iea_df()
function.
iea_df()
fixes the header shown in the figure above,
converting it to a single row. Furthermore, iea_df()
deals
with the IEA’s indicators for not applicable values (“x
”),
for unavailable values (“..
”), and for confidential values
(“c
”). (See “World Energy Balances: Database Documentation
(2018 edition)” at (http://wds.iea.org/wds/pdf/worldbal_documentation.pdf).)
R
has three concepts that could be used for
“x
” and “..
”: 0
would indicate
value known to be zero. NULL
would indicate an undefined
value. NA
would indicate a value that is unavailable.
In theory, mapping from the IEA’s indicators to R
should
be done as follows: “..
” (unavailable) and “c
”
(confidential) would be converted to NA
in R
.
“x
” (not applicable) would be converted to 0
in R
. “NULL
” would not be used. However, the
IEA are not consistent with their coding. In some places
“..
” (unavailable) is used for not applicable values,
e.g. World Anthracite supply in 1971. (World Anthracite supply in 1971
is actually not applicable (“x
”), because Anthracite was
classified under “Hard coal (if no detail)” in 1971.) On the other hand,
“..
” is used correctly for data in the most recent year
when those data have not yet been incorporated into the database.
In the face of IEA’s inconsistencies, the only rational way to
proceed is to convert “x
”, “..
”, and
“c
” to “0
”. iea_df()
performs
that task.
IEA_data <- IEA_path %>%
iea_df()
IEA_data %>%
glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ COUNTRY <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana",…
#> $ FLOW <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ PRODUCT <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Next steps
There are several reasons why the IEA extended energy balance data are not convenient for immediate use.
- Column titles are SHOUTING.
- The
COUNTRY
column contains full (long) country names. It would be nicer if countries were identified by their 2- or 3-letter ISO codes. - The demarcation between the supply and consumption sides of the ledger is unclear to new users.
- How data should be aggregated is unclear to new users.
- The data are not in tidy format.
The IEATools
package has functions to address each of
these issues.
Fix column titles (rename_iea_df_cols()
)
To unshout the column titles, use the
rename_iea_df_cols()
function.
IEA_data %>%
rename_iea_df_cols() %>%
glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ Country <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana",…
#> $ Flow <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ Product <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Note that the example above uses the
rename_iea_df_cols()
function without arguments, because
the default arguments are correct for the extended energy balance data
as it arrives from the IEA. However, both the old and new names can be
specified as arguments. If you despise both capitals letters and vowels,
you could do the following.
IEA_data %>%
rename_iea_df_cols(new_country = "cntry", new_flow = "flw", new_product = "prdct") %>%
glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ cntry <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", …
#> $ flw <chr> "Production", "Production", "Production", "Production", "Produc…
#> $ prdct <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anthr…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
The default arguments are consistent throughout the
IEATools
package. Thus, it is recommended that you use the
default argument values, wherever possible.
Change to 3-letter ISO country abbreviations
(use_iso_countries()
)
To reduce the length of character strings representing countries, the
use_iso_countries()
function replaces country character
strings with each country’s 3-letter abbreviation.
IEA_data %>%
rename_iea_df_cols() %>%
use_iso_countries() %>%
glimpse()
#> Rows: 14,688
#> Columns: 5
#> $ Country <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA",…
#> $ Flow <chr> "Production", "Production", "Production", "Production", "Produ…
#> $ Product <chr> "Hard coal (if no detail)", "Brown coal (if no detail)", "Anth…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Remove aggregation and memo data
(remove_agg_memo_flows()
)
The IEA’s extended energy balances contain many memos and
aggregations in the data itself. For the purposes of manipulations and
calculations, it is often advisable to remove all memos and
aggregations. The remove_agg_memo_flows()
function performs
that task.
IEA_data %>%
rename_iea_df_cols() %>%
filter(Flow == "Total primary energy supply") %>%
glimpse()
#> Rows: 0
#> Columns: 5
#> $ Country <chr>
#> $ Flow <chr>
#> $ Product <chr>
#> $ `1971` <dbl>
#> $ `2000` <dbl>
# Total primary energy supply is an aggregation row,
# so its rows should be absent after calling remove_agg_memo_flows().
IEA_data %>%
rename_iea_df_cols() %>%
remove_agg_memo_flows() %>%
filter(Flow == "Total primary energy supply")
#> [1] Country Flow Product 1971 2000
#> <0 rows> (or 0-length row.names)
Augment with ledger side and aggregation point information
(augment_iea_df()
)
The IEA’s extended energy balance data can be confusing for first-time users to understand. In particular, both
- the demarcation between the supply and consumption sides of the ledger and
- how data should be aggregated
are unclear.
To clarify these issues and make later aggregation calculations
possible, call the augment_iea_data()
function.
augment_iea_df()
has many default arguments that work fine
for as-delivered IEA extended energy balance data in kilotons of oil
equivalent units. augment_iea_df()
adds new columns
Ledger.side
, Flow.aggregation.point
,
Energy.type
, and Unit
.
IEA_data %>%
rename_iea_df_cols() %>%
augment_iea_df() %>%
glimpse()
#> Rows: 14,688
#> Columns: 11
#> $ Country <chr> "Ghana", "Ghana", "Ghana", "Ghana", "Ghana", "Gha…
#> $ Method <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", …
#> $ EnergyType <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",…
#> $ LastStage <chr> "Final", "Final", "Final", "Final", "Final", "Fin…
#> $ LedgerSide <chr> "Supply", "Supply", "Supply", "Supply", "Supply",…
#> $ FlowAggregationPoint <chr> "Total primary energy supply", "Total primary ene…
#> $ Flow <chr> "Production", "Production", "Production", "Produc…
#> $ Product <chr> "Hard coal (if no detail)", "Brown coal (if no de…
#> $ Unit <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "…
#> $ `1971` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ `2000` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Convert to a tidy data frame (tidy_iea_df()
)
The tidy format
is a data frame where each datum is located on its own row and columns
provide metadata for each datum. As delivered, the IEA’s data are not
tidy: years are spread to the right rather than being a single column.
The tidy_iea_df()
function converts to a tidy format. By
default, tidy_iea_df()
removes zeroes from the data frame,
thereby reducing memory footprint.
Tidy_IEA_df <- IEA_data %>%
rename_iea_df_cols() %>%
use_iso_countries() %>%
remove_agg_memo_flows() %>%
augment_iea_df() %>%
tidy_iea_df()
Tidy_IEA_df %>%
glimpse()
#> Rows: 403
#> Columns: 11
#> $ Country <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", …
#> $ Method <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", …
#> $ EnergyType <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",…
#> $ LastStage <chr> "Final", "Final", "Final", "Final", "Final", "Fin…
#> $ Year <dbl> 1971, 2000, 1971, 2000, 1971, 2000, 2000, 2000, 1…
#> $ LedgerSide <chr> "Supply", "Supply", "Supply", "Supply", "Supply",…
#> $ FlowAggregationPoint <chr> "Total primary energy supply", "Total primary ene…
#> $ Flow <chr> "Production", "Production", "Production", "Produc…
#> $ Product <chr> "Primary solid biofuels", "Primary solid biofuels…
#> $ Unit <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "…
#> $ Edot <dbl> 87399.9985, 162908.9993, 10472.4010, 23792.3995, …
The load_tidy_iea_df()
function
For simplicity, all steps above are rolled into
load_tidy_iea_df()
. By default,
load_tidy_iea_df()
loads the sample data bundled with the
package and converts it to a tidy IEA data frame, assuming default
arguments for all functions. But you can supply the path to any IEA data
file in the .iea_file
argument to
load_tidy_iea_df()
.
Simple <- load_tidy_iea_df()
Complicated <- sample_iea_data_path() %>%
iea_df() %>%
rename_iea_df_cols() %>%
remove_agg_memo_flows() %>%
use_iso_countries() %>%
augment_iea_df() %>%
tidy_iea_df()
all(Simple == Complicated)
#> [1] TRUE
At this point, the integrity of the IEA data can be checked.
Check energy balances
The IEA extended energy balance data are usually close to balanced,
but not quite. To check the integrity of the energy balances, use the
calc_tidy_iea_df_balances()
function.
calc_tidy_iea_df_balances()
adds several new columns to the
tidy IEA data frame, including:
-
supply_sum
(the sum of all flows on the supply side of the ledger), -
consumption_sum
(the sum of all flows on the consumption side of the ledger), -
supply_minus_consumption
(the difference between supply and consumption), -
balance_OK
(a logical telling whether the energy balance is acceptable), and -
err
(supply sum whenconsumption_sum
isNA
orsupply_minus_consumption
whenconsumption_sum
is notNA
).
Balances <- Tidy_IEA_df %>%
calc_tidy_iea_df_balances()
Balances %>%
glimpse()
#> Rows: 78
#> Columns: 12
#> $ Country <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GH…
#> $ Method <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PC…
#> $ EnergyType <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", …
#> $ LastStage <chr> "Final", "Final", "Final", "Final", "Final", …
#> $ Year <dbl> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 197…
#> $ Product <chr> "Aviation gasoline", "Charcoal", "Crude oil",…
#> $ Unit <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ…
#> $ supply_sum <dbl> 0.0000, 4989.5980, -0.0001, 9845.9971, 3979.8…
#> $ consumption_sum <dbl> NA, 4989.5980, NA, 9846.0055, 3979.8004, 9136…
#> $ supply_minus_consumption <dbl> NA, 0.000000e+00, NA, -8.400000e-03, -1.00000…
#> $ balance_OK <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,…
#> $ err <dbl> 0.000000e+00, 0.000000e+00, -1.000000e-04, -8…
Balances %>%
tidy_iea_df_balanced()
#> [1] FALSE
Notice that the IEA data are not quite energy balanced. To fix the
energy balance, use the fix_tidy_iea_df_balances()
function. fix_tidy_iea_df_balances()
adjusts the
Statistical differences
flow to achieve perfect energy
balance.
# Fix product-level balances within each country
Tidy_IEA_df %>%
fix_tidy_iea_df_balances() %>%
calc_tidy_iea_df_balances() %>%
glimpse()
#> Rows: 78
#> Columns: 12
#> $ Country <chr> "GHA", "GHA", "GHA", "GHA", "GHA", "GHA", "GH…
#> $ Method <chr> "PCM", "PCM", "PCM", "PCM", "PCM", "PCM", "PC…
#> $ EnergyType <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", …
#> $ LastStage <chr> "Final", "Final", "Final", "Final", "Final", …
#> $ Year <dbl> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 197…
#> $ Product <chr> "Aviation gasoline", "Charcoal", "Crude oil",…
#> $ Unit <chr> "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ", "TJ…
#> $ supply_sum <dbl> 0.0000, 4989.5980, 0.0000, 9846.0055, 3979.80…
#> $ consumption_sum <dbl> NA, 4989.5980, NA, 9846.0055, 3979.8004, 9136…
#> $ supply_minus_consumption <dbl> NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, 0, 0, 0, NA…
#> $ balance_OK <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ err <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Sorting
When manipulating IEA data frames, rows may attain different sorting
relative to the standard IEA order. The sort_iea_df()
function can return a data frame to IEA sort order.
Tidy_IEA_df <- load_tidy_iea_df()
Tidy_IEA_df %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#> Year Flow Product Edot
#> <dbl> <chr> <chr> <dbl>
#> 1 1971 Production Primary solid biofuels 87400.
#> 2 2000 Production Primary solid biofuels 162909.
#> 3 1971 Production Hydro 10472.
#> 4 2000 Production Hydro 23792.
#> 5 1971 Imports Crude oil 38360.
#> 6 2000 Imports Crude oil 54769.
#> 7 2000 Imports Liquefied petroleum gases (LPG) 1655.
#> 8 2000 Imports Motor gasoline excl. biofuels 18010.
#> 9 1971 Imports Aviation gasoline 44.8
#> 10 2000 Imports Aviation gasoline 224.
#> # ℹ 393 more rows
# Move the first row to the bottom to put everything out of order
Unsorted <- Tidy_IEA_df[-1, ] %>% dplyr::bind_rows(Tidy_IEA_df[1, ])
Unsorted %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#> Year Flow Product Edot
#> <dbl> <chr> <chr> <dbl>
#> 1 2000 Production Primary solid biofuels 162909.
#> 2 1971 Production Hydro 10472.
#> 3 2000 Production Hydro 23792.
#> 4 1971 Imports Crude oil 38360.
#> 5 2000 Imports Crude oil 54769.
#> 6 2000 Imports Liquefied petroleum gases (LPG) 1655.
#> 7 2000 Imports Motor gasoline excl. biofuels 18010.
#> 8 1971 Imports Aviation gasoline 44.8
#> 9 2000 Imports Aviation gasoline 224.
#> 10 1971 Imports Kerosene type jet fuel excl. biofuels 892.
#> # ℹ 393 more rows
# Now sort it to put everything right again
Sorted <- sort_iea_df(Unsorted)
Sorted %>% dplyr::select(Year, Flow, Product, Edot)
#> # A tibble: 403 × 4
#> Year Flow Product Edot
#> <dbl> <chr> <chr> <dbl>
#> 1 1971 Production Primary solid biofuels 87400.
#> 2 1971 Production Hydro 10472.
#> 3 1971 Imports Crude oil 38360.
#> 4 1971 Imports Aviation gasoline 44.8
#> 5 1971 Imports Kerosene type jet fuel excl. biofuels 892.
#> 6 1971 Imports Other kerosene 43.8
#> 7 1971 Imports Lubricants 756.
#> 8 1971 Exports Motor gasoline excl. biofuels -179.
#> 9 1971 Exports Other kerosene -175.
#> 10 1971 Exports Fuel oil -7437.
#> # ℹ 393 more rows
sort_iea_df()
also works with wide data frames in which
years are spread to the right.
# Pivoting the data results in unconventional ordering
Unsorted_wide <- Tidy_IEA_df %>% tidyr::pivot_wider(names_from = Year, values_from = Edot)
Unsorted_wide %>% dplyr::select(Flow, Product, `1971`, `2000`)
#> # A tibble: 294 × 4
#> Flow Product `1971` `2000`
#> <chr> <chr> <dbl> <dbl>
#> 1 Production Primary solid biofuels 87400. 162909.
#> 2 Production Hydro 10472. 23792.
#> 3 Imports Crude oil 38360. 54769.
#> 4 Imports Liquefied petroleum gases (LPG) NA 1655.
#> 5 Imports Motor gasoline excl. biofuels NA 18010.
#> 6 Imports Aviation gasoline 44.8 224.
#> 7 Imports Kerosene type jet fuel excl. biofuels 892. 1338.
#> 8 Imports Other kerosene 43.8 NA
#> 9 Imports Gas/diesel oil excl. biofuels NA 17104.
#> 10 Imports Lubricants 756. 1218.
#> # ℹ 284 more rows
# The wide data frame is not sorted correctly. Sort it.
Sorted_wide <- sort_iea_df(Unsorted_wide)
Sorted_wide %>% dplyr::select(Flow, Product, `1971`, `2000`)
#> # A tibble: 294 × 4
#> Flow Product `1971` `2000`
#> <chr> <chr> <dbl> <dbl>
#> 1 Production Primary solid biofuels 87400. 162909.
#> 2 Production Hydro 10472. 23792.
#> 3 Imports Crude oil 38360. 54769.
#> 4 Imports Liquefied petroleum gases (LPG) NA 1655.
#> 5 Imports Motor gasoline excl. biofuels NA 18010.
#> 6 Imports Aviation gasoline 44.8 224.
#> 7 Imports Kerosene type jet fuel excl. biofuels 892. 1338.
#> 8 Imports Other kerosene 43.8 NA
#> 9 Imports Gas/diesel oil excl. biofuels NA 17104.
#> 10 Imports Lubricants 756. 1218.
#> # ℹ 284 more rows
Conclusion
At this point, IEA extended energy balance data have been imported to
R
and are ready to be used. The next step could be to
specify some of the IEA energy flows. See the specify
vignette for details.