Load IEA data from an extended energy balances .csv file
If .slurped_iea_df is supplied, arguments .iea_file and text are ignored.
If .slurped_iea_df is absent, either .iea_file or text is required, and the helper function slurp_iea_to_raw_df() is called internally to load a raw data frame of data.
Usage
iea_df(
  .iea_file = NULL,
  text = NULL,
  expected_1st_line_start = ",,TIME",
  expected_2nd_line_start = "COUNTRY,FLOW,PRODUCT",
  expected_simple_start = expected_2nd_line_start,
  .slurped_iea_df = NULL,
  flow = "FLOW",
  missing_data = "..",
  not_applicable_data = "x",
  confidential_data = "c",
  estimated_year = "E"
)
Arguments
- .iea_file
A string containing the path to a .csv file of extended energy balances from the IEA. Can be a vector of file paths, in which case each file is loaded sequentially and stacked together with dplyr::bind_rows(). Default is the path to a sample IEA file provided in this package.
- text
A character string that can be parsed as IEA extended energy balances. (This argument is useful for testing.)
- expected_1st_line_start
the expected start of the first line of .iea_file. Default is ",,TIME".
- expected_2nd_line_start
the expected start of the second line of .iea_file. Default is "COUNTRY,FLOW,PRODUCT".
- expected_simple_start
the expected start of the first line of .iea_file when the file uses a single, simplified header line. Default is the value of expected_2nd_line_start. Note that expected_simple_start is sometimes encountered in data supplied by the IEA. Furthermore, expected_simple_start could be the format of the file when somebody "helpfully" fiddles with the raw data from the IEA.
- .slurped_iea_df
a data frame created by slurp_iea_to_raw_df().
- flow
the name of the flow column, entries of which are stripped of leading and trailing white space. Default is "FLOW".
- missing_data
a string that identifies missing data. Default is "..". Entries of missing_data are coded as 0 in output.
- not_applicable_data
a string that identifies not-applicable data. Default is "x". Entries of not_applicable_data are coded as 0 in output.
- confidential_data
a string that identifies confidential data. Default is "c". Entries of confidential_data are coded as 0 in output.
- estimated_year
a string that identifies an estimated year. Default is "E". E.g., in "2014E", the "E" indicates that data for 2014 are estimated. Data from estimated years are removed from output.
Details
Next, this function does some cleaning of the data.
In the IEA's data, some entries in the "FLOW" column are quoted to avoid creating too many columns.
For example, "Paper, pulp and printing" is quoted in the raw .csv file:
" Paper, pulp and printing".
Internally, this function uses data.table::fread(), which, unfortunately, does not strip leading and trailing white space from quoted entries.
So the function uses base::trimws() to finish the job.
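The trimming step can be sketched in base R. This is a minimal illustration, not the function's internal code: utils::read.csv() stands in for data.table::fread(), and the one-row input text is made up for the example.

```r
# Illustrative raw text; the quoted FLOW entry carries a leading space.
raw_text <- paste0(
  "COUNTRY,FLOW,PRODUCT,1960\n",
  "World,\" Paper, pulp and printing\",Hard coal (if no detail),42\n"
)

# utils::read.csv() is a stand-in for data.table::fread() in this sketch.
# The quoted entry keeps its leading space after reading.
df <- utils::read.csv(text = raw_text, check.names = FALSE,
                      stringsAsFactors = FALSE)

# Trim leading and trailing white space from the FLOW column.
df[["FLOW"]] <- base::trimws(df[["FLOW"]])
```

Because the entry is quoted, its embedded comma does not split the row into an extra column; only the padding needs to be removed afterwards.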
When the IEA includes estimated data for a year, the column name of the estimated year includes an "E" appended. (E.g., "2017E".) This function eliminates estimated columns.
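One way to drop estimated-year columns is to match column names that consist of a four-digit year followed by the estimated_year marker. The sketch below assumes that naming pattern and uses a made-up data frame; it is not necessarily how the function itself does it.

```r
# Illustrative data frame with one regular year and one estimated year.
df <- data.frame(
  COUNTRY = "World", FLOW = "Production", PRODUCT = "Hard coal (if no detail)",
  `2016` = 42, `2017E` = 43,
  check.names = FALSE, stringsAsFactors = FALSE
)

estimated_year <- "E"

# Assumption: an estimated column is named as four digits plus the marker.
is_estimated <- grepl(paste0("^[0-9]{4}", estimated_year, "$"), names(df))

# Keep every column that is not an estimated year.
df_no_estimates <- df[, !is_estimated, drop = FALSE]
```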
The IEA data have indicators for not applicable values ("x") and for unavailable values ("..").
(See "World Energy Balances: Database Documentation (2018 edition)" at http://wds.iea.org/wds/pdf/worldbal_documentation.pdf.)
R has three concepts that could be used for "x" and "..":
- 0 would indicate a value known to be zero.
- NULL would indicate an undefined value.
- NA would indicate a value that is unavailable.
In theory, mapping from the IEA's indicators to R should proceed as follows:
- ".." (unavailable) in the IEA data would be converted to NA in R.
- "x" (not applicable) in the IEA data would be converted to 0 in R.
- NULL would not be used.
However, the IEA are not consistent with their coding.
In some places ".." (indicating unavailable) is used for not applicable values, e.g., World Anthracite supply in 1971.
(World Anthracite supply in 1971 is actually not applicable, because Anthracite was classified under "Hard coal (if no detail)" in 1971.)
On the other hand, ".." is used for data in the most recent year when those data have not yet been incorporated into the database.
In the face of the IEA's inconsistencies, the only rational way to proceed is to convert both "x" and ".." in the IEA files to 0 in the output data frame from this function.
Furthermore, confidential data (coded by the IEA as "c") are also interpreted as 0.
(What else can we do?)
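The recoding described above can be sketched in base R as follows. The data frame, the product rows, and the assumption that year columns arrive as character strings named with four digits are all illustrative; the function's actual implementation may differ.

```r
# Illustrative data; year columns are character because they mix
# numbers with the IEA's indicator strings.
df <- data.frame(
  COUNTRY = c("World", "World", "World"),
  FLOW = rep("Production", 3),
  PRODUCT = c("Anthracite", "Hard coal (if no detail)", "Lignite"),
  `1971` = c("..", "42", "x"),
  `1972` = c("c", "43", "7"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Defaults taken from the Usage section.
sentinels <- c(missing_data = "..", not_applicable_data = "x",
               confidential_data = "c")

# Assumption: year columns are the ones named with exactly four digits.
year_cols <- grep("^[0-9]{4}$", names(df), value = TRUE)

for (col in year_cols) {
  values <- df[[col]]
  values[values %in% sentinels] <- "0"  # all three indicators become 0
  df[[col]] <- as.numeric(values)
}
```

After this step the year columns are numeric, at the cost that a genuine zero can no longer be distinguished from a missing, not-applicable, or confidential entry.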
The data frame returned from this function is not ready to be used in R, because rows are not unique.
To further prepare the data frame for use, call augment_iea_df(), passing the output of this function to the .iea_df argument of augment_iea_df().
This function is vectorized over .iea_file.
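Loading several inputs and stacking the results can be pictured with the base R sketch below. It is only an illustration of the idea: utils::read.csv() on in-memory text replaces file loading, base rbind() stands in for dplyr::bind_rows(), and the two inputs are made up.

```r
# Two illustrative inputs in the simplified header format.
texts <- c(
  "COUNTRY,FLOW,PRODUCT,1960,1961\nWorld,Production,Hard coal (if no detail),42,43\n",
  "COUNTRY,FLOW,PRODUCT,1960,1961\nGhana,Production,Hard coal (if no detail),1,2\n"
)

# Load each input in turn.
pieces <- lapply(texts, function(txt) {
  utils::read.csv(text = txt, check.names = FALSE, stringsAsFactors = FALSE)
})

# Stack the pieces row-wise. rbind() requires identical columns,
# whereas dplyr::bind_rows() fills missing columns with NA.
stacked <- do.call(rbind, pieces)
```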
Examples
# Original file format
iea_df(text = paste0(",,TIME,1960,1961\n",
                     "COUNTRY,FLOW,PRODUCT\n",
                     "World,Production,Hard coal (if no detail),42,43"))
#> COUNTRY FLOW PRODUCT 1960 1961
#> 1 World Production Hard coal (if no detail) 42 43
# With extra commas on the 2nd line
iea_df(text = paste0(",,TIME,1960,1961\n",
                     "COUNTRY,FLOW,PRODUCT,,,\n",
                     "World,Production,Hard coal (if no detail),42,43"))
#> COUNTRY FLOW PRODUCT 1960 1961
#> 1 World Production Hard coal (if no detail) 42 43
# With a clean first line
iea_df(text = paste0("COUNTRY,FLOW,PRODUCT,1960,1961\n",
                     "World,Production,Hard coal (if no detail),42,43"))
#> COUNTRY FLOW PRODUCT 1960 1961
#> 1 World Production Hard coal (if no detail) 42 43
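The three header layouts in the examples above can be told apart by comparing the start of the first one or two lines with the expected_* arguments. The helper below, detect_header_layout(), is hypothetical and exists only for this sketch; it is not part of the package, and the package's own check may work differently.

```r
# Hypothetical helper: classify the header layout of IEA-style text.
detect_header_layout <- function(text,
                                 expected_1st_line_start = ",,TIME",
                                 expected_2nd_line_start = "COUNTRY,FLOW,PRODUCT",
                                 expected_simple_start = expected_2nd_line_start) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Original layout: years on line 1, column names on line 2.
  if (length(lines) >= 2 &&
      startsWith(lines[1], expected_1st_line_start) &&
      startsWith(lines[2], expected_2nd_line_start)) {
    return("two-line")
  }
  # Simplified layout: a single header line with names and years.
  if (length(lines) >= 1 && startsWith(lines[1], expected_simple_start)) {
    return("simple")
  }
  "unknown"
}

original <- paste0(",,TIME,1960,1961\n",
                   "COUNTRY,FLOW,PRODUCT\n",
                   "World,Production,Hard coal (if no detail),42,43")
extra_commas <- paste0(",,TIME,1960,1961\n",
                       "COUNTRY,FLOW,PRODUCT,,,\n",
                       "World,Production,Hard coal (if no detail),42,43")
clean <- paste0("COUNTRY,FLOW,PRODUCT,1960,1961\n",
                "World,Production,Hard coal (if no detail),42,43")

layouts <- vapply(list(original, extra_commas, clean),
                  detect_header_layout, character(1))
```

Matching only the start of each line is what lets the variant with extra trailing commas on the second line be treated the same as the original layout.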