Chapter 3 R Data Modeling

Your project team is assigned to work with the accounts receivable team. Specifically, accounts receivable is about to review a portfolio of customers from a potential acquisition. Managers would like to query data provided through due diligence during the acquisition process. Some of the questions include:

  1. What is the income risk across applicant pools?
  2. Are there differences in applicant income?
  3. Does age matter?
  4. Is there a pattern of family dependents across applicant pools?
  5. How much income per dependent?

The analytics team will process the data, review its quality, and help accounts receivable answer these questions.

In this chapter we will build approaches to manage such queries, including pivot tables, lookups, and the creation of new metrics from existing data. We will expand this assortment of skills into the writing of functions, such as net present value and internal rate of return, more plotting of data, working with time series data, and fitting data to probability distributions.

3.2 Pivot tables and Vertical Lookups

These are, mythically at least, two of the most-used Excel features. Pivot tables are the slice and dice machine we use to partition data sets. Lookups allow us to relate one data table to another. We will explore these tools in R , here made easier and less apt to crash on large data sets. We start with some definitions.

3.2.1 Some definitions

The pivot table is a data summarization tool that can automatically sort, count, total, or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. This tool transforms a flat table of fields with rows of data into a table with grouped row values and column header values. The specification of grouped row values and column headers can rotate the flat table’s data rows into the intersection of the row and column labels.

“V” or “vertical” stands for the looking up of a value in a column. This feature allows the analyst to find approximate and exact matches between the look up value and a table value in a vertical column assigned to the look up value. A HLOOKUP function does the same lookup but for a specified row instead of a column.

3.2.2 Pivot and Parry

Let’s return to the Credit Card Applicant business questions:

  1. What is the income risk across applicant pools?
  2. Are there differences in applicant income?
  3. Does age matter?
  4. Is there a pattern of dependents across applicant pools?
  5. How much income per dependent?

The first step in building an analysis of the data relative to these questions is to understand the required dimensions of the data that apply to the questions. Here we would scan the table column names in the data base and look for

  1. Card status
  2. Ownership
  3. Employment
CreditCard read.csv("data/CreditCard.csv") str(CreditCard)
## 'data.frame': 1319 obs. of 13 variables: ## $ card : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 . ## $ reports : int 0 0 0 0 0 0 0 0 0 0 . ## $ age : num 37.7 33.2 33.7 30.5 32.2 . ## $ income : num 4.52 2.42 4.5 2.54 9.79 . ## $ share : num 0.03327 0.00522 0.00416 0.06521 0.06705 . ## $ expenditure: num 124.98 9.85 15 137.87 546.5 . ## $ owner : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 . ## $ selfemp : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 . ## $ dependents : int 3 3 4 0 2 0 2 0 0 0 . ## $ months : int 54 34 58 25 64 54 7 77 97 65 . ## $ majorcards : int 1 1 1 1 1 1 1 1 1 1 . ## $ active : int 12 13 5 7 5 1 5 3 6 18 . ## $ state : Factor w/ 3 levels "CT","NJ","NY": 3 3 3 3 3 3 3 3 3 3 . 

The str() function allows us to see all of the objects in CreditCard . Next lext look at the data itself inside this object using head (for the beginning of the data).

head(CreditCard, 3)
## card reports age income share expenditure owner selfemp ## 1 yes 0 37.66667 4.52 0.033269910 124.983300 yes no ## 2 yes 0 33.25000 2.42 0.005216942 9.854167 no no ## 3 yes 0 33.66667 4.50 0.004155556 15.000000 yes no ## dependents months majorcards active state ## 1 3 54 1 12 NY ## 2 3 34 1 13 NY ## 3 4 58 1 5 NY

Knowing the structure and a sample of the data, we can build a summary of the data and review the minimum, maximum, and quartiles in each of CreditCard ’s columns of data.

summary(CreditCard)
## card reports age income ## no : 296 Min. : 0.0000 Min. : 0.1667 Min. : 0.210 ## yes:1023 1st Qu.: 0.0000 1st Qu.:25.4167 1st Qu.: 2.244 ## Median : 0.0000 Median :31.2500 Median : 2.900 ## Mean : 0.4564 Mean :33.2131 Mean : 3.365 ## 3rd Qu.: 0.0000 3rd Qu.:39.4167 3rd Qu.: 4.000 ## Max. :14.0000 Max. :83.5000 Max. :13.500 ## share expenditure owner selfemp ## Min. :0.0001091 Min. : 0.000 no :738 no :1228 ## 1st Qu.:0.0023159 1st Qu.: 4.583 yes:581 yes: 91 ## Median :0.0388272 Median : 101.298 ## Mean :0.0687322 Mean : 185.057 ## 3rd Qu.:0.0936168 3rd Qu.: 249.036 ## Max. :0.9063205 Max. :3099.505 ## dependents months majorcards active ## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. : 0.000 ## 1st Qu.:0.0000 1st Qu.: 12.00 1st Qu.:1.0000 1st Qu.: 2.000 ## Median :1.0000 Median : 30.00 Median :1.0000 Median : 6.000 ## Mean :0.9939 Mean : 55.27 Mean :0.8173 Mean : 6.997 ## 3rd Qu.:2.0000 3rd Qu.: 72.00 3rd Qu.:1.0000 3rd Qu.:11.000 ## Max. :6.0000 Max. :540.00 Max. :1.0000 Max. :46.000 ## state ## CT:442 ## NJ:472 ## NY:405 ## ## ## 

We immediately see an age minimum of 0.2. Either this is an anomaly, or outright error, or there is an application not quite a year old!. Let’s filter the data for ages greater than 18 to be safe.

ccard CreditCard[CreditCard$age >= 18, ]

In the filter, the comma means keep data on applicants only at or in excess of 18 years of age. When we leave the column empty, it means apply this filter across all columns. We next review the distribution of ages of applicants to be sure our filter does the job properly. The function hist() builds a simple frequency histogram to visualize this data.

hist(ccard$age)

3.2.3 Try this exercise

What is the basic design of this inquiry?

  1. Business questions?
  2. Dimensions?
  3. Taxonomy and metrics?

To answer 1 and 2 we have business questions along the lines of indicator variables:

For 3 our basic taxonomy is:

  1. For each card issued…in New York
  2. …and for each owner…
  3. …who is employed…
  4. What are the range of income, average dependents, age, and income per dependent?

Here is the basic 3 step pivot table design. We should check if we have installed the dplyr package into the R environment.

# install.packages('dplyr') if not # already require(dplyr) ## 1: filter to keep three states. pivot.table filter(ccard, state %in% "NY") ## 2: set up data frame for by-group ## processing. pivot.table group_by(pivot.table, card, owner, selfemp) ## 3: calculate the three summary ## metrics options(dplyr.width = Inf) ## to display all columns pivot.table summarise(pivot.table, income.cv = sd(income)/mean(income), age.avg = mean(age), income.per.dependent = sum(income)/sum(dependents))

We then visualize results in a table. Here we use knitr , which is a package that powers rmarkdown . The function kable() is short for “knitr table.”

knitr::kable(pivot.table)
card owner selfemp income.cv age.avg income.per.dependent
no no no 0.4941859 31.91936 3.645848
no no yes 0.5652634 26.38542 2.852000
no yes no 0.3756274 36.01786 2.157589
no yes yes NaN 53.33333 Inf
yes no no 0.3298633 28.09311 5.313677
yes no yes 0.4367858 37.45238 7.062500
yes yes no 0.5519888 36.79503 3.154476
yes yes yes 0.5032180 41.91667 3.194547

3.2.4 Now to VLOOKUP

Let’s start with a different data set. We load this IBRD (World Bank) data that has

le read.csv("data/life_expectancy.csv", header = TRUE, stringsAsFactors = FALSE) sa read.csv("data/sanitation_.csv", header = TRUE, stringsAsFactors = FALSE)

Always we look at the first few records.

head(le)
## country years.life.expectancy.avg ## 1 Afghanistan 46.62135 ## 2 Albania 71.11807 ## 3 Algeria 61.81652 ## 4 Angola 41.65847 ## 5 Antigua and Barbuda 69.81219 ## 6 Arab World 60.93432
head(sa)
## country sanitation.avg ## 1 Afghanistan 25.39600 ## 2 Albania 85.36154 ## 3 Algeria 84.21538 ## 4 American Samoa 61.73077 ## 5 Andorra 100.00000 ## 6 Angola 36.00769

The job here is to join sanitation data with life expectancy data, by country. In Excel we would typically use a VLOOKUP(country, sanitation, 2, FALSE) statement.

  1. In this statement country is the value to be looked up, for example, “Australia”.
  2. The variable sanitation is the range of the sanitation lookup table of two columns of country and sanitation data, for example, B2:C104 in Excel.
  3. The 2 is the second column of the sanitation lookup table, for example column C .
  4. FALSE means don’t find an exact match.

In R we can use the merge() function.

life.sanitation merge(le[, c("country", "years.life.expectancy.avg")], sa[, c("country", "sanitation.avg")])

The whole range of countries is populated by the lookup.

head(life.sanitation, 3)
## country years.life.expectancy.avg sanitation.avg ## 1 Afghanistan 46.62135 25.39600 ## 2 Albania 71.11807 85.36154 ## 3 Algeria 61.81652 84.21538

3.2.5 Try this exercise

We will load yet another data set on house prices. Suppose we work for a housing developer like Toll Brothers (NYSE: TOL) and want to allocate resources to marketing and financing the building of luxury homes in major US metropolitan areas. We have data for one test market.

hprice read.csv("data/hprice.csv")

Let’s look at the available data:

summary(hprice)
## ID Price SqFt Bedrooms ## Min. : 1.00 Min. : 69100 Min. :1450 Min. :2.000 ## 1st Qu.: 32.75 1st Qu.:111325 1st Qu.:1880 1st Qu.:3.000 ## Median : 64.50 Median :125950 Median :2000 Median :3.000 ## Mean : 64.50 Mean :130427 Mean :2001 Mean :3.023 ## 3rd Qu.: 96.25 3rd Qu.:148250 3rd Qu.:2140 3rd Qu.:3.000 ## Max. :128.00 Max. :211200 Max. :2590 Max. :5.000 ## Bathrooms Offers Brick Neighborhood ## Min. :2.000 Min. :1.000 No :86 East :45 ## 1st Qu.:2.000 1st Qu.:2.000 Yes:42 North:44 ## Median :2.000 Median :3.000 West :39 ## Mean :2.445 Mean :2.578 ## 3rd Qu.:3.000 3rd Qu.:3.000 ## Max. :4.000 Max. :6.000

Our business questions include:

  1. What are the most valuable (higher price) neighborhoods?
  2. What housing characteristics maintain the most housing value?

First, where and what are the most valuable houses? One way to answer this is to build a pivot table. Next we pivot the data and build metrics into the query. We will use the mean() and standard deviation sd() functions to help answer our questions.

require(dplyr) ## 1: filter to those houses with ## fairly high prices pivot.table filter(hprice, Price > 99999) ## 2: set up data frame for by-group ## processing pivot.table group_by(pivot.table, Brick, Neighborhood) ## 3: calculate the summary metrics options(dplyr.width = Inf) ## to display all columns pivot.table summarise(pivot.table, Price.avg = mean(Price), Price.cv = sd(Price)/mean(Price), SqFt.avg = mean(SqFt), Price.per.SqFt = mean(Price)/mean(SqFt))

Then we visualize in a table.

knitr::kable(pivot.table)
Brick Neighborhood Price.avg Price.cv SqFt.avg Price.per.SqFt
No East 121095.7 0.1251510 2019.565 59.96125
No North 115307.1 0.0939797 1958.214 58.88382
No West 148230.4 0.0912350 2073.478 71.48878
Yes East 135468.4 0.0977973 2031.053 66.69863
Yes North 118457.1 0.1308498 1857.143 63.78462
Yes West 175200.0 0.0930105 2091.250 83.77764

Based on this data set from one metropolitan area, the most valuable properties (fetching the highest average price and price per square foot) are made of brick in the West neighborhood. Brick or not, the West neighborhood also seems have the lowest relative variation in price.

Now for something different: functions.

3.3 Why Functions?

We will encapsulate several operations into a reusable storage device called a function. The usual suspects and candidates for the use of functions are:

In both cases the logic and coding is easier to understand, easier to work with, easier to build into larger things, and less prone to breaches of plain-old stubby finger breaches of operational safety and security.

For example, here is an Excel look-alike NPV function. We enter this into a code-chunk in an R markdown file or directly into the console to store the function into the current R environment. Once that is done, we now have a new function we can use like any other function.

## Net Present Value function Inputs: ## vector of rates (rates) with 0 as ## the first rate for time 0, vector ## of cash flows (cashflows) Outputs: ## scalar net present value NPV.1 function(rates, cashflows) sum(cashflows/(1 + rates)^(seq_along(cashflows) - 1)) return(NPV) >

The structure of a function has these parts:

  1. A header describes the function along with inputs and outputs. Here we use comment characters # to describe and document the function.
  2. A definition names the function and identify the interface of inputs and outputs to the programming environment. The name is like a variable and is assigned to function() , where inputs are defined.
  3. Code statements take the inputs from the definition and program the tasks, logic, and decisions in the function’s work flow into output.
  4. An output statement releases the function’s results for use in other code statements outside of the function’s “mini-verse.” We use the formal return() function to identify the output that the function will produce. If we did not use return() , then R will us the last assigned variable as the output of the function.

In this example We generate data internal to the function:

Our functions get used just like the built-in ones, for example, mean() . Let’s define rates and cashflows as vector inputs to the NPV.1() function and run this code.

rates c(0, 0.08, 0.06, 0.04) ## first rate is always 0.00 cashflows c(-100, 200, 300, 10) NPV.1(rates, cashflows)
## [1] 361.0741

We go back to the declaration and look at the parts:

## Net Present Value function Inputs: ## vector of rates (rates) with 0 as ## the first rate for time 0, vector ## of cash flows (cashflows) Outputs: ## scalar net present value NPV.1 function(rates, cashflows) sum(cashflows/(1 + rates)^(seq_along(cashflows) - 1)) return(NPV) >

Interfaces refer to these components:

We can also call other functions we’ve written. We use return() to explicitly say what the output is. This is simply good documentation. Alternately, a function will return the last evaluation.

Comments, that is, lines that begin with # , are not required by R , but are always a good and welcome idea that provide a terse description of purpose and direction. Initial comments should also include a listing of inputs, also called “arguments,” and outputs.

3.3.1 What should be a function?

Functions should be written for code we are going to re-run, especially if it will be re-run with changes in inputs. They can also be code chunks we keep highlighting and hitting return on. We often write functions for code chunkswhich are small parts of bigger analyses.

In the next redition of irr.1 we improve the code with named and default arguments.

## Internal Rate of Return (IRR) ## function Inputs: vector of cash ## flows (cashflows), scalar ## interations (maxiter) Outputs: ## scalar net present value IRR.1 function(cashflows, maxiter = 1000) seq_along(cashflows) - 1 ## rate will eventually converge to ## IRR f function(rate) (sum(cashflows/(1 + rate)^t)) ## use uniroot function to solve for ## root (IRR = rate) of f = 0 c(-1,1) ## bounds solution for only positive ## or negative rates select the root ## estimate return(uniroot(f, c(-1, 1), maxiter = maxiter)$root) >

Here the default argument is maxiter which controls the number of iterations. At our peril we can eliminate this argument if we want. This illustrates yet another need for functions: we can put error and exception logic to handle somtimes fatal issues our calculations might present.

Here are the cashflows for a 3% coupon bond bought at a hefty premium.

cashflows c(-150, 3, 3, 3, 3, 3, 3, 3, 103) IRR.1(cashflows)
## [1] -0.02554088
IRR.1(cashflows, maxiter = 100)
## [1] -0.02554088

We get a negative IRR or yield to maturity on this net present value = 0 calculation.

3.3.2 Shooting trouble

Problem: We see “odd”" behavior when arguments aren’t as we expect.

NPV.1(c(0.1, 0.05), c(-10, 5, 6, 100))
## [1] 86.10434

We do get a result, but…

Here the function calculates a net present value. But the analyst entered two rates for four cash flows.

Solution: We put sanity checks into the code.

## Net Present Value function Inputs: ## vector of rates (rates) with 0 as ## the first rate for time 0, vector ## of cash flows (cashflows), length ## of rates must equal length of ## cashflows Outputs: scalar net ## present value NPV.2 function(rates, cashflows) < stopifnot(length(rates) == length(cashflows)) NPV sum(cashflows/(1 + rates)^(seq_along(cashflows) - 1)) return(NPV) >

Here are some thoughts about stopifnot TRUE error handling

NPV.2(c(0.1, 0.05), c(-10, 5, 6, 100))

Hit (not too hard!) the Escape key on your keyboard, This will take you out of Browse[1]> mode and back to the console prompt > .

3.3.3 What the function can see and do

Each function has its own environment. Names here will override names in the global environment. The function’s internal environment starts with the named arguments. Assignments inside the function only change the internal environment. If a name is not defined in the function, the function will look for this name in the environment the function gets called from.

3.3.4 Try this …

Your company is running a 100 million pound sterling project in the EU. You must post 25% collateral in a Landesbank using only high-quality government securities. You find a high-quality gilt fund that will pay 1.5% (coupon rate) annually for three years.

Some questions for analysis

  1. How much would you pay for this collateral if the rate curve (yield to maturity of cash flows) is (from next year on…)
rates c(-0.001, 0.002, 0.01)
  1. Suppose a bond dealer asks for 130% of notional collateral value for this bond. What is the yield on this transaction (IRR)? Would you buy it?
  2. What is the return on this collateral if you terminate the project in one year and liquidate the collateral (i.e., sell it for cash) if the yield shifts down by 0.005? This is a “parallel” shift, which is finance for: “take each rate and deduct 0.005.”

To get at these requirements we will build rates and cash flows across the 3-year time frame, remembering our previous work.

(rates c(0, rates))
## [1] 0.000 -0.001 0.002 0.010
collateral.periods 3 collateral.rate 0.25 collateral.notional collateral.rate * 100 coupon.rate 0.015 cashflows rep(coupon.rate * collateral.notional, collateral.periods) cashflows[collateral.periods] collateral.notional + cashflows[collateral.periods] (cashflows c(0, cashflows))
## [1] 0.000 0.375 0.375 25.375

What just happened?

  1. We appended a 0 to the rate schedule so we can use the NPV.2 function.
  2. We then parameterized the term sheet (terms of the collateral transaction),
  3. We used rep() to form coupon cash flows.
  4. Then we added notional value repayment to the last cash flow.

Now we can calculate the present value of the bond using NPV.2 .

(Value.0 NPV.2(rates, cashflows))
## [1] 25.3776

The answer is 25.378 million pounds sterling or Value.0 / collateral.notional times the notional value.

The yield to maturity averages the forward rates across the bond cash flows. A “forward rate” is the rate per period we would expect to earn each period. This is one interpretation of the Internal Rate of Return (“IRR”).

cashflows.IRR cashflows collateral.ask 130 cashflows.IRR[1] -(collateral.ask/100) * collateral.notional ## mind the negative sign! (collateral.IRR.1 IRR.1(cashflows.IRR))
## [1] -0.07112366

You end up paying over 7% per annum for the privilege of owning this bond! You call up the European Central Bank, report this overly hefty haircut on your project. You send out a request for proposal to other bond dealers. They come back with an average asking price of 109 (109% of notional).

cashflows.IRR cashflows collateral.ask 109 cashflows.IRR[1] -(collateral.ask/100) * collateral.notional (collateral.IRR.1 IRR.1(cashflows.IRR))
## [1] -0.01415712

That’s more like it: about 140 basis points (1.41% x 100 basis points per percentage) cost (negative sign).

Let’s unwind the project, and the collateral transaction, in 1 year. Let’s suppose the yield curve in 1 year has parallel shifted down by 0.005.

rate.shift -0.005 rates.1 c(0, rates[-2]) + rate.shift cashflows.1 c(0, cashflows[-2]) (Value.1 NPV.2(rates.1, cashflows.1))
## [1] 25.37541
(collateral.return.1 Value.1/(-cashflows.IRR[1]) - 1)
## [1] -0.0687923

This results ooks much more than a break-even return on the collateral transation:

(collateral.gainloss collateral.notional * collateral.return.1) * 1e+06
## [1] -1719807
## adjust for millions of euros

That’s probably someone’s salary…(in pounds sterling).

3.3.5 Mind the Interface!

Interfaces mark out a controlled inner environment for our code;

Let’s build (parametric) distributions next.

3.4 Making distributions

As always, let’s load some data, this time from the Bureau of Labor Statistics (BLS) and load the export-import price index whose description and graph are at http://data.bls.gov/timeseries/EIUIR?output_view=pct_1mth. We look up the symbols “EIUIR” and “EIUIR100” and download a text file that we then convert to a comma separated variable or csv file in Excel. We deposit the csv file in a directory and read it into a variable called EIUIR .

require(xts) require(zoo) EIUIR read.csv("data/EIUIR.csv") head(EIUIR)
## Date Value ## 1 2006-01-01 113.7 ## 2 2006-02-01 112.8 ## 3 2006-03-01 112.7 ## 4 2006-04-01 115.1 ## 5 2006-05-01 117.2 ## 6 2006-06-01 117.3
xmprice na.omit(EIUIR) ## to clean up any missing data str(xmprice)
## 'data.frame': 131 obs. of 2 variables: ## $ Date : Factor w/ 131 levels "2006-01-01","2006-02-01". 1 2 3 4 5 6 7 8 9 10 . ## $ Value: num 114 113 113 115 117 . 

We might have to have installed separatetly the xts and zoo packages that handle time series data explicitly. The str() function indicates that the Value column in the data frame contains the export price series. We then compute the natural logarithm of prices and calculate the differences to get rates of growth from month to month. A simple plot reveals aspects of the data series to explore.

xmprice.r as.zoo(na.omit((diff(log(xmprice$Value))))) ## compute rates head(xmprice.r)
## 1 2 3 4 5 ## -0.0079470617 -0.0008869180 0.0210718947 0.0180805614 0.0008528785 ## 6 ## 0.0076433493
plot(xmprice.r, type = "l", col = "blue", xlab = "Date", main = "Monthly 2/2000-9/2016")

We further transform the data exploring the absolute value of price rates. This is a first stab at understanding the clustering of volatility in financial-economic time series, a topic to which we will return.

xmprice.r.df data.frame(xmprice.r, Date = index(xmprice.r), Rate = xmprice.r[, 1], Rate.abs abs(xmprice.r[, 1])) head(xmprice.r.df)
## xmprice.r Date Rate Rate.abs. abs.xmprice.r. 1.. ## 1 -0.0079470617 1 -0.0079470617 0.0079470617 ## 2 -0.0008869180 2 -0.0008869180 0.0008869180 ## 3 0.0210718947 3 0.0210718947 0.0210718947 ## 4 0.0180805614 4 0.0180805614 0.0180805614 ## 5 0.0008528785 5 0.0008528785 0.0008528785 ## 6 0.0076433493 6 0.0076433493 0.0076433493
str(xmprice.r.df)
## 'data.frame': 130 obs. of 4 variables: ## $ xmprice.r : num -0.007947 -0.000887 0.021072 0.018081 0.000853 . ## $ Date : int 1 2 3 4 5 6 7 8 9 10 . ## $ Rate : num -0.007947 -0.000887 0.021072 0.018081 0.000853 . ## $ Rate.abs. abs.xmprice.r. 1. num 0.007947 0.000887 0.021072 0.018081 0.000853 . 

We can achieve a “prettier” plot with the ggplot2 package. In the ggplot statements we use aes , “aesthetics”, to pick x (horizontal) and y (vertical) axes. The added ( + ) geom_line is the geometrical method that builds the line plot.

require(ggplot2) ggplot(xmprice.r.df, aes(x = Date, y = Rate)) + geom_line(colour = "blue")

Let’s try a bar graph of the absolute value of price rates. We use geom_bar to build this picture.

require(ggplot2) ggplot(xmprice.r.df, aes(x = Date, y = Rate.abs)) + geom_bar(stat = "identity", colour = "green")

3.4.1 Try this exercise

Let’s overlay returns ( geom_line ) and their absolute value geom_bar .

By examining this chart, what business questions about your Univeral Export-Import Ltd supply chain could this help answer? Why is this helpful?

require(ggplot2) ggplot(xmprice.r.df, aes(Date, Rate.abs)) + geom_bar(stat = "identity", colour = "darkorange") + geom_line(data = xmprice.r.df, aes(Date, Rate), colour = "blue")

The plot goes a long way to answering the question: When supply and demand tightens, does price volatility cluster?

  1. If we are selling, we would experience strong swings in demand and thus in revenue at the customer fulfillment end of the chain.
  2. If we are buying, we would experience strong swings in cost and input product utilization at the procurement end of the chain.
  3. For the financial implications: we would have a tough time making the earnings we forecast to the market.

3.4.2 Picture this

We import goods as input to our manufacturing process. We might want to know the odds that a very high export-import rate might occur. We can plot a cumulative distribution function (cdf or CDF) call. we can build this plot using the stat_ecdf() function in ggplot2 .

require(ggplot2) ggplot(xmprice.r.df, aes(Rate)) + stat_ecdf(colour = "blue")

3.4.3 Try another exercise

  1. Suppose the procurement team’s delegation of authority remit states: “Procurement may approve input invoices when there is only a 5% chance that prices will rise any higher than the price rate associated with that tolerance. If input prices do rise higher than the tolerable rate, you must get divisional approval.”
  2. Plot a vertical line to indicate the maximum tolerable rate for procurement using the BLS EIUR data from 2000 to the present.

We can implement these requirements with the following code.

require(ggplot2) r.tol.pct 0.95 r.tol quantile(xmprice.r.df$Rate, r.tol.pct) r.tol.label paste("Tolerable Rate = ", round(r.tol, 2)) ggplot(xmprice.r.df, aes(Rate)) + stat_ecdf(colour = "blue", size = 1.5) + geom_vline(xintercept = r.tol, colour = "red", size = 1.5) + annotate("text", x = r.tol - 0.05, y = 0.75, label = r.tol.label, colour = "darkred")

This may be a little more than we bargained for originally. We used the paste and round (to two, 2 , decimal places) functions to make a label. We made much thicker lines ( size = 1.5 ). At 2% we drew a line with geom_vline() and annotated the line with text.

Now that we have made some distributions out of live data, let’s estimate the parameters of specific distributions that might be fit to that data.

3.5 Optimization

The optimization we will conduct here helps us to find the distribution that best fits the data. We will use results from optimization to simulate that data to help us make decisions prospectively.

There are many distributions in R : ?distributions will tell you all about them.

  1. If name is the name of a distribution (e.g., norm for “normal”), then
  1. And ways to write your own (like the pareto distribution we use in finance)

3.5.1 Try this exercise

Suppose the EIUR price series is the benchmark in several import contracts you write as the procurement officer of your organization. Your concern is with volatility. Thus you think that you need to simulate the size of the price rates, whatever direction they go in. Draw the histogram of the absolute value of price rates.

require(ggplot2) r.tol quantile(xmprice.r.df$Rate, 0.95) r.tol.label paste("Tolerable Rate = ", round(r.tol, 2)) ggplot(xmprice.r.df, aes(Rate.abs)) + geom_histogram(fill = "cornsilk", colour = "grey60") + geom_density() + geom_vline(xintercept = r.tol, colour = "red", size = 1.5) + annotate("text", x = 0.055, y = 30, label = r.tol.label, colour = "darkred")

This series is right-skewed and thickly-tailed. We will use this function to pull several of the statistics calculations together.

## r_moments function INPUTS: r vector ## OUTPUTS: list of scalars (mean, sd, ## median, skewness, kurtosis) data_moments function(data) < require(moments) mean.r mean(data) sd.r sd(data) median.r median(data) skewness.r skewness(data) kurtosis.r kurtosis(data) result data.frame(mean = mean.r, std_dev = sd.r, median = median.r, skewness = skewness.r, kurtosis = kurtosis.r) return(result) >

We might need to install.packages("moments") to make this function operate. We then run these sentences.

ans data_moments(xmprice.r.df$Rate.abs) ans round(ans, 4) knitr::kable(ans)
mean std_dev median skewness kurtosis
0.0109 0.0117 0.0074 2.523 12.1545

As we visually surmised, the series is right-skewed and very thickly tailed. This may indicate that the gamma and pareto functions may help us describe these series and prepare us for simulations, estimation, and inference. We will make liberal use of the fitdistr function from MASS and come back to this moments function.

3.6 Estimate until morale improves…

We will try one method that works often enough in practice, Method of Moments (“MM” or, more affectionately, “MOM”). This estimation technique finds the distribution parameters such that the moments of the data match the moments of the distribution. Other methods include:

uppose we believe that absolute price rates somehow follow a gamma distribution. You can look up this distribution easily enough in Wikipedia’s good article on the subject. Behind managerial scenes, we can model the loss with gamma severity function that allows for skewness and “heavy” tails. We can specify the gamma distribution with by shape, \(\alpha\) , and scale, \(\beta\) , parameters. We will find in operational loss analysis that this distribution is especially useful for time-sensitive losses.

It turns out We can specify the shape and scale parameters using the mean, \(\mu\) , and standard deviation, \(\sigma\) of the random severities, \(X\) . The scale parameter is \[ \beta = \sigma^2 / \mu, \] and shape parameter,

\[ \alpha = \mu^2 / \sigma^2. \]

The distribution itself is defined as \[ f(x; alpha, \beta) = \fracx^e^>, \] where, to have a complete statement, \[ \Gamma(x) = \int_^ <\infty>x^ e^ dx. \] Let’s finally implement into R .

First, we will load a cost sample and calculate moments and gamma parameters:

cost read.csv("data/cost.csv") cost cost$x cost.moments data_moments(cost) cost.mean cost.moments$mean cost.sd cost.moments$std_dev (cost.shape cost.mean^2/cost.sd^2)
## [1] 19.06531
(cost.scale cost.sd^2/cost.mean)
## [1] 0.5575862
gamma.start c(cost.shape, cost.scale)

When performing these calculations, be sure that the function data_moments is loaded into the workspace.

Second, we can use fitdistr from the Mass package to estimate the gamma parameters alpha and beta .

require(MASS) fit.gamma.cost fitdistr(cost, "gamma") fit.gamma.cost
## shape rate ## 20.2998092 1.9095724 ## ( 2.3729250) ( 0.2259942)

Third, we construct the ratio of estimates to the standard error of estimates. This computes the number of standard deviations away from zero the estimates are. If they are “far” enough away from zero, we have reason to reject the null hypothesis that the estimates are no different from zero.

(cost.t fit.gamma.cost$estimate/fit.gamma.cost$sd)
## shape rate ## 8.554762 8.449652
knitr::kable(cost.t)
shape 8.554762
rate 8.449652

Nice…but we also note that the scale parameter is fit.gamma.cost$estimate[2] / gamma.start[2] times the moment estimates above.

3.6.1 Try this exercise

Let’s use the export-input price series rates and the t distribution instead of the gamma .

First, we calculate the moments (mean, etc.).

rate xmprice.r.df$Rate rate.moments data_moments(rate) (rate.mean rate.moments$mean)
## [1] 0.0004595748
(rate.sd rate.moments$std_dev)
## [1] 0.01602021

Second, we use fitdistr from the Mass package to estimate the parameters of the t distribution.

fit.t.rate fitdistr(rate, "t", hessian = TRUE) fit.t.rate
## m s df ## 0.001791738 0.009833018 2.888000806 ## (0.001059206) (0.001131725) (0.843729312)

Third, we infer if we did a good job or not. The null hypothesis is that the parameters are no different than zero ( \(H_0\) ). We calculate t statistics to approximate the mapping of parameter estimates to a dimensionless scale that will compute the number of standard deviations from the null hypothesis that the parameters are just zero and of no further use.

## m s df ## 1.691586 8.688522 3.422900

Nice…but that location parameter is a bit low relative to moment estimate. What else can we do? Simulate the estimated results and see if, at least, skewness and kurtosis line up with the moments.

3.7 Summary

We used our newly found ability to write functions and built insightful pictures of distributions. We also ran nonlinear (gamma and t-distributions are indeed very nonlinear) regressions using a package and the method of moments. All of this to answer critical business questions.

More specifically we waded into:

3.8 Further Reading

Teetor’s various chapters have much to guide us in the writing of functions and the building of expressions. Present value and internal rate of return can be found in Brealey et al. Use of ggplot2 in this chapter relies heavily on Chang (2014).

3.9 Practice Sets

These practice sets reference materials developed in this chapter. We will explore new problems and data with models, R packages, tables, and plots worked out already in the chapter.

3.9.1 Set A

In this set we will build a data set using filters and if and diff statements. We will then answer some questions using plots and a pivot table report. We will then review a function to house our approach in case we would like to run the same analysis on other data sets.

3.9.1.1 Problem

Supply chain managers at our company continue to note we have a significant exposure to heating oil prices (Heating Oil No. 2, or HO2), specifically New York Harbor. The exposure hits the variable cost of producing several products. When HO2 is volatile, so is earnings. Our company has missed earnings forecasts for five straight quarters. To get a handle on Brent we download this data set and review some basic aspects of the prices.

# Read in data HO2 read.csv("data/nyhh02.csv", header = T, stringsAsFactors = F) # stringsAsFactors sets dates as # character type head(HO2) HO2 na.omit(HO2) ## to clean up any missing data str(HO2) # review the structure of the data so far

3.9.1.2 Questions

  1. What is the nature of HO2 returns? We want to reflect the ups and downs of price movements, something of prime interest to management. First, we calculate percentage changes as log returns. Our interest is in the ups and downs. To look at that we use if and else statements to define a new column called direction . We will build a data frame to house this analysis.
# Construct expanded data frame return as.numeric(diff(log(HO2$DHOILNYH))) * 100 size as.numeric(abs(return)) # size is indicator of volatility direction ifelse(return > 0, "up", ifelse(return  0, "down", "same")) # another indicator of volatility date as.Date(HO2$DATE[-1], "%m/%d/%Y") # length of DATE is length of return +1: omit 1st observation price as.numeric(HO2$DHOILNYH[-1]) # length of DHOILNYH is length of return +1: omit first observation HO2.df na.omit(data.frame(date = date, price = price, return = return, size = size, direction = direction)) # clean up data frame by omitting NAs str(HO2.df)

We can plot with the ggplot2 package. In the ggplot statements we use aes , “aesthetics”, to pick x (horizontal) and y (vertical) axes. Use group =1 to ensure that all data is plotted. The added ( + ) geom_line is the geometrical method that builds the line plot.

Let’s try a bar graph of the absolute value of price rates. We use geom_bar to build this picture.

Now let’s build an overlay of return on size .

  1. Let’s dig deeper and compute mean, standard deviation, etc. Load the data_moments() function. Run the function using the HO2.df$return subset and write a knitr::kable() report.
  2. Let’s pivot size and return on direction . What is the average and range of returns by direction? How often might we view positive or negative movements in HO2?

3.9.2 Set B

We will use the data from the previous set to investigate the distribution of returns we generated. This will entail fitting the data to some parametric distributions as well as plotting and building supporting data frames.

3.9.2.1 Problem

We want to further characterize the distribution of up and down movements visually. Also we would like to repeat the analysis periodically for inclusion in management reports.

3.9.2.2 Questions

  1. How can we show the differences in the shape of ups and downs in HO2, especially given our tolerance for risk? Let’s use the HO2.df data frame with ggplot2 and the cumulative relative frequency function stat_ecdf .
  2. How can we regularly, and reliably, analyze HO2 price movements? For this requirement, let’s write a function similar to data_moments .

Let’s test HO2_movement() .

Morale: more work today (build the function) means less work tomorrow (write yet another report).

  1. Suppose we wanted to simulate future movements in HO2 returns. What distribution might we use to run those scenarios? Here, let’s use the MASS package’s fitdistr() function to find the optimal fit of the HO2 data to a parametric distribution.

3.9.3 Practice Set Debrief

  1. List the R skills needed to complete these practice sets.
  2. What are the packages used to compute and graph results. Explain each of them.
  3. How well did the results begin to answer the business questions posed at the beginning of each practice set?

3.10 Project

3.10.1 Background

Your company uses natural gas as a major input to recycle otherwise non-recoverable waste. The only thing that prevents the operation from being a little better than break-even is volatile natural gas prices. In its annual review, management requires information and analysis of recycling operations with a view to making decisions about outsourcing contracts. These contracts typically have three year tenors.

Since management will be making do or outsource decisions over a three year forward span, analysts will build models that characterize historical movements in natural gas prices, the volatility of those prices, and probability distributions to simulate future natural gas scenarios.

3.10.2 Data

In a preliminary analysis, you gather data from FRED on daily natural gas prices. You whill use this data to characterize historical natural gas price movements and construct provisional probability distributions for eventual generation of forward scenarios.

3.10.3 Workflow

  1. Data collection. Collect, clean, and review data definitions, and data transformations of price into returns. Use tables and graphs to report results.
  2. Analysis.
  1. Observations and Recommendations.

3.10.4 Assessment

We will use the following rubric to assess our performance in producing analytic work product for the decision maker.