One of the first examples about using linear regression models in finance is the calculation of betas, the so called market model. Coefficient beta is a measure of systematic risk and it is calculated by estimating a linear model where the dependent variable is the return vector of a stock and the explanatory variable is the return vector of a diversified local market index, such as SP500 (US), FTSE (UK), Ibovespa (Brazil), or any other.

From the academic side, the calculation of beta is part of a famous asset pricing model, CAPM - Capital Asset Pricing Model, that relates expected return and systematic risk. One can reach the market model equation by assuming several conditions such as Normal distributed returns, rational investors and frictionless market. Summing up, the CAPM model predicts that betas have a linear relationship to expected returns, that is, stocks with higher betas should present, collectively, higher average of historical returns.

In the quantitative side, we can formulate the market model as:

\(R_t = \alpha + \beta R_{M,t} + \epsilon _t\)

where \(R_{t}\) is the return of the stock at time \(t\), \(R_{M,t}\) is the return of the market index, \(\alpha\) is the constant (also called Jensen’s alphas) and, finally, \(\beta\) is the measure of systematic risk for the stock. The values of \(\alpha\) and \(\beta\) are found by minimizing the sum of squared errors of the model. So, if you have a vector of prices for a stock and another vector of prices for the local market index, you can easily find the stock’s beta by calculating the daily returns and estimating the market model by OLS.

The problem here is that, usually, you don’t want the beta of a single stock. You want to calculate the systematic risk for a large number of stocks. This is where students usually have problems, as they only learned in class how to estimate one model. In order to do the same procedure for more than one stock, some programming is needed. This is where R really shines in comparison to simpler programs such as Excel.

In this post I will download some data from the US market, make some adjustments to the resulting dataframe and discuss three ways to calculate the betas of several stocks. These are:

- Using a
`loop`

- Using function
`by`

- Using package
`dplyr`

But first, lets load the data.

# Loading the data and preparing it

I’m a bit biased, but I really like using package `BatchGetSymbols`

to download financial data from yahoo finance. In this example we will download data for 10 stocks selected randomly from the SP500 index. I will also add the ticker `^GSPC`

, which belongs to the SP500 index. We will need it to calculate the betas. In order for the code to be reproducible, I will set `random.seed(100)`

. This means that anyone that runs the code available here will get the exact same results.

`library(BatchGetSymbols)`

`## Loading required package: rvest`

`## Loading required package: xml2`

`## Loading required package: dplyr`

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

`## `

```
set.seed(100)
ticker.MktIdx <- '^GSPC'
first.date <- as.Date('2015-01-01')
last.date <- as.Date('2019-01-01')
n.chosen.stocks <- 10 # can't be higher than 505
# get random stocks
my.tickers <- c(sample(GetSP500Stocks()$Tickers,n.chosen.stocks),
ticker.MktIdx)
l.out <- BatchGetSymbols(tickers = my.tickers,
first.date = first.date,
last.date = last.date)
```

```
##
## Running BatchGetSymbols for:
```

```
##
## tickers =FLS, ZBH, NUE, XEC, XRX, VRSK, FTV, WRK, ROK, VAR, ^GSPC
## Downloading data for benchmark ticker
## ^GSPC | yahoo (1|1) | Found cache file
## FLS | yahoo (1|11) | Found cache file - Got 100% of valid prices | Youre doing good!
## ZBH | yahoo (2|11) | Found cache file - Got 100% of valid prices | Good stuff!
## NUE | yahoo (3|11) | Found cache file - Got 100% of valid prices | Looking good!
## XEC | yahoo (4|11) | Found cache file - Got 100% of valid prices | Youre doing good!
## XRX | yahoo (5|11) | Found cache file - Got 100% of valid prices | Feels good!
## VRSK | yahoo (6|11) | Found cache file - Got 100% of valid prices | Well done!
## FTV | yahoo (7|11) | Found cache file - Got 62% of valid prices | OUT: not enough data (thresh.bad.data = 75%)
## WRK | yahoo (8|11) | Found cache file - Got 88% of valid prices | Well done!
## ROK | yahoo (9|11) | Found cache file - Got 100% of valid prices | Looking good!
## VAR | yahoo (10|11) | Found cache file - Got 100% of valid prices | Nice!
## ^GSPC | yahoo (11|11) | Found cache file - Got 100% of valid prices | Got it!
```

`df.stocks <- l.out$df.tickers`

Now, lets check if everything went well with the import process.

`print(l.out$df.control)`

```
## # A tibble: 11 x 6
## ticker src download.status total.obs perc.benchmark.dates threshold.decisi…
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 FLS yahoo OK 1006 1 KEEP
## 2 ZBH yahoo OK 1006 1 KEEP
## 3 NUE yahoo OK 1006 1 KEEP
## 4 XEC yahoo OK 1006 1 KEEP
## 5 XRX yahoo OK 1006 1 KEEP
## 6 VRSK yahoo OK 1006 1 KEEP
## 7 FTV yahoo OK 628 0.624 OUT
## 8 WRK yahoo OK 887 0.882 KEEP
## 9 ROK yahoo OK 1006 1 KEEP
## 10 VAR yahoo OK 1006 1 KEEP
## 11 ^GSPC yahoo OK 1006 1 KEEP
```

It seems that everything is Ok. All stocks have column `perc.benchmark.dates`

equal to one (100%), meaning that they have the exact same dates as the benchmark ticker.

Now, lets plot the time series of prices and look for any problem:

```
library(ggplot2)
p <- ggplot(df.stocks, aes(x=ref.date, y=price.adjusted)) +
geom_line() + facet_wrap(~ticker, scales = 'free')
print(p)
```

Again, we see that all prices seems to be Ok. This is one of the advantages of working with adjusted (and not closing) prices from yahoo finance. Artificial effects in the dataset such as ex-dividend prices, splits and inplits are already taken into account and the result is a smooth series without any breaks.

The final step in preparing the data is to add a column with the returns of the market index. This is not strictly necessary but I really like to keep things organized in a tabular way. Since we will match each vector of returns of the stocks to a vector of returns of the market index, it makes sense to *synchronize* the rows in the data.frame. First, we isolate the data for the market index in object `df.MktIdx`

and use function `match`

to make an index that matches the dates between the assets and the market index. We later use this index to build a new column in `df.stocks`

. See the next code:

```
df.MktIdx <- df.stocks[df.stocks$ticker==ticker.MktIdx, ]
idx <- match(df.stocks$ref.date, df.MktIdx$ref.date)
df.stocks$ret.MktIdx <- df.MktIdx$ret.adjusted.prices[idx]
```

Now that we have the data in the correct format and structure, let’s start to calculate some betas. Here is where the different approaches will differ in syntax. Let’s start with the first case, using loops.

# Estimating betas

## Using loops

Loops are great and (almost) everyone loves then. While they can be a bit more verbose than fancy on-liners, the structure of a loop is very flexible and this can help solve complex problems. Let use it in our problem.

The first step in using loops is the understand the vector that will be used as iterator in the loop. In our problem we are processing each stock, so the number of iterations in the loop is simply the number of stocks in the sample. We can find the unique stocks with the command `unique`

.

```
# Check unique tickers
unique.tickers <- unique(df.stocks$ticker)
# create a empty vector to store betas
beta.vec <- c()
for (i.ticker in unique.tickers){
# message to prompt
cat('\nRunning ols for',i.ticker)
# filter the data.frame for stock i.ticker
df.temp <- df.stocks[df.stocks$ticker==i.ticker, ]
# calculate beta with lm
my.ols <- lm(data = df.temp, formula = ret.adjusted.prices ~ ret.MktIdx)
# save beta
my.beta <- coef(my.ols)[2]
# store beta em beta.vec
beta.vec <- c(beta.vec, my.beta)
}
```

```
##
## Running ols for FLS
## Running ols for ZBH
## Running ols for NUE
## Running ols for XEC
## Running ols for XRX
## Running ols for VRSK
## Running ols for WRK
## Running ols for ROK
## Running ols for VAR
## Running ols for ^GSPC
```

```
# print result
print(data.frame(unique.tickers,beta.vec))
```

```
## unique.tickers beta.vec
## 1 FLS 1.2496762
## 2 ZBH 0.9130036
## 3 NUE 1.2551470
## 4 XEC 1.2956483
## 5 XRX 1.1673358
## 6 VRSK 0.8357139
## 7 WRK 1.2841215
## 8 ROK 1.1779125
## 9 VAR 0.9146368
## 10 ^GSPC 1.0000000
```

As you can see, the result is a lengthy code, but it works quite well. The final result is a dataframe with the tickers and their betas. Notice that, as expected, the betas are all positive and `^GSPC`

has a beta equal to 1.

## Using function `by`

Another way of solving the problem is to calculate the betas using one of the functions from the `apply`

family. In this case, we will use function `by`

. Be aware that you can also solve the problem using `tapply`

and `lapply`

. The code, however, will increase in complexity.

The function `by`

works similarly to `tapply`

. The difference is that it is oriented to dataframes. That is, given a grouping variable, the original dataframe is broken into smaller dataframes and each piece is passed to a function. This helps a lot our problem since we need to work with two columns, the vector of returns of the asset and the vector of returns of the market index.

Given the functional form of `by`

, will need to encapsulate a procedure that takes a dataframe as input and returns a coefficient beta, calculated from columns `ret`

and `ret.MktIdx`

. The next code does that.

```
get.beta <- function(df.temp){
# estimate model
my.ols <- lm(data=df.temp, formula = ret.adjusted.prices ~ ret.MktIdx)
# isolate beta
my.beta <- coef(my.ols)[2]
# return beta
return(my.beta)
}
```

The previous function accepts a single dataframe called `df.temp`

, uses it to calculate a linear model with function `lm`

and then returns the resulting beta, which is the second coefficient in `coef(my.ols)`

. Now, lets use it with function `by`

.

```
# get betas with by
my.l <- by(data = df.stocks,
INDICES = df.stocks$ticker,
FUN = get.beta)
# my.l is an objetct of class by. To get only its elements, we can unclass it
betas <- unclass(my.l)
# print result
print(data.frame(betas))
```

```
## betas
## ^GSPC 1.0000000
## FLS 1.2496762
## NUE 1.2551470
## ROK 1.1779125
## VAR 0.9146368
## VRSK 0.8357139
## WRK 1.2841215
## XEC 1.2956483
## XRX 1.1673358
## ZBH 0.9130036
```

Again, it worked well. Needless to say that the results are identical to the previous case.

## Using `dplyr`

Now, let’s solve our problem using package `dplyr`

. If you are not familiar with the *tidyverse* and the work of Hadley Wickham, you will be a happier person after reading the rest of this post, trust me.

Package `dplyr`

is one of my favorites and most used packages. It allows for the representation of data processing procedures in a simpler and more intuitive way. It really helps to tackle computational problems if you can fit it within a flexible structure. This is what, in my opinion, `dplyr`

does best. It combines clever functions with dataframes in the long (tidy) format.

Have a look in the next set of code.

```
library(dplyr)
beta.tab <- df.stocks %>%
group_by(ticker) %>% # group by column ticker
do(ols.model = lm(data = ., formula = ret.adjusted.prices ~ret.MktIdx)) %>% # estimate model
mutate(beta = coef(ols.model)[2]) # get coefficients
print(beta.tab)
```

```
## Source: local data frame [10 x 3]
## Groups: <by row>
##
## # A tibble: 10 x 3
## ticker ols.model beta
## <chr> <list> <dbl>
## 1 ^GSPC <lm> 1.
## 2 FLS <lm> 1.25
## 3 NUE <lm> 1.26
## 4 ROK <lm> 1.18
## 5 VAR <lm> 0.915
## 6 VRSK <lm> 0.836
## 7 WRK <lm> 1.28
## 8 XEC <lm> 1.30
## 9 XRX <lm> 1.17
## 10 ZBH <lm> 0.913
```

After loading `dplyr`

, we use the pipeline operator %>% to streamline all calculations. This means that we don’t need to keep a copy of intermediate calculations. Also, It looks pretty, don’t you agree?

The line `beta.tab <- df.stocks %>%`

passes the dataframe `df.stocks`

for the next line, `group_by(ticker) %>%`

, which will group the dataframe according the the values of column `ticker`

and pass the result for the next step. The line `do(ols.model = lm(data = ., formula = ret ~ret.MktIdx))`

estimates the model by passing a temporary dataframe and saves it in a column called `ols.model`

. Notice that the model is a `S3`

object and not a single value. The dataframe alternative `tibble`

is flexible with its content. The final line, `mutate(beta = coef(ols.model)[2])`

retrieves the beta from each element of the column `ols.model`

.

What I really like about `dplyr`

is that it makes it easy to extend the original code. As an example, if I wanted to use a second grouping variable, I can just add it in the second line as `group_by(ticker, newgroupingvariable)`

. This becomes handy if you need, lets say, to estimated the model in different time periods.

As an example, let’s assume that I want to split the sample for each stock in half and see if the betas change significantly from time period to the other. This robustness check is a very common procedure in scientific research. First, let’s build a new column in `df.stocks`

that sets the time periods as `Sample 1`

and `Sample 2`

. We can use `tapply`

for that;

```
set.sample <- function(ref.dates){
my.n <- length(ref.dates) # get number of dates
my.n.half <- floor(my.n/2) # get aproximate half of observations
# create grouping variable
samples.vec <- c(rep('Sample 1', my.n.half ), rep('Sample 2', my.n-my.n.half))
# return
return(samples.vec)
}
# build group
my.l <- tapply(X = df.stocks$ref.date,
INDEX = df.stocks$ticker,
FUN = set.sample )
# unsort it
my.l <- my.l[my.tickers]
# save it in dataframe
df.stocks$my.sample <- unlist(my.l)
```

We proceed by calling the same functions as before, but using an additional grouping variable.

```
beta.tab <- df.stocks %>%
group_by(ticker,my.sample) %>% # group by column ticker
do(ols.model = lm(data = ., formula = ret.adjusted.prices ~ret.MktIdx)) %>% # estimate model
mutate(beta = coef(ols.model)[2]) # get coefficients
print(beta.tab)
```

```
## Source: local data frame [20 x 4]
## Groups: <by row>
##
## # A tibble: 20 x 4
## ticker my.sample ols.model beta
## <chr> <chr> <list> <dbl>
## 1 ^GSPC Sample 1 <lm> 1.000
## 2 ^GSPC Sample 2 <lm> 1.000
## 3 FLS Sample 1 <lm> 1.23
## 4 FLS Sample 2 <lm> 1.28
## 5 NUE Sample 1 <lm> 1.31
## 6 NUE Sample 2 <lm> 1.19
## 7 ROK Sample 1 <lm> 1.13
## 8 ROK Sample 2 <lm> 1.24
## 9 VAR Sample 1 <lm> 0.982
## 10 VAR Sample 2 <lm> 0.831
## 11 VRSK Sample 1 <lm> 0.809
## 12 VRSK Sample 2 <lm> 0.868
## 13 WRK Sample 1 <lm> 1.38
## 14 WRK Sample 2 <lm> 1.18
## 15 XEC Sample 1 <lm> 1.46
## 16 XEC Sample 2 <lm> 1.10
## 17 XRX Sample 1 <lm> 1.26
## 18 XRX Sample 2 <lm> 1.05
## 19 ZBH Sample 1 <lm> 0.888
## 20 ZBH Sample 2 <lm> 0.944
```

As we can see, the output now shows the beta for all combinations between `ticker`

and `my.sample`

. It seems that the betas tend to be higher for *Sample 2*, meaning that the overall systematic risk in the market has increased over time, at least for the majority of the ten chosen stocks. Given the small sample of stocks, It might be interesting to test for this property in a larger dataset.

Back to the model, if you want more information about it, you can just write new lines in the last call to %>%. Let’s say, for example, that you want to get the value of alpha and the corresponding t-statistic of both coefficients. We can use the following code for that:

```
library(dplyr)
beta.tab <- df.stocks %>%
group_by(ticker) %>% # group by column ticker
do(ols.model = lm(data = ., formula = ret.adjusted.prices ~ret.MktIdx)) %>% # estimate model
mutate(beta = coef(ols.model)[2],
beta.tstat = summary(ols.model)[[4]][2,3],
alpha = coef(ols.model)[1],
alpha.tstat = summary(ols.model)[[4]][1,3]) # get coefficients
```

```
## Warning in summary.lm(ols.model): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(ols.model): essentially perfect fit: summary may be
## unreliable
```

`print(beta.tab)`

```
## Source: local data frame [10 x 6]
## Groups: <by row>
##
## # A tibble: 10 x 6
## ticker ols.model beta beta.tstat alpha alpha.tstat
## <chr> <list> <dbl> <dbl> <dbl> <dbl>
## 1 ^GSPC <lm> 1. 1.59e17 -5.47e-20 -1.01
## 2 FLS <lm> 1.25 2.22e 1 -5.02e- 4 -1.03
## 3 NUE <lm> 1.26 2.44e 1 3.45e- 5 0.0780
## 4 ROK <lm> 1.18 3.03e 1 2.23e- 4 0.667
## 5 VAR <lm> 0.915 2.20e 1 2.74e- 4 0.769
## 6 VRSK <lm> 0.836 2.33e 1 4.10e- 4 1.33
## 7 WRK <lm> 1.28 2.16e 1 -4.52e- 4 -0.873
## 8 XEC <lm> 1.30 1.81e 1 -5.78e- 4 -0.937
## 9 XRX <lm> 1.17 2.05e 1 -6.22e- 4 -1.27
## 10 ZBH <lm> 0.913 2.17e 1 -1.64e- 4 -0.451
```

In the previous code, I added line `beta.tstat = summary(ols.model)[[4]][2,3]`

that returns the t-statistic of the beta coefficient. The location of this parameter is found by investigating the elements of an object of type `lm`

. After calling `summary`

, the t-statistic is available in the fourth element of the `lm`

object, which is a matrix with several information from the estimation. The t-statistic for the alpha parameter is found in a similar way.

As you can see, the syntax of `dplyr`

make it easy to extend the model and quickly try new things. It is possible to do the same using other R functions and a loop, but using `dplyr`

is really handy.

# Conclusions

As you can probably suspect from the text, I’m a big fan of `dplyr`

and I’m always teaching its use to my students. While loops are ok and I personally use then a lot in more complex problems, the functions in `dplyr`

allow for an intuitive syntax in data processing, making it easy to understand and extend code.

Do notice that the code in this example is self contained and reproducible. If you want to try it for more stocks, just change input `n.chosen.stocks`

to a higher value.