Major update to BatchGetSymbols

Making it even easier to download and organize stock prices from Yahoo Finance

I just released a long overdue update to the BatchGetSymbols package. The files are under review at CRAN and you should get the update soon. Meanwhile, you can install the new version from GitHub:

if (!require(devtools)) install.packages('devtools')
devtools::install_github('msperlin/BatchGetSymbols')

The main innovations are:

  • Clever cache system: By default, every new download of data is saved to a local file in a directory chosen by the user. Every new data request is compared with the information available locally, and the function downloads only the piece that is missing. This makes calls to BatchGetSymbols a lot faster, particularly when updating an existing dataset of prices.

  • Returns calculation: The function now includes return vectors in df.tickers. Returns are used a lot more than prices in research, so there is no reason to keep them out of the output.

  • Wide format: Added a function for converting data to the wide format. In some situations, such as portfolio analysis, the wide format makes a lot of sense and is required by some methodologies.

  • Ibovespa composition: Added a function for downloading the current Ibovespa composition directly from the Bovespa website.
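To make the cache idea concrete, here is a minimal sketch of the comparison step: given a requested date range and the range already stored locally, work out which pieces still need to be downloaded. The helper `missing.ranges` and its arguments are hypothetical names made up for this illustration; this is not the code used inside the package.

```r
# Hypothetical helper (not part of BatchGetSymbols): return the date
# sub-ranges of a request that are not covered by the cached range.
missing.ranges <- function(req.first, req.last, cache.first, cache.last) {
  req.first   <- as.Date(req.first)
  req.last    <- as.Date(req.last)
  cache.first <- as.Date(cache.first)
  cache.last  <- as.Date(cache.last)

  # no overlap with the cache: the whole request must be downloaded
  if (req.last < cache.first || req.first > cache.last) {
    return(data.frame(first = req.first, last = req.last))
  }

  first <- as.Date(character(0))
  last  <- as.Date(character(0))

  # piece before the cached range
  if (req.first < cache.first) {
    first <- c(first, req.first)
    last  <- c(last, cache.first - 1)
  }
  # piece after the cached range
  if (req.last > cache.last) {
    first <- c(first, cache.last + 1)
    last  <- c(last, req.last)
  }

  data.frame(first = first, last = last)
}

# cache holds the first half of 2017, request covers the full year
todo <- missing.ranges('2017-01-01', '2018-01-01',
                       '2017-01-01', '2017-06-30')
print(todo)
```

In this toy case only the second half of the requested period is left to download, which is where the time saving comes from.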

In the next chunks of code I show some of the innovations:

library(BatchGetSymbols)
## Loading required package: rvest
## Loading required package: xml2
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
# get S&P 500 tickers
my.tickers <- GetSP500Stocks()$tickers[1:5] # let's keep it light

# set dates
first.date <- '2017-01-01'
last.date <- '2018-01-01'

# set folder for cache system
my.temp.cache.folder <- 'BGS_CACHE'

# get data and time it
time.nocache <- system.time({
my.l <- BatchGetSymbols(tickers = my.tickers, first.date, last.date, 
                        cache.folder = my.temp.cache.folder, do.cache = FALSE)
})
## 
## Running BatchGetSymbols for:
##    tickers = MMM, ABT, ABBV, ABMD, ACN
##    Downloading data for benchmark ticker
## MMM | yahoo (1|5) - OK!
## ABT | yahoo (2|5) - OK!
## ABBV | yahoo (3|5) - Boa!
## ABMD | yahoo (4|5) - Nice!
## ACN | yahoo (5|5) - Nice!
time.withcache <- system.time({
my.l <- BatchGetSymbols(tickers = my.tickers, first.date, last.date, 
                        cache.folder = my.temp.cache.folder, do.cache = TRUE)
})
## 
## Running BatchGetSymbols for:
##    tickers = MMM, ABT, ABBV, ABMD, ACN
##    Downloading data for benchmark ticker | Not Cached
## MMM | yahoo (1|5) | Not Cached - Youre doing good!
## ABT | yahoo (2|5) | Not Cached - OK!
## ABBV | yahoo (3|5) | Not Cached - You got it!
## ABMD | yahoo (4|5) | Not Cached - Good job!
## ACN | yahoo (5|5) | Not Cached - Well done!
cat('\nTime with no cache:', time.nocache['elapsed'])
## 
## Time with no cache: 5.146
cat('\nTime with cache:', time.withcache['elapsed'])
## 
## Time with cache: 1.693
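
The general pattern behind a file-based cache can be sketched with base R alone. In the example below, `slow.fetch` is a made-up stand-in for a slow download and `fetch.cached` is a hypothetical wrapper; neither is a BatchGetSymbols function, and the package's own cache files may be organized differently.

```r
# Hypothetical stand-in for a slow download (not a BatchGetSymbols function)
slow.fetch <- function(ticker) {
  Sys.sleep(0.2)  # pretend network latency
  data.frame(ticker = ticker, price = c(10, 11, 12))
}

# Hypothetical wrapper: read from a local file if it exists,
# otherwise fetch the data and save it for next time
fetch.cached <- function(ticker, cache.folder) {
  if (!dir.exists(cache.folder)) dir.create(cache.folder, recursive = TRUE)
  f.cache <- file.path(cache.folder, paste0(ticker, '.rds'))

  if (file.exists(f.cache)) return(readRDS(f.cache))  # cache hit

  df.out <- slow.fetch(ticker)  # cache miss
  saveRDS(df.out, f.cache)
  df.out
}

# start from an empty cache folder inside the session's temp directory
toy.cache <- file.path(tempdir(), 'toy_cache')
unlink(toy.cache, recursive = TRUE)

t.first  <- system.time(d1 <- fetch.cached('AAA', toy.cache))['elapsed']
t.second <- system.time(d2 <- fetch.cached('AAA', toy.cache))['elapsed']

cat('\nFirst call :', t.first)
cat('\nSecond call:', t.second)
```

The second call skips the simulated download and only reads the saved file, so it should return the same data in a fraction of the time.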

Now let’s check the default output with data in the long format:

dplyr::glimpse(my.l)
## List of 2
##  $ df.control:'data.frame':  5 obs. of  6 variables:
##   ..$ ticker              : Factor w/ 5 levels "MMM","ABT","ABBV",..: 1 2 3 4 5
##   ..$ src                 : Factor w/ 1 level "yahoo": 1 1 1 1 1
##   ..$ download.status     : Factor w/ 1 level "OK": 1 1 1 1 1
##   ..$ total.obs           : int [1:5] 251 251 251 251 251
##   ..$ perc.benchmark.dates: num [1:5] 1 1 1 1 1
##   ..$ threshold.decision  : Factor w/ 1 level "KEEP": 1 1 1 1 1
##  $ df.tickers:'data.frame':  1255 obs. of  10 variables:
##   ..$ price.open         : num [1:1255] 179 178 178 177 178 ...
##   ..$ price.high         : num [1:1255] 180 179 179 179 178 ...
##   ..$ price.low          : num [1:1255] 177 178 177 176 177 ...
##   ..$ price.close        : num [1:1255] 178 178 178 178 177 ...
##   ..$ volume             : num [1:1255] 2509300 1542000 1447800 1625000 1617800 ...
##   ..$ price.adjusted     : num [1:1255] 171 171 170 171 170 ...
##   ..$ ref.date           : Date[1:1255], format: "2017-01-03" ...
##   ..$ ticker             : chr [1:1255] "MMM" "MMM" "MMM" "MMM" ...
##   ..$ ret.adjusted.prices: num [1:1255] NA 0.00152 -0.00342 0.00293 -0.00539 ...
##   ..$ ret.closing.prices : num [1:1255] NA 0.00152 -0.00342 0.00293 -0.00539 ...

Next, we convert the long data frame to the wide format:

l.wide <- reshape.wide(my.l$df.tickers) 

Now we check the matrix of prices:

print(head(l.wide$price.adjusted))
##     ref.date     ABBV   ABMD      ABT      ACN      MMM
## 1 2017-01-03 57.95297 112.36 37.47772 112.1350 170.6262
## 2 2017-01-04 58.77012 115.74 37.77525 112.4046 170.8849
## 3 2017-01-05 59.21585 114.81 38.10156 110.7197 170.3004
## 4 2017-01-06 59.23442 115.42 39.13807 111.9810 170.7987
## 5 2017-01-09 59.62442 117.11 39.09969 110.7293 169.8787
## 6 2017-01-10 59.49442 112.24 39.62754 110.7870 169.2175
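
For readers who want to see what a long-to-wide conversion amounts to, here is a small sketch using base R's `reshape()` on a few of the adjusted prices printed above. It only illustrates the idea for a single column; it is not the implementation of `reshape.wide`, and the object names are my own.

```r
# toy long-format data: two dates, two tickers
df.long <- data.frame(
  ref.date = rep(as.Date(c('2017-01-03', '2017-01-04')), times = 2),
  ticker = rep(c('ABT', 'MMM'), each = 2),
  price.adjusted = c(37.47772, 37.77525, 170.6262, 170.8849),
  stringsAsFactors = FALSE
)

# one row per date, one column per ticker
df.wide <- reshape(df.long, idvar = 'ref.date', timevar = 'ticker',
                   direction = 'wide')

# reshape() names the new columns 'price.adjusted.ABT', etc.;
# strip the prefix so the columns carry only the ticker
names(df.wide) <- sub('^price\\.adjusted\\.', '', names(df.wide))

print(df.wide)
```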

And the matrix of returns:

print(head(l.wide$ret.adjusted.prices))
##     ref.date          ABBV         ABMD           ABT           ACN
## 1 2017-01-03            NA           NA            NA            NA
## 2 2017-01-04  0.0141002957  0.030081853  0.0079387696  0.0024043154
## 3 2017-01-05  0.0075841938 -0.008035252  0.0086381959 -0.0149904655
## 4 2017-01-06  0.0003135985  0.005313126  0.0272039787  0.0113922416
## 5 2017-01-09  0.0065840607  0.014642203 -0.0009808097 -0.0111780039
## 6 2017-01-10 -0.0021803315 -0.041584860  0.0135001340  0.0005217229
##            MMM
## 1           NA
## 2  0.001516438
## 3 -0.003420717
## 4  0.002925953
## 5 -0.005386183
## 6 -0.003892418
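
The printed values appear consistent with simple (arithmetic) returns, that is, P_t / P_(t-1) - 1 computed separately for each ticker. The sketch below reproduces that calculation from a few of the adjusted prices shown earlier. It is my own illustration with a hypothetical helper `calc.ret`, not the package's internal code, and small differences in the last digits are expected because the printed prices are rounded.

```r
# adjusted prices for MMM and ABT copied from the wide-format output above
df <- data.frame(
  ticker = rep(c('MMM', 'ABT'), each = 3),
  ref.date = rep(as.Date(c('2017-01-03', '2017-01-04', '2017-01-05')),
                 times = 2),
  price.adjusted = c(170.6262, 170.8849, 170.3004,
                     37.47772, 37.77525, 38.10156),
  stringsAsFactors = FALSE
)

# simple return; the first observation has no previous price, hence NA
calc.ret <- function(p) c(NA, p[-1] / p[-length(p)] - 1)

# ave() applies calc.ret within each ticker, so a return never
# mixes the prices of two different stocks
df$ret.adjusted.prices <- ave(df$price.adjusted, df$ticker, FUN = calc.ret)

print(df)
```

Grouping by ticker is the important design choice here: without it, the first return of each stock would be computed against the last price of the previous one.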
Marcelo S. Perlin
Associate Professor of Finance
