Introduction In my latest post I wrote about package GetEdgarData, which downloaded structured data from the SEC. I’ve been working on this project and soon realized that the available data at the SEC/DERA section is not complete. For example, all Q4 statements are missing. This seems to be the way all exchanges release the financial documents. I’ve found the same problem here in the Brazilian exchange.
It came to my attention that there is an alternative way of fetching corporate data and adjusted prices, the SimFin project.
Introduction As of 2019-10-31, this package is discontinued and will not longer be updated. See this post for more details about the alternative, package simfinR.
Every company traded in the US stock market must report its quarterly and yearly documents to the SEC and the public in general. This includes its accounting statements (10-K, 10-K) and any other corporate event that is relevant to investors.
Edgar is the interface where we can search for a company’s filling information.
The shiny version of GetDFPData is currently hosted in a private server at DigitalOcean. A problem with the basic (5 USD) server I was using is with the low amount of available memory (RAM and HD). With that, I had to limit all xlsx queries for the data, otherwise the shiny app would ran out of memory. After upgrading R in the server, the xlsx option was no longer working.
Introduction Quandl is one of the best platforms for finding and downloading financial and economic time series. The collection of free databases is solid and I use it intensively in my research and class material.
But, a couple of things from the native package Quandl always bothered me:
Multiple data is always returned in the wide (column oriented) format (why??); No local caching of data; No control for importing error and status; Not easy to work within the tidyverse collection of packages As you suspect, I decided to tackle the problem over the weekend.
Update 2019-08-09: The shutdown is just postponed to 2019-11-14. See the official release here.
Surprise, surprise. B3’s ftp site is still up and running.
Following previous post regarding the shutdown of B3’s ftp site and its impact over GetHFData, I’m happy to report that the site is up and running.
We can check it with code:
library(GetHFData) library(tidyverse) df.ftp <- ghfd_get_ftp_contents(type.market = 'equity') ## ## Reading ftp contents for equity(trades) (attempt = 1|10) # check time difference max(df.
Well, bad news travels fast.
Over the last couple of weeks I’ve been receiving a couple of emails regarding B3’s decision of shutting down its ftp site. More specifically, users are eager to know how it will impact my data grabbing packages in CRAN. I’ll use this post to explain the situation for everyone.
The only package affected directly will be GetHFData, which uses the ftp site for downloading the raw zipped files with trades and quotes.
One of the investment concepts that every long term investor should know is the effect of consistency over corporate performance. The main idea is that older and profitable companies are likely to continue to be profitable and even improve its performance in the upcoming years. Likewise, companies with constant losses are likely to continue in the same path.
This idea is related to the Lindy Effect. Quoting directly from wikipedia:
I’m using R for at least five years and always been curious about its usage in Brazil. I see some minor personal evidence that the number of users is increasing over time. My book in portuguese is steadily increasing its sales, and I’ve been receiving far more emails about my R packages. Conference are also booming. Every year there are at least two or three R conferences in Brazil.
One of the subjects that I teach in my undergraduate finance class is the relationship between risk and expected returns. In short, the riskier the investment, more returns should be expected by the investor. It is not a difficult argument to make. All that you need to understand is to remember that people are not naive in financial markets. Whenever they make a big gamble, the rewards should also be large.
The Central Bank of Brazil (BCB) offers access to the SGS system (sistema gerenciador de series temporais) with a official API available here.
Over time, I find myself using more and more of the available datasets in my regular research and studies. Last weekend I decided to write my own API package that would make my life (and others) a lot easier.
Package GetBCBData can fetch data efficiently and rapidly:
BatchGetSymbols is my most downloaded package by any count. Computation time, however, has always been an issue. While downloading data for 10 or less stocks is fine, doing it for a large ammount of tickers, say the SP500 composition, gets very boring.
I’m glad to report that time is no longer an issue. Today I implemented a parallel option for BatchGetSymbols. If you have a high number of cores in your computer, you can seriously speep up the importation process.
In the last few weeks we’ve seen a great deal of controversy in Brazil regarding financial investments. Too keep it short, Empiricus, an ad-based company that massively sells online courses and subscriptions, posted a YouTube ad where a young girl, Bettina, says the following:
Hi, I'm Bettina, I am 22 years old and, starting with R$ 1,500, I now own R$ 1,042,000 of accumulated wealth. She later explains that she earned the money by investing in the stock market over three years.
I received many messages regarding my book promotion (see previous post ). I’ll use this post to answer the most frequent questions:
Does the paperback edition have a discount?
No. The price drop is only valid for the ebook edition but not by choice. Unfortunately, Amazon does not let me do countdown promotions for the paperback edition.
So, in favor of those, like myself, that like the smell of a fresh book page, I manually dropped the price of the paperback to 17.
I recently did a book promotion for my R book in portuguese and it was a big sucess!
My english book is now being sold with the same promotion. You can purchase it with a 50% discount if you buy it on the 10th day of march. See it here. The discount will be valid throughout the week, with daily price increases.
If you want to learn more about R and its use in Finance and Economics, this book is a great opportunity.
I just released a major update to package GetDFPData. Here are the main changes:
Naming conventions for caching system are improved so that it reflects different versions of FRE and DFP files. This means the old caching system no longer works. If you have built yourself your own cache folder with many companies, do clean up the cache by deleting all folders. Run your code again and it will rebuild all files.
At the end of every year I plan to write about the highlight of the current year and set plans for the future. First, let’s talk about my work in 2018.
Highlights of 2018 Research wise, my scientometrics paper Is predatory publishing a real threat? Evidence from a large database study was featured in many news outlets. Its altmetric page is doing great, with over 1100 downloads and featured at top 5% of all research output measured by altmetric.
I while ago I wrote about purchasing my own webserver in digital ocean and hosting my shinny applications. Last week I finally got some time to migrate my blog from Github to my new domain, www.msperlin.com. While doing that, I also decided to change the technology behind making the blog, from Jekyll to Hugo. Here are my reasons.
Jekyll is great for making simple static sites, specially with this template from Dean Attali.
I’ve been using Rstudio for a long time and I got some tricks to share. These are simple and useful commands and shortcuts that really help the productivity of my students. If you got a suggestion of trick, use the comment section and I’ll add it in this post.
Package rstudioapi When using Rstudio, package rstudioapi gives you lots of information about your session. The most useful one is the script location.
Loops in R First, if you are new to programming, you should know that loops are a way to tell the computer that you want to repeat some operation for a number of times. This is a very common task that can be found in many programming languages. For example, let’s say you invited five friends for dinner at your home and the whole cost of four pizzas will be split evenly.
Its been a while since I develop a CRAN package and this weekend I decided to work on a idea I had some time ago. The result is package PkgsFromFiles.
When working with different computers at home or work, one of the problems I have is installing missing packages across different computers. As an example, a script that works in my work computer may not work in my home computer.
Last year I released GetLattesData. This package is very handy for anyone that researches bibliometric data of Brazilian scholars. You could easily import the whole academic history of any researcher registered at the platform. More details about Lattes and GetLattesData in the this post.
However, a couple months ago CNPQ introduced a captcha in the webpage. This made it impossible to download the xml files directly, breaking my code. It seems that those changes are now permanent.
One of the main requests I get for package BatchGetSymbols is to add the choice of frequency of the financial dataset. Today I finally got some time to work on it. I just posted a new version of BatchGetSymbols in CRAN. The major change is that users can now set the time frequency of the financial data: dailly, weekly, monthly or yearly. Let’s check it out:
library(BatchGetSymbols) ## Loading required package: rvest ## Loading required package: xml2 ## Loading required package: dplyr ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ## library(purrr) ## ## Attaching package: 'purrr' ## The following object is masked from 'package:rvest': ## ## pluck library(ggplot2) my.
I recently bought a new computer for home and it came with two drives, one HDD and other SSD. The later is used for the OS and the former stores all of my personal files. From all computers I had, both home and work, this is definitely the fastest. While some of the merits are due to the newer CPUS and RAM, the SSD drive can make all the difference in file operations.
It is with great pleasure that I announce the second edition of the portuguese version of my book, Processing and Analyzing Financial Data with R. This edition updates the material significantly. The portuguese version is now not only in par with the international version of the book, but much more!
Here are the main changes:
The structure of chapters changed towards the stages of a research, from obtaining the raw data, cleaning it, manipulating it and, finally, reporting tables and figures.
I often get asked about how to invest in the stock market. Not surprisingly, this has been a common topic in my classes. Brazil is experiencing a big change in its financial scenario. Historically, fixed income instruments paid a large premium over the stock market and that is no longer the case. Interest rates are low, without the pressure from inflation. This means a more sustainable scenario for low-interest rates in the future.
My paper about the penetration of predatory journals in Brazil, Is predatory publishing a real threat? Evidence from a large database study, just got published in Scientometrics!. The working paper version is available in SSRN.
This is a nice example of a data-intensive scientific work cycle, from gathering data to reporting results. Everything was done in R, using web scrapping algorithms, parallel processing, tidyverse packages and more. This was a special project for me, given its implications in science making in Brazil.
Back in 2007 I wrote a Matlab package for estimating regime switching models. I was just starting to learn to code and this project was my way of doing it. After publishing it in FEX (Matlab file exchange site), I got so many repeated questions on my email that eventually I realized it would be easier to write a manual for people to read. Some time and effort would be spend writing it, but less time replying to repeated questions on my email.
I just released a long due update to package BatchGetSymbols. The files are under review in CRAN and you should get the update soon. Meanwhile, you can install the new version from Github:
if (!require(devtools)) install.packages('devtools') devtools::install_github('msperlin/BatchGetSymbols') The main innovations are:
Clever cache system: By default, every new download of data will be saved in a local file located in a directory chosen by user. Every new request of data is compared to the available local information.
My blog in 2017 As we come close to the end of 2017, its time to look back. This has been a great year for me in many ways. This blog started as a way to write short pieces about using R for finance and promote my book in an organic way. Today, I’m very happy with my decision. Discovering and trying new writing styles keeps my interest very much alive.
In this post I’ll share my experience in setting up my own virtual server for hosting shiny applications in Digital Ocean. First, context. I’m working in a academic project where we build a package for accessing financial data and corporate events directly from B3, the Brazilian financial exchange. The objective is to set a reproducible standard and facilite data acquisition of a large, and very interesting, dataset. The result is GetDFPData.
Financial statements of companies traded at B3 (formerly Bovespa), the Brazilian stock exchange, are available in its website. Accessing the data for a single company is straightforward. In the website one can find a simple interface for accessing this dataset. An example is given here. However, gathering and organizing the data for a large scale research, with many companies and many dates, is painful. Financial reports must be downloaded or copied individually and later aggregated.
The latest version of GetTDData offers function get.yield.curve to download the current Brazilian yield curve directly from Anbima. The yield curve is a financial tool that, based on current prices of fixed income instruments, shows how the market perceives the future real, nominal and inflation returns. You can find more details regarding the use and definition of a yield curve in Investopedia.
Unfortunately, function get.yield.curve only downloads the current yield curve from the website.
Setting a name for a CRAN package is an intimate process. Out of an infinite range of possibilities, an idea comes for a package and you spend at least a couple of days writing up and testing your code before submitting to CRAN. Once you set the name of the package, you cannot change it. Your choice index your effort and, it shouldn’t be a surprise that the name of the package can improve its impact.
I am very please to announce that my book,Processing and Analyzing Financial Data with R, is finally out! This book is an english version of my previous title in portuguese. This is a long term project that I plan to keep on working over the years.
You can find it in Amazon. Following great titles about R, I decided to also publish an online version with full content here. More details about the book, including table of contents, is availabe in its webpage.
Facebook recently released a API package allowing access to its forecasting model called prophet. According to the underling post:
It's not your traditional ARIMA-style time series model. It's closer in spirit to a Bayesian-influenced generalized additive model, a regression of smooth terms. The model is resistant to the effects of outliers, and supports data collected over an irregular time scale (ingliding presence of missing data) without the need for interpolation.
Many people, including my university colleagues and friends, have asked me about the process of writing a book and self publishing it in Amazon. You can find the details about the book here and here. Given so much interest, I’m going to report the whole process in this post.
First, motivation. Why did I write a book?
I am a university professor. Writing is a major part of my work and I really enjoy it.
In the previous post about tennis, we studied how changes in ball’s composition in hard and grass courts affected the game back in 2000. In this post, we will analyse a different dataset from the same repository and look at the players winning records in ATP matches.
The data I’m again using the great repository of tennis data of Jeff Sackmann. In this case, however, I’m using the ATP repository that contains ATP match data since 1968 until today.
Part of my job as a researcher and teacher is to periodically apply and grade exams in my classroom. Being constantly in the shoes of an examiner, you soon quickly realize that students are clever in finding ways to do well in an exam without effort. These days, photos and pdf versions of past exams and exercises are shared online in facebook, whatsapp groups, instagram and what not. As weird as it may sound, the distribution of information in the digital era creates a problem for examiners.
One of the first examples about using linear regression models in finance is the calculation of betas, the so called market model. Coefficient beta is a measure of systematic risk and it is calculated by estimating a linear model where the dependent variable is the return vector of a stock and the explanatory variable is the return vector of a diversified local market index, such as SP500 (US), FTSE (UK), Ibovespa (Brazil), or any other.
Recently, Bovespa, the Brazilian financial exchange company, allowed external access to its ftp site. In this address one can find several information regarding the Brazilian financial system, including datasets with high frequency (tick by tick) trading data for three different markets: equity, options and BMF.
Downloading and processing these files, however, can be exausting. The dataset is composed of zip files with the whole trading data, separated by day and market.