I just released a major update to package GetDFPData. Here are the main changes:
Naming conventions for caching system are improved so that it reflects different versions of FRE and DFP files. This means the old caching system no longer works. If you have built yourself your own cache folder with many companies, do clean up the cache by deleting all folders. Run your code again and it will rebuild all files.
At the end of every year I plan to write about the highlight of the current year and set plans for the future. First, let’s talk about my work in 2018.
Highlights of 2018 Research wise, my scientometrics paper Is predatory publishing a real threat? Evidence from a large database study was featured in many news outlets. Its altmetric page is doing great, with over 1100 downloads and featured at top 5% of all research output measured by altmetric.
I while ago I wrote about purchasing my own webserver in digital ocean and hosting my shinny applications. Last week I finally got some time to migrate my blog from Github to my new domain, www.msperlin.com. While doing that, I also decided to change the technology behind making the blog, from Jekyll to Hugo. Here are my reasons.
Jekyll is great for making simple static sites, specially with this template from Dean Attali.
I’ve been using Rstudio for a long time and I got some tricks to share. These are simple and useful commands and shortcuts that really help the productivity of my students. If you got a suggestion of trick, use the comment section and I’ll add it in this post.
Package rstudioapi When using Rstudio, package rstudioapi gives you lots of information about your session. The most useful one is the script location.
Loops in R First, if you are new to programming, you should know that loops are a way to tell the computer that you want to repeat some operation for a number of times. This is a very common task that can be found in many programming languages. For example, let’s say you invited five friends for dinner at your home and the whole cost of four pizzas will be split evenly.
Its been a while since I develop a CRAN package and this weekend I decided to work on a idea I had some time ago. The result is package PkgsFromFiles.
When working with different computers at home or work, one of the problems I have is installing missing packages across different computers. As an example, a script that works in my work computer may not work in my home computer. This is specially annoying when I have a fresh install of the operating system or R.
Last year I released GetLattesData. This package is very handy for anyone that researches bibliometric data of Brazilian scholars. You could easily import the whole academic history of any researcher registered at the platform. More details about Lattes and GetLattesData in the this post.
However, a couple months ago CNPQ introduced a captcha in the webpage. This made it impossible to download the xml files directly, breaking my code. It seems that those changes are now permanent.
One of the main requests I get for package BatchGetSymbols is to add the choice of frequency of the financial dataset. Today I finally got some time to work on it. I just posted a new version of BatchGetSymbols in CRAN. The major change is that users can now set the time frequency of the financial data: dailly, weekly, monthly or yearly. Let’s check it out:
library(BatchGetSymbols) ## Loading required package: rvest ## Loading required package: xml2 ## Loading required package: dplyr ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ## library(purrr) ## ## Attaching package: 'purrr' ## The following object is masked from 'package:rvest': ## ## pluck library(ggplot2) my.
I recently bought a new computer for home and it came with two drives, one HDD and other SSD. The later is used for the OS and the former stores all of my personal files. From all computers I had, both home and work, this is definitely the fastest. While some of the merits are due to the newer CPUS and RAM, the SSD drive can make all the difference in file operations.
It is with great pleasure that I announce the second edition of the portuguese version of my book, Processing and Analyzing Financial Data with R. This edition updates the material significantly. The portuguese version is now not only in par with the international version of the book, but much more!
Here are the main changes:
The structure of chapters changed towards the stages of a research, from obtaining the raw data, cleaning it, manipulating it and, finally, reporting tables and figures.
I often get asked about how to invest in the stock market. Not surprisingly, this has been a common topic in my classes. Brazil is experiencing a big change in its financial scenario. Historically, fixed income instruments paid a large premium over the stock market and that is no longer the case. Interest rates are low, without the pressure from inflation. This means a more sustainable scenario for low-interest rates in the future.
My paper about the penetration of predatory journals in Brazil, Is predatory publishing a real threat? Evidence from a large database study, just got published in Scientometrics!. The working paper version is available in SSRN.
This is a nice example of a data-intensive scientific work cycle, from gathering data to reporting results. Everything was done in R, using web scrapping algorithms, parallel processing, tidyverse packages and more. This was a special project for me, given its implications in science making in Brazil.
Back in 2007 I wrote a Matlab package for estimating regime switching models. I was just starting to learn to code and this project was my way of doing it. After publishing it in FEX (Matlab file exchange site), I got so many repeated questions on my email that eventually I realized it would be easier to write a manual for people to read. Some time and effort would be spend writing it, but less time replying to repeated questions on my email.
I just released a long due update to package BatchGetSymbols. The files are under review in CRAN and you should get the update soon. Meanwhile, you can install the new version from Github:
if (!require(devtools)) install.packages('devtools') devtools::install_github('msperlin/BatchGetSymbols') The main innovations are:
Clever cache system: By default, every new download of data will be saved in a local file located in a directory chosen by user. Every new request of data is compared to the available local information.
My blog in 2017 As we come close to the end of 2017, its time to look back. This has been a great year for me in many ways. This blog started as a way to write short pieces about using R for finance and promote my book in an organic way. Today, I’m very happy with my decision. Discovering and trying new writing styles keeps my interest very much alive.
In this post I’ll share my experience in setting up my own virtual server for hosting shiny applications in Digital Ocean. First, context. I’m working in a academic project where we build a package for accessing financial data and corporate events directly from B3, the Brazilian financial exchange. The objective is to set a reproducible standard and facilite data acquisition of a large, and very interesting, dataset. The result is GetDFPData.
Financial statements of companies traded at B3 (formerly Bovespa), the Brazilian stock exchange, are available in its website. Accessing the data for a single company is straightforward. In the website one can find a simple interface for accessing this dataset. An example is given here. However, gathering and organizing the data for a large scale research, with many companies and many dates, is painful. Financial reports must be downloaded or copied individually and later aggregated.
The latest version of GetTDData offers function get.yield.curve to download the current Brazilian yield curve directly from Anbima. The yield curve is a financial tool that, based on current prices of fixed income instruments, shows how the market perceives the future real, nominal and inflation returns. You can find more details regarding the use and definition of a yield curve in Investopedia.
Unfortunately, function get.yield.curve only downloads the current yield curve from the website.
Setting a name for a CRAN package is an intimate process. Out of an infinite range of possibilities, an idea comes for a package and you spend at least a couple of days writing up and testing your code before submitting to CRAN. Once you set the name of the package, you cannot change it. Your choice index your effort and, it shouldn’t be a surprise that the name of the package can improve its impact.
I am very please to announce that my book,Processing and Analyzing Financial Data with R, is finally out! This book is an english version of my previous title in portuguese. This is a long term project that I plan to keep on working over the years.
You can find it in Amazon. Following great titles about R, I decided to also publish an online version with full content here. More details about the book, including table of contents, is availabe in its webpage.
Facebook recently released a API package allowing access to its forecasting model called prophet. According to the underling post:
It's not your traditional ARIMA-style time series model. It's closer in spirit to a Bayesian-influenced generalized additive model, a regression of smooth terms. The model is resistant to the effects of outliers, and supports data collected over an irregular time scale (ingliding presence of missing data) without the need for interpolation. The underlying calculation engine is Stan; the R and Python packages simply provide a convenient interface.
Many people, including my university colleagues and friends, have asked me about the process of writing a book and self publishing it in Amazon. You can find the details about the book here and here. Given so much interest, I’m going to report the whole process in this post.
First, motivation. Why did I write a book?
I am a university professor. Writing is a major part of my work and I really enjoy it.
In the previous post about tennis, we studied how changes in ball’s composition in hard and grass courts affected the game back in 2000. In this post, we will analyse a different dataset from the same repository and look at the players winning records in ATP matches.
The data I’m again using the great repository of tennis data of Jeff Sackmann. In this case, however, I’m using the ATP repository that contains ATP match data since 1968 until today.
Part of my job as a researcher and teacher is to periodically apply and grade exams in my classroom. Being constantly in the shoes of an examiner, you soon quickly realize that students are clever in finding ways to do well in an exam without effort. These days, photos and pdf versions of past exams and exercises are shared online in facebook, whatsapp groups, instagram and what not. As weird as it may sound, the distribution of information in the digital era creates a problem for examiners.
One of the first examples about using linear regression models in finance is the calculation of betas, the so called market model. Coefficient beta is a measure of systematic risk and it is calculated by estimating a linear model where the dependent variable is the return vector of a stock and the explanatory variable is the return vector of a diversified local market index, such as SP500 (US), FTSE (UK), Ibovespa (Brazil), or any other.
Recently, Bovespa, the Brazilian financial exchange company, allowed external access to its ftp site. In this address one can find several information regarding the Brazilian financial system, including datasets with high frequency (tick by tick) trading data for three different markets: equity, options and BMF.
Downloading and processing these files, however, can be exausting. The dataset is composed of zip files with the whole trading data, separated by day and market.