Using R to download high frequency trade data diretcly from Bovespa

Using package GetHFData

Recently, Bovespa, the Brazilian financial exchange company, allowed external access to its ftp site. In this address one can find several information regarding the Brazilian financial system, including datasets with high frequency (tick by tick) trading data for three different markets: equity, options and BMF.

Downloading and processing these files, however, can be exausting. The dataset is composed of zip files with the whole trading data, separated by day and market. These files are huge in size and processing or aggregating them in a usefull manner requires specific knowledge for the structure of the dataset.

The package GetHFData make is easy to access this dataset directly by allowing the easy importation and aggregations of it. Based on this package the user can:

  • Access the contents of the Bovespa ftp using function function ghfd_get_ftp_contents
  • Get the list of available ticker in the trading data using ghfd_get_available_tickers_from_ftp
  • Download individual files using ghfd_download_file
  • Download and process a batch of dates and assets codes with ghfd_get_HF_data

More details about the usage of the package can be found in my SSRN paper. Next, we present an example of how to use the package in an empirical application. This particular example is based in @perlin2016gethfdata.

Liquidity and the time of the day

In order to illustrate the usage of the software with aggregated data, the chosen problem is the analysis of the intraday U shaped pattern of liquidity in the equity market. This particular issue has been found and discussed in several papers from the literature such as @admati1988theory, @back1998long, @engle1998autoregressive, @gross2011machines, among many others.

The data used in this empirical study is related to the six most traded assets in the period of fifteen trading days from 2016-09-12 until 2016-09-30. The use of a small time period is not accidental. We chose to keep fifteen days as it facilitates the replication of the example by decreasing the time needed to download the dataset.

The first step is to select the liquid assets to run the empirical research. To do that, we select the six most traded assets in the last date of the study (2016-09-30) by checking the available tickers from the ftp site in this date. The following code executes this procedure.

library(GetHFData)
## Thank you for using GetHFData! More details about the package can be found in:
## 
##  http://bibliotecadigital.fgv.br/ojs/index.php/rbfin/article/view/64587/65702 
## 
##  If applicable, please use the following citations in your research report. Thanks! 
## 
## APA:
##  Perlin, M., Ramos, H. (2016). GetHFData: A R Package for Downloading and Aggregating High Frequency Trading Data from Bovespa. Brazilian Review of Finance, V. 14, N. 3. 
## 
## bibtex:
##  @article{perlin2016gethfdata,
##   title={GetHFData: A R Package for Downloading and Aggregating High Frequency Trading Data from Bovespa},
##   author={Perlin, Marcelo and Henrique, Ramos},
##   journal={Brazilian Review of Finance},
##   volume={14},
##   number={3},
##   year={2016},
##   publisher={Brazilian Society of Finance}
## }
n.assets <- 6
my.date <- as.Date('2018-10-01')
type.market <- 'equity'

df.tickers <- ghfd_get_available_tickers_from_ftp(my.date = my.date, 
                                                  type.market = type.market)
## 
## Reading ftp contents for equity(trades) (attempt = 1|10) Attempt 1 - File exists, skipping dl

As explained before, function ghfd_get_available_tickers_from_ftp will output a vector with the number of trades for each ticker found in the dataset. As a robustness check, we can use package ggplot2 [@ggplot2] to create a figure to illustrate the number of trades for each of the 25 most traded stocks in the date of 2016-09-30, as shown next.

library(ggplot2)

temp.df <- df.tickers[1:25, ]

p <- ggplot(temp.df, aes(x = reorder(tickers, -n.obs), y = n.obs))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p <- p + labs(x = 'Tickers', y = 'Number of trades')
print(p)

From the previous figure we can see that the six most traded assets in 2018-10-01 are QUAL3, PETR4, VVAR11, BBAS3, B3SA3, ITUB4. A particular feature of the high frequency data from Brazil is that the liquidity is disperse and decreases rapidly across the assets, as we can see from the previosu graphic. Even though we are only looking at trading data for one day, we can expect that the number of trades will also drop quickly in other time periods.

From the programming side, the dataframe df.tickers is already sorted by the number of trades so, in order to select the six most traded assets, we select the first six elements of df.tickers$tickers.

my.assets <- df.tickers$tickers[1:n.assets]

And now we can print it to check its content:

print(my.assets)
## [1] QUAL3  PETR4  VVAR11 BBAS3  B3SA3  ITUB4 
## 540 Levels: AALR3 AAPL34 ABBV34 ABCB10 ABCB4 ABCP11 ABEV3 ABTT34 ... XTED11

We continue the empirical example by using the package GetHFData to download and aggregate the desired information for later analysis. The first step in this stage is to set the options for downloading the dataset. Notice that it is good policy to set the object my.folder as the name of a folder in the computer’s hard disk where the user has writing permission in order to download the files. We set an example path as PATH TO YOUR FOLDER HERE. We make it clear that the user has to modify this object in order for the code to run without error. Users in the Windows platform should be aware that the folder path has to be set using forward slashes (\(/\)) and not backslashes, which is the default.

As for the intraday time periods, we use a first time of 10:30:00 and last as 16:30:00 in order to avoid the trading noise from the opening and closing of the market, which could bias our results. The options used with GeHFData are set as follows.

my.folder<-'PATH TO YOUR FOLDER HERE'
#setwd(my.folder)

first.time <- '10:30:00'
last.time <- '16:30:00'

first.date <- as.Date('2018-09-12')
last.date <- as.Date('2018-09-20')
type.output <- 'agg'
agg.diff <- '15 min'

my.assets <- c("ITSA4", "PETR4", "ITUB4", "BBDC4", "ABEV3", "BBSE3")
type.market <- 'equity'

After setting the inputs, we now use function ghfd_get_HF_data to download and aggregate the financial data.

 df.out <- ghfd_get_HF_data(my.assets = my.assets,
                           type.market = type.market,
                           first.date = first.date,
                           last.date = last.date,
                           first.time = first.time,
                           last.time = last.time,
                           type.output = type.output,
                           agg.diff = agg.diff)
## 
## Running ghfd_get_HF_Data for:
##    type.market = equity
##    type.data = trades
##    my.assets = ITSA4 PETR4   ITUB4   BBDC4   ABEV3   BBSE3
##    type.output = agg
##       agg.diff = 15 min
## Reading ftp contents for equity(trades) (attempt = 1|10)
##    Found  80  files in ftp
##    First available date in ftp:  2018-07-02
##    Last available date in ftp:   2018-10-24
##    First date to download:  2018-09-12
##    Last date to download:   2018-09-20
## Downloading ftp files/NEG_20180912.zip (1|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  978608 lines, 475 unique tickers
##    -> Processing file - Found 151681 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180913.zip (2|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  992576 lines, 530 unique tickers
##    -> Processing file - Found 140134 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180914.zip (3|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1009750 lines, 500 unique tickers
##    -> Processing file - Found 172014 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180917.zip (4|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  935071 lines, 499 unique tickers
##    -> Processing file - Found 153235 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180918.zip (5|7) Attempt 1
##    -> Reading files - Imported  1095981 lines, 532 unique tickers
##    -> Processing file - Found 184779 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180919.zip (6|7) Attempt 1
##    -> Reading files - Imported  1022844 lines, 509 unique tickers
##    -> Processing file - Found 154009 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20180920.zip (7|7) Attempt 1
##    -> Reading files - Imported  866078 lines, 503 unique tickers
##    -> Processing file - Found 131235 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows

We point out that the previous code will take some time to finish as it has to download and read several large files from Bovespa ftp site. Once it is finished, we can check the output of ghfd_get_HF_data by calling function str for the object df.out, which will show the textual representation of the object in the R environment.

str(df.out)
## 'data.frame':    1008 obs. of  13 variables:
##  $ InstrumentSymbol: chr  "ABEV3" "ABEV3" "ABEV3" "ABEV3" ...
##  $ SessionDate     : Date, format: "2018-09-12" "2018-09-12" ...
##  $ TradeDateTime   : POSIXct, format: "2018-09-12 10:30:00" "2018-09-12 10:45:00" ...
##  $ n.trades        : int  368 301 768 1068 1061 450 653 687 718 794 ...
##  $ last.price      : num  18.3 18.4 18.4 18.4 18.4 ...
##  $ weighted.price  : num  18.2 18.3 18.3 18.4 18.4 ...
##  $ period.ret      : num  0.0044 0.00328 0 0.00218 -0.00109 ...
##  $ period.ret.volat: num  0.000289 0.000256 0.000251 0.000236 0.000223 ...
##  $ sum.qtd         : num  182700 150900 5389300 770700 745600 ...
##  $ sum.vol         : num  3330899 2763266 98842512 14172885 13731413 ...
##  $ n.buys          : int  242 210 578 747 599 261 420 400 468 390 ...
##  $ n.sells         : int  126 91 190 321 462 189 233 287 250 404 ...
##  $ Tradetime       : chr  "10:30:00" "10:45:00" "11:00:00" "11:15:00" ...

As described earlier, the object returned from ghfd_get_HF_data is a dataframe with several columns calculated from the raw data. Notice that the columns already have the correct class, which facilitates the future manipulation of the data.

Once the data is available, we proceed to the analysis of the intraday pattern of liquidity. To do so, we use the number of trades as a proxy for liquidity. The analysis will be based on the visual examination of a figure that relates the distribution of number of trades to the time of the day. Since the number of trades are not comparable across assets, we plot the same figure for different stocks. Next, we show the R code that creates the figure based on the ggplot2 package.

p <- ggplot(df.out, aes(x =  Tradetime, y = n.trades))
p <- p + geom_boxplot() + coord_cartesian(ylim = c(0, 3000))
p <- p  + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p <- p + facet_wrap(~InstrumentSymbol)
p <- p + labs(y='Number of Trades', x = 'Time of Day')
print(p)

In the previous figure we show the number of trades as a function of the time of the day. As expected, we find that the intraday shape of liquidity follows a U pattern, that is, the number of trades rises in the beginning and ending of the day, with the smallest value around 13:15:00. Such a pattern is found for the great majority of the assets.

This result is supported by previous findings in the literature [@engle1998autoregressive,@gross2011machines]. In the beginning of the trading day, a significant volume of overnight information is priced at the market, which justifies the increase of the number of trades. As for the end of the day, the higher volume of trades can be explained as a inventory strategy by the investors or market makers, which aims to finish the day with null portfolio positions in order to avoid the overnight risk. Since the decrease of portfolio size is achieved with more trades, we see a significant increase of negotiations at the end of the day. Interestingly, this pattern for liquidity is correlated to a pattern of intraday volatility [@andersen1997intraday].

Avatar
Marcelo S. Perlin
Associate Professor of Finance

Related

comments powered by Disqus