Using R to download high frequency trade data diretcly from Bovespa

Using package GetHFData

Recently, Bovespa, the Brazilian financial exchange company, allowed external access to its ftp site. In this address one can find several information regarding the Brazilian financial system, including datasets with high frequency (tick by tick) trading data for three different markets: equity, options and BMF.

Downloading and processing these files, however, can be exausting. The dataset is composed of zip files with the whole trading data, separated by day and market. These files are huge in size and processing or aggregating them in a usefull manner requires specific knowledge for the structure of the dataset.

The package GetHFData make is easy to access this dataset directly by allowing the easy importation and aggregations of it. Based on this package the user can:

  • Access the contents of the Bovespa ftp using function function ghfd_get_ftp_contents
  • Get the list of available ticker in the trading data using ghfd_get_available_tickers_from_ftp
  • Download individual files using ghfd_download_file
  • Download and process a batch of dates and assets codes with ghfd_get_HF_data

More details about the usage of the package can be found in my SSRN paper. Next, we present an example of how to use the package in an empirical application. This particular example is based in @perlin2016gethfdata.

Liquidity and the time of the day

In order to illustrate the usage of the software with aggregated data, the chosen problem is the analysis of the intraday U shaped pattern of liquidity in the equity market. This particular issue has been found and discussed in several papers from the literature such as @admati1988theory, @back1998long, @engle1998autoregressive, @gross2011machines, among many others.

The data used in this empirical study is related to the six most traded assets in the period of fifteen trading days from 2016-09-12 until 2016-09-30. The use of a small time period is not accidental. We chose to keep fifteen days as it facilitates the replication of the example by decreasing the time needed to download the dataset.

The first step is to select the liquid assets to run the empirical research. To do that, we select the six most traded assets in the last date of the study (2016-09-30) by checking the available tickers from the ftp site in this date. The following code executes this procedure.

library(GetHFData)

n.assets <- 6
my.date <- as.Date('2018-10-01')
type.market <- 'equity'

df.tickers <- ghfd_get_available_tickers_from_ftp(my.date = my.date, 
                                                  type.market = type.market)
## 
## Reading ftp contents for equity(trades) (attempt = 1|10)
## Warning in ghfd_get_available_tickers_from_ftp(my.date = my.date, type.market
## = type.market): Cant find date 2018-10-01 in ftp. Selecting the closest date,
## 2019-06-25
##  Attempt 1

As explained before, function ghfd_get_available_tickers_from_ftp will output a vector with the number of trades for each ticker found in the dataset. As a robustness check, we can use package ggplot2 [@ggplot2] to create a figure to illustrate the number of trades for each of the 25 most traded stocks in the date of 2016-09-30, as shown next.

library(ggplot2)

temp.df <- df.tickers[1:25, ]

p <- ggplot(temp.df, aes(x = reorder(tickers, -n.obs), y = n.obs)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = 'Tickers', y = 'Number of trades')

print(p)

From the previous figure we can see that the six most traded assets in 2018-10-01 are PETR4, BBAS3, BOVA11, B3SA3, ITUB4, BBDC4. A particular feature of the high frequency data from Brazil is that the liquidity is disperse and decreases rapidly across the assets, as we can see from the previosu graphic. Even though we are only looking at trading data for one day, we can expect that the number of trades will also drop quickly in other time periods.

From the programming side, the dataframe df.tickers is already sorted by the number of trades so, in order to select the six most traded assets, we select the first six elements of df.tickers$tickers.

my.assets <- df.tickers$tickers[1:n.assets]

And now we can print it to check its content:

print(my.assets)
## [1] PETR4  BBAS3  BOVA11 B3SA3  ITUB4  BBDC4 
## 527 Levels: AALR3 AAPL34 ABCB4 ABCP11 ABEV3 ADHM3 AFLT3 AGRO3 ALMI11 ... XTED11

We continue the empirical example by using the package GetHFData to download and aggregate the desired information for later analysis. The first step in this stage is to set the options for downloading the dataset. Notice that it is good policy to set the object my.folder as the name of a folder in the computer’s hard disk where the user has writing permission in order to download the files. We set an example path as PATH TO YOUR FOLDER HERE. We make it clear that the user has to modify this object in order for the code to run without error. Users in the Windows platform should be aware that the folder path has to be set using forward slashes (\(/\)) and not backslashes, which is the default.

As for the intraday time periods, we use a first time of 10:30:00 and last as 16:30:00 in order to avoid the trading noise from the opening and closing of the market, which could bias our results. The options used with GeHFData are set as follows.

my.folder<-'PATH TO YOUR FOLDER HERE'
#setwd(my.folder)

first.time <- '10:30:00'
last.time <- '16:30:00'

first.date <- as.Date('2019-09-12')
last.date <- as.Date('2019-09-20')
type.output <- 'agg'
agg.diff <- '15 min'

my.assets <- c("ITSA4", "PETR4", "ITUB4", "BBDC4", "ABEV3", "BBSE3")
type.market <- 'equity'

After setting the inputs, we now use function ghfd_get_HF_data to download and aggregate the financial data.

 df.out <- ghfd_get_HF_data(my.assets = my.assets,
                           type.market = type.market,
                           first.date = first.date,
                           last.date = last.date,
                           first.time = first.time,
                           last.time = last.time,
                           type.output = type.output,
                           agg.diff = agg.diff)
## 
## Running ghfd_get_HF_Data for:
##    type.market = equity
##    type.data = trades
##    my.assets = ITSA4 PETR4   ITUB4   BBDC4   ABEV3   BBSE3
##    type.output = agg
##       agg.diff = 15 min
## Reading ftp contents for equity(trades) (attempt = 1|10)
##    Found  96  files in ftp
##    First available date in ftp:  2019-07-01
##    Last available date in ftp:   2019-11-12
##    First date to download:  2019-09-12
##    Last date to download:   2019-09-20
## Downloading ftp files/NEG_20190912.gz (1|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1540631 lines, 542 unique tickers
##    -> Processing file - Found 169814 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190913.gz (2|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1424364 lines, 538 unique tickers
##    -> Processing file - Found 196135 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190916.gz (3|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1689542 lines, 532 unique tickers
##    -> Processing file - Found 258701 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190917.gz (4|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1466912 lines, 552 unique tickers
##    -> Processing file - Found 165068 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190918.gz (5|7) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1364363 lines, 536 unique tickers
##    -> Processing file - Found 147674 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190919.gz (6|7) Attempt 1
##    -> Reading files - Imported  1607724 lines, 538 unique tickers
##    -> Processing file - Found 169585 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20190920.gz (7|7) Attempt 1
##    -> Reading files - Imported  1568240 lines, 536 unique tickers
##    -> Processing file - Found 159072 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows

We point out that the previous code will take some time to finish as it has to download and read several large files from Bovespa ftp site. Once it is finished, we can check the output of ghfd_get_HF_data by calling function str for the object df.out, which will show the textual representation of the object in the R environment.

str(df.out)
## 'data.frame':    1008 obs. of  13 variables:
##  $ InstrumentSymbol: chr  "ABEV3" "ABEV3" "ABEV3" "ABEV3" ...
##  $ SessionDate     : Date, format: "2019-09-12" "2019-09-12" ...
##  $ TradeDateTime   : POSIXct, format: "2019-09-12 10:30:00" "2019-09-12 10:45:00" ...
##  $ n.trades        : int  911 979 1222 1619 1717 1582 1255 1411 1234 1531 ...
##  $ last.price      : num  19.1 19.1 19.2 19.3 19.3 ...
##  $ weighted.price  : num  19.2 19.1 19.2 19.2 19.3 ...
##  $ period.ret      : num  0 -0.000522 0.004182 0.003123 0.002076 ...
##  $ period.ret.volat: num  0.000244 0.000272 0.000235 0.000271 0.000308 ...
##  $ sum.qtd         : num  732600 890100 1174100 1192200 1160700 ...
##  $ sum.vol         : num  14034611 17012274 22498051 22942758 22400181 ...
##  $ n.buys          : int  513 444 582 1000 1017 727 821 783 632 874 ...
##  $ n.sells         : int  398 535 640 619 700 855 434 628 602 657 ...
##  $ Tradetime       : chr  "10:30:00" "10:45:00" "11:00:00" "11:15:00" ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  5 variables:
##   ..$ row     : int  1540639 1540639
##   ..$ col     : chr  "SessionDate" NA
##   ..$ expected: chr  "date like " "18 columns"
##   ..$ actual  : chr  "RT NEG                 2019-09-12 2019-09-12 001540640" "1 columns"
##   ..$ file    : chr  "'ftp files/NEG_20190912.gz'" "'ftp files/NEG_20190912.gz'"

As described earlier, the object returned from ghfd_get_HF_data is a dataframe with several columns calculated from the raw data. Notice that the columns already have the correct class, which facilitates the future manipulation of the data.

Once the data is available, we proceed to the analysis of the intraday pattern of liquidity. To do so, we use the number of trades as a proxy for liquidity. The analysis will be based on the visual examination of a figure that relates the distribution of number of trades to the time of the day. Since the number of trades are not comparable across assets, we plot the same figure for different stocks. Next, we show the R code that creates the figure based on the ggplot2 package.

p <- ggplot(df.out, aes(x =  Tradetime, y = n.trades)) + 
  geom_boxplot() + coord_cartesian(ylim = c(0, 3000)) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  facet_wrap(~InstrumentSymbol) + 
  labs(y='Number of Trades', x = 'Time of Day')

print(p)

In the previous figure we show the number of trades as a function of the time of the day. As expected, we find that the intraday shape of liquidity follows a U pattern, that is, the number of trades rises in the beginning and ending of the day, with the smallest value around 13:15:00. Such a pattern is found for the great majority of the assets.

This result is supported by previous findings in the literature [@engle1998autoregressive,@gross2011machines]. In the beginning of the trading day, a significant volume of overnight information is priced at the market, which justifies the increase of the number of trades. As for the end of the day, the higher volume of trades can be explained as a inventory strategy by the investors or market makers, which aims to finish the day with null portfolio positions in order to avoid the overnight risk. Since the decrease of portfolio size is achieved with more trades, we see a significant increase of negotiations at the end of the day. Interestingly, this pattern for liquidity is correlated to a pattern of intraday volatility [@andersen1997intraday].

Avatar
Marcelo S. Perlin
Associate Professor of Finance

Related

comments powered by Disqus