Analysis of High Intensity Interval Training with R

High Intensity Interval Training (HIIT) has gained great prominence in recent years. This type of training intersperses short periods of very intense exercise with periods of gentler exercise to recover. Rachel is a fan and uses a version where she does six one-minute all-out sprints separated by five one-minute recovery walks. The running app we use (Runkeeper) produces a graph of speed (or pace) vs time, but smooths it into a sinusoidal curve rather than something more akin to a square wave. She asked me if I could do some analysis. In particular, she wanted to know whether her first one-minute sprint, when she was fresh, was faster than the rest.
Fortunately it is very easy to download the gpx file for the run from the Runkeeper website. Reading it into R is straightforward using readGPX from the plotKML package. A bit of processing gives a data frame of longitude, latitude, elevation and time. Adding lagged columns for longitude, latitude and time gives me everything I need in each row of the data frame to calculate the speed between that point and the previous one.
Plotting speed vs time shows the expected pattern, with the periods of running and walking very clear. There is a fair amount of scatter, presumably because GPS is not that accurate over short times and distances – hence the smoothing in the first place.
I then used an if_else to add a new variable: "R" if the speed was at least 9 kph and "W" if it was less. I also removed the final anomalous point. Grouping by this Run/Walk variable gave me:
# A tibble: 2 x 3
  st    meanspeed meanpace
1 R         12.8      4.81
2 W          6.05    10.5
The challenge then is to look at each 1-minute section separately. To do this I introduced a new variable which is 1 if "R" has changed to "W" (or vice versa) and 0 otherwise. The final new variable is a cumulative total of these 1s and 0s. Now we are in business! Summarising on this final variable gives me the speed for each of the 11 stages:
# A tibble: 11 x 4
   Stage Type  meanspeed meanpace
1     1 R         14.4      4.29
2     2 W          6.35     9.92
3     3 R         12.3      4.93
4     4 W          5.73    11.4
5     5 R         12.5      4.90
6     6 W          6.38     9.87
7     7 R         12.6      4.83
8     8 W          6.03    10.8
9     9 R         12.1      4.99
10    10 W          5.66    11.0
11    11 R         12.6      4.93
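To see how the stage numbering works, here is the change-flag/cumsum construction on a toy run/walk vector (a base-R sketch of the dplyr logic in the complete code further down):

```r
# Toy run/walk classification for seven GPS points
st <- c("R", "R", "W", "W", "W", "R", "R")

# 1 where the state differs from the previous point, 0 otherwise
change <- c(0, as.integer(st[-1] != st[-length(st)]))

# the cumulative sum then gives a distinct number for each stage
Stage <- cumsum(change)
Stage  # 0 0 1 1 1 2 2
```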
So I’ve answered Rachel’s question. She was quite a lot faster in her first sprint and then remarkably consistent over the remaining five. Putting the average data onto the plot was more difficult than I expected. The obvious solution is to draw line segments for each of the 11 stages, which requires the starting and finishing x and y values, which I put into a separate data frame. The elegant solution is to draw a trend line using a linear model but force it to ignore the value of time. If you run a linear model and tell it that the x variable has no power to explain anything, it will just give you the mean of the y values, because in the absence of anything else that is the best guess.
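The intercept-only trick is easy to verify on a toy vector: fitting y ~ 1 returns the mean of y as the intercept.

```r
y <- c(2, 4, 6, 8, 10)
fit <- lm(y ~ 1)   # intercept-only model: the x variable is ignored
coef(fit)          # (Intercept) = 6, which is exactly mean(y)
```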


It’s not perfect. Some of the points were captured at the transition between running and walking, and I also ignored the few seconds between each section. Calculating a mean speed for a stage by averaging the instantaneous speeds is only strictly accurate if each point is recorded at regular time intervals. Doing the calculation by summing all the distances and dividing by the total time came very close (less than 5% difference) to the more straightforward calculation of the mean.
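The size of that discrepancy is easy to see with made-up numbers for three unevenly spaced points:

```r
d <- c(0.050, 0.030, 0.060)  # km covered in each interval (made up)
t <- c(15, 12, 18)           # seconds per interval (made up, uneven)

mean(d / t * 3600)           # mean of instantaneous speeds: 11 km/h
sum(d) / sum(t) * 3600       # total distance / total time: 11.2 km/h
```

With equal intervals the two answers would be identical; here they differ by just under 2%.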
I also read the gpx file into the excellent GPS Track Editor to check that the distances between each point were correct and that the route plotted on the map was reasonably accurate.
Here is the complete code. It assumes you have downloaded your gpx file (HIIT.gpx) into your working directory. It’s entirely possible that the gpx files created by other apps have a slightly different structure, so you may need to adapt the code accordingly. The distance is calculated using the Haversine formula.

great_circle_distance <- function(lat1, long1, lat2, long2) {
  # inputs are in degrees; /360*2*pi converts to radians
  a <- sin(0.5 * (lat2 - lat1) / 360 * 2 * pi)
  b <- sin(0.5 * (long2 - long1) / 360 * 2 * pi)
  # 12742 km is the Earth's diameter
  12742 * asin(sqrt(a * a + cos(lat1 / 360 * 2 * pi) *
                      cos(lat2 / 360 * 2 * pi) * b * b))
}

gpx.raw <- readGPX("HIIT.gpx")
df <-[[4]][[1]][[1]])  # tracks element: first track, first segment
colnames(df) <- c("lon","lat","ele","time")

df2 <- df %>%
  mutate(ele = as.numeric(ele),
         time = ymd_hms(time),
         lon2 = lag(lon),
         lat2 = lag(lat),
         time2 = lag(time),
         deltat = time - time2,
         dist = great_circle_distance(lat, lon, lat2, lon2),
         speed = dist / as.numeric(deltat) * 3600,
         pace = as.numeric(deltat) / dist / 60,
         st = if_else(speed >= 9, "R", "W"),
         change = if_else(st == lag(st), 0, 1)) %>%
  drop_na() %>%
  mutate(Stage = cumsum(change))

df2 %>% ggplot(aes(time, speed)) + geom_line(colour = "grey") + geom_point()

df2 %>% filter(pace < 22) %>% group_by(st) %>% summarise(meanspeed = mean(speed), meanpace = mean(pace))

df2 %>%  filter(pace < 22) %>% group_by(Stage) %>% summarise(Type = first(st),  meanspeed = mean(speed), meanpace = mean(pace))

df2 %>% filter(pace < 22) %>%
  ggplot(aes(time, speed, colour = st, group = Stage)) +
  geom_line(aes(time, speed),colour = "grey") +
  geom_point() +
  stat_smooth(method="lm", formula=y~1, se=F, colour = "black") +
  labs(title = "Rachel's HIIT 22/4/20", y = "Speed in km/h")

A Quick Comparison Plot using LSE Share Prices Sourced from Quandl

If you have read my last blog post you will know that I have now ditched Yahoo Finance as my information source for UK shares because of data quality problems. Instead I’m using Quandl. The Quandl website has a nice plotting facility, but unfortunately only for a single share, making comparisons difficult. This short R script obtains data from Quandl (for as many securities as you want) and then does a quick comparison plot. You can change the frequency of plotting and the start date. Because it is specifically for comparison, the script discards rows of data where information is missing. Here is the script for ISF (ISH CORE FTSE100 ETF) and CUKX (ISH FTSE100 ACC ETF). If you ever needed an example of the power of reinvesting dividends, this has to be a good one!


library(Quandl)
library(xts)
library(PerformanceAnalytics)  # for chart.TimeSeries

tickers <- c("ISF", "CUKX")  ## put as many securities as you want here
mydata <- Quandl(paste0("LSE/", tickers, ".1"), type = "xts", api_key = "XXXXXX")
mydata <- na.omit(mydata)
colnames(mydata) <- tickers

startdate <- "2010-01-01/"  
mydataused <- mydata[startdate]
indx <- xts(apply(mydataused, 2, function(x) x / x[1]), = index(mydataused))

indx2plt <- indx[endpoints(indx, on = "months"), ]   # "days", "weeks", "months", "quarters" or "years"

chart.TimeSeries(indx2plt, main = "Cumulative Returns", legend.loc = "topleft",
                 minor.ticks = F, lwd = 2)
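The rebasing step (x / x[1]) divides every observation by the first one in the chosen window, so each series starts at 1 and the lines are directly comparable. A minimal check on a toy price matrix:

```r
# two toy price series, three dates each
prices <- matrix(c(100, 110, 121,
                    50,  55,  66), ncol = 2)
indexed <- apply(prices, 2, function(x) x / x[1])
indexed[2, ]  # both series sit at 1.10 after one step
```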



Time to ditch Yahoo Finance as the go-to source for stock price data?

I am a UK-based private investor. I like to do my own analysis of potential stock purchases, and R is my tool of choice for much of this. One of the biggest problems as a private individual is obtaining high quality data – I just can’t justify the cost of a commercial data supplier. The normal approach is to use Yahoo Finance to download the data you need, and this can easily be automated in R – either manually or using the quantmod package. However, I’m increasingly frustrated by Yahoo Finance. In fairness, I’m only concerned with shares and ETFs that trade on the London Stock Exchange, and it’s entirely possible that Yahoo doesn’t have these problems for US quoted shares. My main concerns are as follows.

1. Data Quality

I’ve written about this before. Here is another example, this time for iShares Core FTSE 100 Dist (ISF.L). I’m assuming the currency keeps changing between GBX, GBP and either USD or EUR.

2. Dividends

Dividends quoted on Yahoo are sometimes in the wrong currency (e.g. VUSA).

3. Adjusted Close Prices

This is a bit hit and miss. As you can see from the screen shot above the price has been adjusted. Frequently there is no adjustment and no dividends listed when they have been paid.

4. Questions about the long term future of the service

People are frequently predicting the imminent demise of the service. This may or may not be true, but doesn’t fill me with confidence given the other more serious problems listed.

The solution

Enter Quandl, and specifically the data provided by the London Stock Exchange. R has a dedicated package (Quandl) which makes downloading data incredibly easy. You can download data for a number of different securities in one go, select what data (columns and date ranges) you actually want, and download directly as an xts object. You will need to register to obtain a free api key if you are going to make more than 50 calls per day on the api (replace XXXXXX in my code with your key). The documentation is here and the cran page is here.
## Download all available data for ticker ISF
mydata <- Quandl("LSE/ISF", type = "xts", api_key = "XXXXXX")

            Price   High    Low  Volume Last Close Change  Var%
2017-06-19  743.8 744.80 740.45 1788670      743.8    5.9  0.80
2017-06-20  738.4 761.45 738.20 2543313      738.4   -5.4 -0.73
2017-06-21  735.7 742.45 732.80 3275787      735.7   -2.7 -0.37
2017-06-22  734.9 735.90 730.95 2692453      734.9   -0.8 -0.11
2017-06-23  733.0 744.80 685.95 4659435      733.0   -1.9 -0.26
2017-06-26  735.2 738.85 735.10 2919664      735.2    2.2  0.30

## Download the first column (close) for ISF and VMID from 2016-12-31 until the end of the data
mydata2 <- Quandl(c("LSE/ISF.1", "LSE/VMID.1"), type = "xts", start_date="2016-12-31", 
                api_key= XXXXXXX)

                 LSE.ISF - Price LSE.VMID - Price
2017-01-03           708.9            29.17
2017-01-04           709.3            29.19
2017-01-05           710.0            29.46
2017-01-06           711.3            29.50
2017-01-09           714.0            29.61
2017-01-10           717.4            29.66

Obtaining the metadata that goes with each data set is a little trickier, as I can’t get the metaData function within the Quandl package to work for these data sets. The approach is therefore to download the xml file that contains the metadata, convert it into a data frame within R, and then do some further processing to get the data I actually want.
library(RCurl)
library(XML)

xml.url <- ""
xmldata <- getURL(xml.url)
xmldf <- xmlToDataFrame(xmldata, stringsAsFactors = F)
stockinfo <- xmldf[, c(2, 4, 8, 7)]
colnames(stockinfo) <- c("Code", "Name", "From", "To")
stockinfo$Currency <- substr(stockinfo$Name, nchar(stockinfo$Name) - 2, nchar(stockinfo$Name))
stockinfo$Name <- substr(stockinfo$Name, 1, nchar(stockinfo$Name) - 14)

Code  Name                               From          To              Currency
ISF   ISH COREFTSE100 ETF price (ISF)    2006-03-16    2017-06-26      GBX

Issues with Quandl Data

The solution is not perfect:
  • You do not get full OHLC data if that is important to you. However, there is high, low, close and volume, so you could use the previous close as the open.
  • There is no index data. Because data for total return indices are really difficult to find, I have always used accumulating ETFs where possible (e.g. CUKX for the FTSE 100). Personally, if I’m using an index for comparison purposes I would rather use something I could actually invest in, i.e. an ETF, rather than a theoretical uninvestible index.
  • No dividends or adjusted close – more on this in a later blog post.
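The previous-close-as-open workaround in the first point is just a one-day shift, e.g. using the ISF closes shown earlier:

```r
close <- c(743.8, 738.4, 735.7, 734.9)
open_approx <- c(NA, head(close, -1))  # yesterday's close becomes today's open
open_approx  # NA 743.8 738.4 735.7
```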

Saving Reports Generated in R to Evernote

Evernote is my digital filing cabinet. I keep lots of information in it because it is backed up, easy to find and with me wherever I am on all my devices. It is not surprising therefore that I frequently want to store analysis that I carry out in R as an Evernote note.
To produce reports I prefer to use the spin function in Knitr. I love spin because you don’t need a separate markdown file. You can still run the entire script if you want to, because all the markdown formatting instructions just look like comments to regular R. More information on the advantages of spin vs regular markdown here.
Evernote (at least on Windows – I don’t know about other systems) has a wonderful device called an import folder. Any files added to this folder are automatically added to Evernote. In most cases the file is simply attached to a new note. However, in the case of html, txt and rtf files, the contents of the file are added to the body of a new note and the file is not attached. Initially I thought that saving the html document produced by spin to the Evernote import folder would be all I needed to do. Sadly it wasn’t that straightforward, because the import folder doesn’t play nicely with local html files – the more advanced formatting produced by spin is destroyed and images are sometimes not included.
After some experimentation my preferred option (option 6 in the table below) is as follows:
1. Save the output of spin as an html file to Dropbox.
2. Go to Dropbox on the web and open the document (ctrl clicking opens it in a new window without Dropbox clutter). It looks like a regular html web page.
3. Use the Evernote webclipper to save the file to Evernote
Although this is not ideal because there are still some manual steps, it is the only way I could find of achieving goals 1 and 3 in the table below and being able to at least semi-automate the process.
The very short script required to produce the file and take you to Dropbox is as follows. The R script file it refers to (spinexample.R), reproduced below, is a properly formatted spin script (see here for details). You will need to wait a few seconds for Dropbox to upload the file to the web before you can see it.
Script that produces report:
library(knitr)
library(rmarkdown)
opts_chunk$set(warning = FALSE, results = "markup",
               out.width = "100%", echo = TRUE)
render("spinexample.R", "html_document",
       output_dir = "C:/Users/Mike/Dropbox/R2EN")

The report script (spinexample.R)

#' ---
#' title: "Spin to Evernote Example"
#' author: "Mike"
#' ---
x <- rnorm(1000)
#' Here is the mean. **It's a very important number.**
mean(x)
#' Here is a pretty histogram
#+ echo=FALSE
hist(x)
#' This is the summary of the random numbers
summary(x)
#' Here is a sum
sum(x)

For reference, here is a brief summary of the different approaches I tried. Note that no option achieved all three goals. An Evernote note showing the results for options 3, 4 and 6 can be accessed here.

The three goals were:
1. Advanced formatting (highlighted code etc.)
2. Completely automatic
3. Content forms the body of the note

The options tried were:
1. Manually copy and paste the output into Evernote.
2. Use spin to save an html file to the Evernote import folder (the html must not be self-contained, otherwise it leaves out any images).
3. Use spin to save a Word or pdf document to the Evernote import folder.
4. Use spin to save an rtf document to the Evernote import folder (because the render to rtf doesn’t support advanced formatting anyway, what you see in Evernote is what you get in the rtf document).
5. Render as html, copy and paste the content into a gmail, then either webclip it or send it to Evernote.
6. Use spin to save an html file to Dropbox and then use the webclipper (the preferred option).

R for Excel Users – Part 5 – How to Save the Results of Your Analysis

One of the things you need to adapt to, if you are used to Excel and start using R, is producing nice output. It’s fairly straightforward in Excel to produce pretty tables and graphs and place them in a coherent order with accompanying text or alternatively to dynamically link those tables and graphs to a word document. R, in this respect at least, is much more like a pocket calculator than a spreadsheet – it produces great results but you then have to do something with those results to present or record them.
The answer, of course, is the R package knitr. This is built in to RStudio and produces fantastic, publication-quality reports containing text, tables, figures and graphs pretty much out of the box. It is very easy to produce a markdown document that contains all these elements, and there are lots of written and video tutorials available to tell you how to do it. If, like me, you often just want to print the results and not the code, this is easy to configure as well.
The problem is that isn’t how I work! I, like most other people I expect, develop my ideas in a script. I add things, then tinker and tweak the script until I’m happy with it. Only then do I think about how I want to save or report the results of my analysis. If I used knitr in “beginner mode” I would then need to copy the script into a markdown document and work from there. But if I changed the script I would need to copy and paste the chunks again. The other problem I have is that pressing the knitr button doesn’t allow you to control the output file name or destination. I routinely run the same script with different data. Unless I remembered to manually rename my last analysis, knitr would simply overwrite it with no warnings.
It is possible to run the code in a markdown document without rendering the whole document, but it can get messy. For example, if you use xtable to produce html tables, you will get the html code for the table in the console rather than the actual table.

Solution 1

A quick and dirty way of producing an output document directly from your script in Rstudio is file -> knit (ctrl-shift-k). This produces an html, pdf or word document including all the code and the results of running the code with the graphs inline in the correct place. However, it doesn’t solve the overwrite problem. To get around this problem I have a separate script file containing the following code. It assumes that the script you want to render (“script.R”), is in the current working directory. It produces an html document with a name based on the current date and time and then displays it in your browser. This is quite sufficient for me in almost all cases.

library(rmarkdown)
filename <- paste("Test_", format(Sys.time(), format = "%Y%m%d_%H%M"), ".html", sep = "")
render("script.R", "html_document", output_file = filename)
browseURL(paste("file:///", getwd(), "/", filename, sep = ""))


Solution 2

If you want to include some nice markdown text in your document but don’t want to copy the R code from elsewhere (for the reasons described above), you can run sections of code contained in the script file from within the markdown document. All you need to do is first read in the R script file with read_chunk, and then put the name of the section you want to run after the ```{r in the markdown document, before any chunk options.

```{r cache=FALSE, echo=F}
library(knitr)
read_chunk("script.R")
```

```{r Section12, echo=F}
```

The section is indicated by putting # ---- Section12 ---- in the script document.

I tend to put all the major data analysis in my main script document. At the bottom I put some code to then render the R markdown document. The R markdown document contains text, instructions on printing the tables and links to code in the script document to print graphs etc. Because I force knitr to work in the global environment (which it doesn’t by default), I don’t need to recalculate things that have already been calculated in the main script.
Here’s an example to see how it works. Both files should be in the same directory. Note that the style tags just add some css information to improve the formatting of the table, which is rendered using xtable. Running the script file is all you need to do to produce the html output.

R script file (correlation2.R)

# ---- Section1 ----

# Calculate x

x <- rnorm(500, 0.001, 0.02)
xbar <- mean(x)
xsd <- sd(x)

# Calculate y which is inversely proportional to x

y <- 2*xbar-x + rnorm(500, sd = 0.02)
ybar <- mean(y)
ysd <- sd(y)

# ---- Section2 ----

xysummary <- matrix(data = c(xbar, xsd, ybar, ysd), 2, 2)
colnames(xysummary) <- c("x","y")
rownames(xysummary) <- c("Mean","sd")

# ---- Section3 ----

plot(x,y, main = "Correlation between y and x")

# ---- Section4 ----

cc <- cor(x,y)

# ---- Section5 ----

render("correlation2.Rmd", "html_document", output_file = "CorExpt.html", envir = .GlobalEnv)
browseURL(paste("file:///", getwd(), "/", "CorExpt.html", sep = ""))

R markdown file (correlation2.Rmd)

---
output: html_document
---

<style type="text/css">
th {
  border: 1px solid black;
  text-align: center;
}
td {
  border: 1px solid black;
  text-align: center;
}
</style>

```{r cache=FALSE, echo=F}
library(xtable)
```

# Correlation Experiment

First we are going to generate some normally distributed weekly returns (x) and then we're going to calculate a second, negatively correlated, stock (y) from this data.

```{r echo=F, results='asis'}
print(xtable(xysummary, digits = 4), type = "html")
```

Let's now check the correlation by producing a scatter plot and calculating the correlation coefficient.

```{r Section3, echo=F}
```

Correlation coefficient = `r cc`

Final output (Screen print so that the styling is retained)


Getting round some Yahoo Finance problems using Google Finance and R

Yahoo Finance is probably the most widely used free source of historic share price data. Normally it works really well, and the integration into R packages like quantmod makes it an easy-to-use and powerful solution. However, it is not without problems. Here are a couple of them, and some solutions using Google Finance. I’m only going to talk about London traded shares and UK based mutual funds, and I will be using the UK version of Google Finance throughout.

Obviously spurious data

This normally only happens for shares with small trading volumes or with ETFs and not for FTSE350 companies. Here are charts for the same ETF (Vanguard S&P 500, VUSA.L) from Yahoo Finance and Google Finance. Using quantmod you can easily change the source to use Google instead of Yahoo. The moral of the story is to always check the basic chart before attempting any more sophisticated analysis.

[Charts: VUSA from Yahoo Finance and VUSA from Google Finance]

Missing data

It is often impossible to find mutual fund data or more exotic indices on Yahoo Finance. Sometimes this data is available on Google: bringing up the page shows historic data, but frequently with no download option. The reason, I understand, is to do with the licence that Google has with the data provider. In this case, calls to the API using quantmod, or a manual solution, seem to fail as well.
You could of course manually cut and paste the data into a spreadsheet. However as the data appears on several pages this becomes tedious if you want larger amounts of data. It is worth noting that although the drop down at the bottom of the page gives a maximum of 30 lines per page, manually typing in 200 in the appropriate place in the url gives you 200 lines.
It is very easy to set up R to do the data scraping for you. The script below works for me.
You will need to use the Google Finance web page to find the stock code you are interested in. I’m going to use the FTSE 100 Minimum Variance index (Google code INDEXFTSE:UKXMV) in this example – an index I couldn’t find on Yahoo and where there is no option to download the data on Google. You will need to manually add symb, fname, startdate and enddate to the script. If there is less data available than you have asked for, you may get an error when the script tries to bind a non-existent table to your data frame.

### Complete data required below

library(lubridate)
library(RCurl)
library(XML)

symb <- "INDEXFTSE:UKXMV"
fname <- "UKXMV"
startdate <- ymd("2014-01-31")
enddate <- ymd("2015-11-16")

## Construct basic URL 

sd <- day(startdate)
sm <- month(startdate, label = T)
sy <- year(startdate)

ed <- day(enddate)
em <- month(enddate, label = T)
ey <- year(enddate)

basicURL <- paste("",symb,
                  "&startdate=",sm,"+",sd,"+",sy,"&enddate=",em,"+",ed,"+",ey,"&num=200&start=", sep ="")

nrows <- as.numeric(difftime( enddate, startdate, units = "days")/7*5)

### Download data and merge into a data frame called DF

start <- seq(0, nrows, by = 200)
DF <- data.frame()

for (i in 1:length(start)) {
  theurl <- getURL(paste(basicURL, start[i], sep = ""))
  temp <- readHTMLTable(theurl, which = 4, header = TRUE, stringsAsFactors = FALSE)
  DF <- rbind(DF, temp)
}

## Tidy up DF
colnames(DF) <- gsub("\\s|\\\n",'',colnames(DF))

DF[DF == "-"] <- NA


DF[,1] <- mdy(DF[,1])
row.names(DF) <- DF[,1]
DF[,1] <- NULL

id <- c(1:ncol(DF)) 
DF[,id] <- as.numeric(as.character(unlist(DF[,id])))


## Convert the data frame DF into an XTS object

DF.xts <- as.xts(DF)
assign(fname, DF.xts)


R for Excel Users – Part 4 – Pasting into a filtered table

If you have ever tried to paste into a filtered table in Excel, you will know that it does not work as expected. Excel ignores the filtering and just pastes the data in starting from the first row selected. None of the workarounds I’ve ever found worked for me. The only guaranteed method is to sort the data and then manually find the rows where you want to paste.

Thankfully, in R, it is very easy to complete this task.

## Create data frame
a <- rep(LETTERS[1:5], times = 5)
b <- rep(NA, 25)
df <- data.frame(a, b)
## Add values to b (column 2) where a is equal to "B"
df[a == "B", 2] <- c(6, 7, 8, 9, 0)
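An equivalent formulation (a matter of taste) indexes through the data frame itself with df$a, which avoids relying on the free-standing vector a still being in scope:

```r
df2 <- data.frame(a = rep(LETTERS[1:5], times = 5), b = NA_real_)
df2$b[df2$a == "B"] <- c(6, 7, 8, 9, 0)

df2$b[df2$a == "B"]           # 6 7 8 9 0
all([df2$a != "B"]))   # TRUE: the "unfiltered" rows are untouched
```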

Mapping in R

I’ve been having great fun getting to grips with drawing maps showing data in R. There is a good tutorial here and extra information here and here.

I decided to use the Royal Society for the Protection of Birds (RSPB) 2015 Big Garden Birdwatch data. It took some time to align the data they give for “counties” with something I could find maps for, so the maps won’t be completely accurate, especially as the result for a region is the simple mean of its sub-regions.

It was a fun learning exercise though. Here are a couple of maps.


R for Excel Users – Part 3 – Merging data

In the last part of this series we looked at an R equivalent to VLOOKUP in Excel. Another use of VLOOKUP is to match data on a one-to-one basis. For example, inserting the price of a stock from a master prices table into a table of stock holdings. I can’t believe how easy this is to do in R!

Here are two data frames. One contains the phonetic alphabet and the other contains a message.

> head(phonetic)
  Letter    Code
1      A   Alpha
2      B   Bravo
3      C Charlie
4      D   Delta
5      E    Echo
6      F Foxtrot
> message
   No Letter
1   1      R
2   2      I
3   3      S
4   4      A
5   5      M
6   6      A
7   7      Z
8   8      I
9   9      N
10 10      G

To find the phonetic pronunciation of each letter in the message, all we need to do is combine the two data frames using merge, ensuring that there is a column with the same name in each data frame. Unfortunately, you can either have the result sorted by the column you are combining by (“Letter” in this case) or not at all (i.e. random – an anagram for free!). To get round this, I sorted by an extra ID column (called “No”) and then deleted it afterwards.

> result <- merge(message, phonetic, by = "Letter")
> result <- result[order(result$No),]
> result[,2] = NULL
> result
   Letter     Code
8       R    Romeo
4       I    India
9       S   Sierra
2       A    Alpha
6       M     Mike
1       A    Alpha
10      Z     Zulu
5       I    India
7       N November
3       G     Golf

This post helped me with the sorting idea and also describes an alternative approach using join from the {plyr} library. I know I need to get into {plyr}, and it’s on my list of “known unknowns” for my journey to being competent in R.
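For this kind of one-to-one lookup, base R's match() also preserves the order of the message, with no need for the sort-and-delete step (a sketch on a shorter message):

```r
phonetic2 <- data.frame(Letter = c("A", "I", "M", "R", "S"),
                        Code   = c("Alpha", "India", "Mike", "Romeo", "Sierra"),
                        stringsAsFactors = FALSE)
message2 <- data.frame(Letter = c("R", "I", "S", "A"), stringsAsFactors = FALSE)

# look up each message letter's position in the phonetic table
message2$Code <- phonetic2$Code[match(message2$Letter, phonetic2$Letter)]
message2$Code  # "Romeo" "India" "Sierra" "Alpha"
```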

Evernote’s OCR

Ray Crowley reminded me about how amazing Evernote’s OCR capabilities are.

It works so well because for each scanned word, it records all the possible different words it thinks it could be. This would obviously be completely impossible with traditional OCR.

If you want to take a look at the OCR in one of your notes it’s very simple.

  1. Copy the note link. It’s fairly easy to find e.g in the windows client, right click the note in the list view. It will look something like:
  2. You then need to put the last bit of the address (d3e3…..) after  to get
  3. Put the address in your browser and scroll through all the metadata for the note including a picture at the bottom


The possible “false positives” are a small price to pay for such an awesome system.