R for Excel Users – Part 2 – VLOOKUP

So the second in the series. I should just emphasise that this is not an introductory course in R for Excel users. There are lots of great “Introduction to R” courses. What I’m finding, as I make the transition from “Excel for everything” to using the best tool for the job, is that although I’m quite confident now in basic R, there are lots of things which I can do really quickly in Excel but I’m not sure how to do in R. The danger is that, in order to get something done, I end up doing it in Excel when in fact R would be the better tool. This series documents my findings when I have taken the trouble to do the research and figure out how to do it in R.

VLOOKUP has to be my favourite Excel function. I use it all the time. There are various ways of implementing VLOOKUP in R. (See the general resources at the bottom of the first part of this series.) I’m going to explain a specific example I used all the time in my previous existence as a teacher. When students do a test they get a numerical score (e.g. a percentage) and this is then converted into a grade (A*, A, B etc) based on some defined grade boundaries. (A* is a grade above A in the English education system. U stands for unclassified – a politically correct term for “fail”.)

Here are the scores, contained in a data frame called scores.

> scores
     Name Score
1     Ann    23
2 Bethany    67
3 Charlie    42
4   David    73

Here are the grade boundaries in a data frame called grades. So to get a grade C, for example, you need anywhere from 50 marks up to 59 (assuming we are working in whole numbers). If you get 60, then it’s a B.

> grades
  Score Grade
1     0     U
2    40     D
3    50     C
4    60     B
5    70     A
6    80     A*
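
If you want to follow along, here is a quick way to recreate these two data frames (values taken straight from the tables above):

# Recreate the example data frames shown above
scores <- data.frame(Name = c("Ann", "Bethany", "Charlie", "David"),
                     Score = c(23, 67, 42, 73))

grades <- data.frame(Score = c(0, 40, 50, 60, 70, 80),
                     Grade = c("U", "D", "C", "B", "A", "A*"))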

The critical command we need is findInterval. findInterval needs a minimum of two arguments: what we want to find and the intervals we want to find it in. Our grades table defines a total of 7 intervals, with indices 0 to 6. There is one interval for everything below 0 (not relevant in our case – none of my students were that weak!), one for everything of 80 and above, and five between the 6 numbers in our table.

[Figure: the seven intervals defined by the six grade boundaries]

findInterval returns the interval number:

> findInterval(scores$Score,grades$Score)
[1] 1 4 2 5

So, Ann’s score of 23 is between 0 and 40 which is interval number 1, while David’s score of 73 is between 70 and 80, which is interval number 5. We can then use these numbers, in square brackets, to pull out the correct grade letter from our grades data frame.

Grade <- grades$Grade[findInterval(scores$Score,grades$Score)]

Combining this new vector with our scores data frame using cbind gives us a new scores data frame, with the grade as well as the score.

> scores <- cbind(scores, Grade)
> scores
     Name Score Grade
1     Ann    23     U
2 Bethany    67     B
3 Charlie    42     D
4   David    73     A
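
If you find yourself doing this kind of lookup a lot, you could wrap the pattern in a little function of your own. Here is a minimal sketch (the function name is my own invention):

# Approximate-match lookup, like VLOOKUP with its final argument set to TRUE
vlookup_approx <- function(values, boundaries, labels) {
  labels[findInterval(values, boundaries)]
}

vlookup_approx(scores$Score, grades$Score, grades$Grade)

The same findInterval call is doing all the work; the function just gives names to the three ingredients.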

Postscript

As an aside, if in your example the 0 interval (i.e. below 0% in my case) is meaningful, you would need a slightly different approach. Here are some ideas to start you off. One way would be to avoid the problem altogether by making sure that the first number in the lookup table really is less than the minimum you could possibly have (-10 million or whatever). A more elegant solution would be to separate the grades data frame into separate vectors of numbers and grades, with one more entry in the grades vector than in the numbers vector, and then add one to the result of findInterval.

> Score <- c(0, 40, 50, 60, 70, 80)
> Grade <- c("Nonsense!", "U", "D", "C", "B", "A", "A*")
> Grade[findInterval(-10, Score) + 1]
[1] "Nonsense!"
> Grade[findInterval(scores$Score, Score) + 1]
[1] "U" "B" "D" "A"

R for Excel Users – Part 1 – Min and Max

I would describe myself as a competent Excel user but very much a beginner in R. There are lots of parallels between the two programs but lots of differences too. In this series I’m going to describe how to do some routine Excel tasks in R and any major differences between the two.

There are some great general resources available for people making the transition and I’ve listed some useful ones at the end.

So let’s start with min and max. By themselves, min and max just return the minimum or maximum value of all their arguments – numbers, vectors, etc.

> max(c(105, 120, 230, 220), 100)
[1] 230

To compare the values in two vectors in a pairwise fashion we need to use pmin and pmax.

> ref <- c(105, 120, 230, 220)
> current <- c(100, 120, 235, 100)
> pmin(ref, current)
[1] 100 120 230 100

This leads to something that is very easy to do in R but difficult in Excel because you need to overwrite existing data (I guess you would use a macro). I have a vector of reference stock prices called ref and a vector of current prices called current. I want to update the reference price to be the maximum of the reference price and the current price.

> ref <- c(105, 120, 230, 220)
> current <- c(100, 120, 235, 100)
> ref <- pmax(ref, current)
> ref
[1] 105 120 235 220

Elements 1, 2 and 4 stay the same, but element 3 has been updated with the higher price.

General resources about R for Excel users:

Importing Wikipedia Tables into R – A Beginner’s Guide

Sooner or later in your discovery of R, you are going to want to answer a question based on data from Wikipedia. The question that struck me, as I was listening to the Today Programme on Radio 4, was, “how big are parliamentary constituencies and how much variation is there between them?”

The data was easy to find on Wikipedia – but how to get it into R? You could of course just cut and paste it into Excel, tidy it up, save it as a CSV and read it into R. As a learning exercise, though, I wanted to be able to cope with future tables that are always changing, and the Excel approach becomes tedious if you have to do it a lot.

Google gave me quite a bit of information. The first problem is that Wikipedia is an https site, so you need some extra gubbins in the URL call. Then you need to find the position of the table on the page in order to read it. Trial and error works fine, and “which = 2” did the job in this case. I then tidied up the column names.

library(RCurl)
library(XML)
# Fetch the page source; ssl.verifypeer = FALSE gets around the https problem
theurl <- getURL("https://en.wikipedia.org/wiki/List_of_United_Kingdom_Parliament_constituencies", .opts = list(ssl.verifypeer = FALSE))
# The constituency list is the second table on the page
my.DF <- readHTMLTable(theurl, which = 2, header = TRUE, stringsAsFactors = FALSE)
# Tidy up the column names
colnames(my.DF) <- c("Constituency", "Electorate_2000", "Electorate_2010", "LargestLocalAuthority", "Country")
head(my.DF)

              Constituency Electorate_2000 Electorate_2010 LargestLocalAuthority Country
1                Aldershot          66,499          71,908             Hampshire England
2      Aldridge-Brownhills          58,695          59,506         West Midlands England
3 Altrincham and Sale West          69,605          72,008    Greater Manchester England
4             Amber Valley          66,406          69,538            Derbyshire England
5  Arundel and South Downs          71,203          76,697           West Sussex England
6                 Ashfield          74,674          77,049       Nottinghamshire England

It looks fine. As expected, it’s a data frame with 650 observations and 5 variables. The problem is that the commas in the numbers have made R interpret them as text, so the class of the two electorate variables is “chr” not “num”. A simple as.numeric doesn’t work: you have to remove the commas first, which you can do with gsub.

my.DF$Electorate_2000 <- as.numeric(gsub(",","", my.DF$Electorate_2000))
my.DF$Electorate_2010 <- as.numeric(gsub(",","", my.DF$Electorate_2010))
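
If a table had many columns like this, you could wrap the same gsub/as.numeric step in a helper and apply it to each affected column in one go. A sketch (the helper name is my own):

# Strip the thousands separators, then convert to numbers
de_comma <- function(x) as.numeric(gsub(",", "", x))

cols <- c("Electorate_2000", "Electorate_2010")
my.DF[cols] <- lapply(my.DF[cols], de_comma)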

So now I can answer the question.

summary(my.DF$Electorate_2010)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21840   66290   71310   70550   75320  110900 
hist(my.DF$Electorate_2010, breaks = 20, main = "Size of UK Parliamentary Constituencies", xlab = "Number of Voters", xlim = c(20000, 120000))

Answer: about 70 thousand, with half of them between 66 and 75 thousand.

[Histogram: Size of UK Parliamentary Constituencies]

I’ve tried this approach on a couple of other tables and it seems to work OK, so good luck with your Wikipedia data scraping.

Keeping track of investments – some tricks

Update. As of 21st November 2015 this seems to have stopped working. I don’t know whether it is a change to IFTTT or FE Trustnet. I am still investigating.

It’s obviously important to keep track of how your investments are doing, unless this leads you to overreact and then trade too much, in which case just ignore them! If you invest online, each of the platforms provides its own dashboard which tracks the value of your holdings. If you use a number of providers or have more than one account (e.g. ISA and non-ISA) it becomes tedious to monitor them all. There are some great free websites which allow you to add all your holdings and then monitor their value. My favourite (I’m in the UK) is Trustnet. It does all the things you would expect it to do. My only real criticism is that it can be slow to update prices (particularly for investment trusts), so the data is only correct by about 8:30 in the evening. The other minor gripe is that they keep changing the day on which they send an (optional) weekly summary by e-mail. I use this data to update my master spreadsheet.

One of the things I really like about Trustnet is the ability to download the portfolio in .csv format, which can be read by any spreadsheet program. The problem caused by the moving weekly e-mail led me to a solution based on automatically downloading the CSV file.

IFTTT is a great service for linking together different apps and websites. For example, every time I favorite a tweet, IFTTT automatically sends it to Evernote. I’ve now created a recipe which downloads my investments CSV file to Dropbox every Saturday morning. Here are the steps to set it up. You need to register for IFTTT. The “channels” you will need are “Date and Time” and Dropbox. I assume you already have a Dropbox account, but if you don’t, it is easy to set up.

IFTTT stands for If This Then That. In our case “if this” is a day and time and “then that” is download a file to Dropbox. The process for creating a new recipe is fairly intuitive so I won’t give a lot of detail here, but here are the key steps.

  1. Choose “My Recipes” from the top of the page and then “Create a Recipe”
  2. Then click on the blue “This” and then choose “Date and Time”.
  3. You then choose a trigger based on how often you want to download the file. I use “Every day of the week at”. Then complete the trigger feeds according to your needs and click “Create Trigger”.
  4. Now for the clever bit. Click the blue “that” and then choose the familiar Dropbox icon. Click “add file from url”.
  5. There are now three boxes to complete. The first one is the tricky one. If you download data from Trustnet it doesn’t make it obvious what the URL is that you are downloading from. The URL is actually of the form http://www.trustnet.com/Tools/Portfolio/ExportPortfolio.aspx?PortfolioId=xxx, where xxx is a unique number for the portfolio. To find this number, go to the Trustnet portfolio page, right-click the portfolio name you are interested in and copy the link address (that’s the wording in Chrome; other browsers may differ). Paste the link somewhere like Notepad and you should have: javascript:goToPage('/Tools/Portfolio/Valuation.aspx','1234567890'); The number is what you need. Replace xxx above with this number.
  6. So to summarise, you will add
    http://www.trustnet.com/Tools/Portfolio/ExportPortfolio.aspx?PortfolioId=1234567890 to the File URL box.
  7. Click in the file name box and click the conical flask icon. Choose “CheckTime” so each download is named including the date and time. Stick with the default folder path or create your own. (Mine is just Portfolio). Click create recipe and you are done.
  8. Unfortunately you will need to manually change the file extension from .aspx to .csv for each download (or read the files straight into R – see the sketch below).
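
Alternatively, you can skip the renaming altogether, because R’s read.csv only cares about the contents of a file, not its extension. A minimal sketch, assuming the downloads land in a Dropbox folder called Portfolio (adjust the path to your own setup):

# List the downloaded files, which still have their .aspx extension
files <- list.files("~/Dropbox/Portfolio", pattern = "\\.aspx$", full.names = TRUE)

# Read the latest one straight in; IFTTT's CheckTime names should sort chronologically
portfolio <- read.csv(files[length(files)], stringsAsFactors = FALSE)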

So that’s it. Every week (or whatever frequency you have chosen) a file will appear in your Dropbox documenting the current state of your investments.

Aggregate and Summary in R – A Powerful Combination but Weird!

The aggregate function in R is a really useful way of summarizing data based on different criteria. Continuing with my weather theme, I wanted to investigate the relationship between the temperature and the different observed weather types. The aggregate function did a great job of summarizing the data:

> aggregate(FinalDoc$T, by= list(FinalDoc$Descrip), FUN=summary, na.rm=TRUE)
                     Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1                Clear night  6.700     9.050   10.800 10.740    12.250 15.900
2                     Cloudy  5.800    12.920   15.400 15.110    17.420 22.000
3                    Drizzle 10.500    10.580   10.650 10.650    10.730 10.800
4                        Fog  3.800     3.800    3.800  3.800     3.800  3.800
5                 Heavy rain 10.000    10.500   11.400 11.840    12.850 14.600
6    Heavy rain shower (day) 10.400    11.300   12.200 13.000    14.300 16.400
7                 Light rain  9.800    11.950   12.700 12.980    13.850 18.700
8    Light rain shower (day) 10.200    12.450   13.200 14.340    16.000 20.100
9  Light rain shower (night) 10.200    13.400   14.500 13.600    14.600 15.300
10                      Mist  4.500     6.500   10.200  9.946    12.600 15.900
11                  Overcast  6.500    12.750   15.000 14.500    16.450 19.900
12       Partly cloudy (day) 10.400    15.750   17.600 17.380    19.250 22.100
13     Partly cloudy (night)  5.500     8.850   11.500 10.750    12.400 15.300
14                 Sunny day  7.400    13.300   15.200 14.920    17.500 21.200

However, the column headings are not great. “Group.1” is not very informative, and having “x.” in front of all the statistic names is just plain annoying. “Well”, I thought, “it’s just a data frame, I can change the column names easily enough”. How wrong I was! If you look at the structure of the object, it turns out to be a data frame with just two columns: Group.1, which is a factor, and x, which is a matrix containing all the numbers.

> WeatherT <- aggregate(FinalDoc$T, by= list(FinalDoc$Descrip), FUN=summary, na.rm=TRUE)
> str(WeatherT)
'data.frame':	14 obs. of  2 variables:
 $ Group.1: Factor w/ 32 levels "Clear night",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ x      : num [1:14, 1:6] 6.7 5.8 10.5 3.8 10 10.4 9.8 10.2 10.2 4.5 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "Min." "1st Qu." "Median" "Mean" ...

Printing the matrix, x, gets rid of all the irritating x.’s:

> WeatherT$x
       Min 1st Qu Median   Mean 3rd Qu  Max
 [1,]  6.7   9.05  10.80 10.740  12.25 15.9
 [2,]  5.8  12.92  15.40 15.110  17.42 22.0
 [3,] 10.5  10.58  10.65 10.650  10.73 10.8
 [4,]  3.8   3.80   3.80  3.800   3.80  3.8
 [5,] 10.0  10.50  11.40 11.840  12.85 14.6
 [6,] 10.4  11.30  12.20 13.000  14.30 16.4
 [7,]  9.8  11.95  12.70 12.980  13.85 18.7
 [8,] 10.2  12.45  13.20 14.340  16.00 20.1
 [9,] 10.2  13.40  14.50 13.600  14.60 15.3
[10,]  4.5   6.50  10.20  9.946  12.60 15.9
[11,]  6.5  12.75  15.00 14.500  16.45 19.9
[12,] 10.4  15.75  17.60 17.380  19.25 22.1
[13,]  5.5   8.85  11.50 10.750  12.40 15.3
[14,]  7.4  13.30  15.20 14.920  17.50 21.2

So from there it was just a matter of pulling out the matrix and giving it row names.

> WeatherT2  <- WeatherT$x
> rownames(WeatherT2) <- WeatherT$Group.1
> WeatherT2
                           Min 1st Qu Median   Mean 3rd Qu  Max
Clear night                6.7   9.05  10.80 10.740  12.25 15.9
Cloudy                     5.8  12.92  15.40 15.110  17.42 22.0
Drizzle                   10.5  10.58  10.65 10.650  10.73 10.8
Fog                        3.8   3.80   3.80  3.800   3.80  3.8
Heavy rain                10.0  10.50  11.40 11.840  12.85 14.6
Heavy rain shower (day)   10.4  11.30  12.20 13.000  14.30 16.4
Light rain                 9.8  11.95  12.70 12.980  13.85 18.7
Light rain shower (day)   10.2  12.45  13.20 14.340  16.00 20.1
Light rain shower (night) 10.2  13.40  14.50 13.600  14.60 15.3
Mist                       4.5   6.50  10.20  9.946  12.60 15.9
Overcast                   6.5  12.75  15.00 14.500  16.45 19.9
Partly cloudy (day)       10.4  15.75  17.60 17.380  19.25 22.1
Partly cloudy (night)      5.5   8.85  11.50 10.750  12.40 15.3
Sunny day                  7.4  13.30  15.20 14.920  17.50 21.2

That looks much better. The other advantage is that, now it’s a simple matrix, it is much easier to reference the data.

> WeatherT2[2,3]
[1] 15.4
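
As an aside, if you would rather end up with an ordinary data frame than a matrix, there is a do.call trick that flattens the matrix column, because data.frame() expands a matrix argument into individual columns. A sketch:

# Flatten the matrix column x into ordinary data frame columns
WeatherDF <- do.call(data.frame, WeatherT)
colnames(WeatherDF) <- c("Weather", "Min", "Q1", "Median", "Mean", "Q3", "Max")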

The overall message for me? In R, if it looks like a duck, swims like a duck, and quacks like a duck, then it may not be a duck. Always use str to check, and if you get…

> str(object)
 Named duck [1:32] 38 240 2 1 19 3 46 12 5 13 ...
 - attr(*, "names")= chr [1:32] "Mallard" "Mandarin" "Tufted" "Jemima Puddle" ...

it’s time to take a break!

Turbocharge Dropbox with IFTTT and then Analyse the Results in R

I’ve been learning a lot about the R statistical programming language. It’s a great (free) tool to process data, analyze it and produce great graphics. I’m only just beginning, and I feel I’m scratching the surface, but so far it has been fun.

I was interested in gathering data about my local weather (yes, I know, a British obsession). The UK Met Office publishes a lot of data (which you can access with a free DataPoint account), but the level of detail I’m interested in is only available for the last 24 hours. Now I’m sure that if you worked for a company you would have servers on 24/7, so you could write some sort of script to download the data every day. For the hobbyist (me) or student, this is not an option. Enter IFTTT.

IFTTT is an app which connects lots of different services together. For example, I have an IFTTT recipe that sends a copy of any tweet that I favorite to my Evernote account. There is also a generic timer which allows you to trigger actions at a set time. It was dead easy to set this up to download the XML file that I need from the Met Office to Dropbox at 1:00 am every morning. Next time I switch on my computer, Dropbox makes sure that the file(s) get synced to my hard drive. I have written an R script, which I can run at my convenience, which reads the files into R and puts the data into a data frame – definitely my toughest R challenge to date! Then I can play to my heart’s content!
[Figures: temperature by hour and wind polar plots produced from the downloaded data]

Here is an example of an XML file for 24 (well, actually 25 in this case) hours of data. Location details redacted.
<SiteRep>
<Wx>
<Param name="G" units="mph">Wind Gust</Param>
<Param name="T" units="C">Temperature</Param>
<Param name="V" units="m">Visibility</Param>
<Param name="D" units="compass">Wind Direction</Param>
<Param name="S" units="mph">Wind Speed</Param>
<Param name="W" units="">Weather Type</Param>
<Param name="P" units="hpa">Pressure</Param>
<Param name="Pt" units="Pa/s">Pressure Tendency</Param>
<Param name="Dp" units="C">Dew Point</Param>
<Param name="H" units="%">Screen Relative Humidity</Param>
</Wx>
<DV dataDate="2015-07-17T23:00:00Z" type="Obs">
<Location i="xxxx" lat="xxx" lon="xxxx" name="xxxx" country="ENGLAND" continent="EUROPE" elevation="32.0">
<Period type="Day" value="2015-07-17Z">
<Rep D="SSE" H="82.6" P="1012" S="5" T="13.5" V="30000" W="7" Pt="F" Dp="10.6">0</Rep>
<Rep D="S" H="91.1" P="1011" S="7" T="12.3" V="20000" W="7" Pt="F" Dp="10.9">60</Rep>
<Rep D="SE" H="90.6" P="1010" S="6" T="13.3" V="9000" W="7" Pt="F" Dp="11.8">120</Rep>
<Rep D="N" H="94.2" P="1009" S="2" T="12.8" V="7000" W="8" Pt="F" Dp="11.9">180</Rep>
<Rep D="W" H="94.3" P="1008" S="1" T="13.1" V="8000" W="8" Pt="F" Dp="12.2">240</Rep>
<Rep D="SE" H="93.7" P="1007" S="8" T="14.6" V="6000" W="12" Pt="F" Dp="13.6">300</Rep>
<Rep D="SSE" H="95.5" P="1007" S="10" T="14.6" V="4500" W="15" Pt="F" Dp="13.9">360</Rep>
<Rep D="SSE" H="96.8" P="1006" S="10" T="15.0" V="3500" W="8" Pt="F" Dp="14.5">420</Rep>
<Rep D="SSE" H="95.6" P="1006" S="13" T="15.9" V="4000" W="5" Pt="F" Dp="15.2">480</Rep>
<Rep D="SSE" H="85.8" P="1006" S="13" T="17.5" V="7000" W="7" Pt="F" Dp="15.1">540</Rep>
<Rep D="SW" H="81.5" P="1006" S="9" T="18.1" V="20000" W="7" Pt="F" Dp="14.9">600</Rep>
<Rep D="WSW" H="68.9" P="1006" S="21" T="18.6" V="30000" W="7" Pt="R" Dp="12.8">660</Rep>
<Rep D="SW" G="34" H="62.4" P="1006" S="19" T="18.6" V="40000" W="3" Pt="R" Dp="11.3">720</Rep>
<Rep D="SW" G="30" H="52.4" P="1007" S="19" T="20.0" V="40000" W="7" Pt="R" Dp="10.0">780</Rep>
<Rep D="SW" G="33" H="52.4" P="1007" S="23" T="20.1" V="40000" W="3" Pt="R" Dp="10.1">840</Rep>
<Rep D="SW" G="33" H="53.4" P="1007" S="21" T="19.8" V="30000" W="7" Pt="R" Dp="10.1">900</Rep>
<Rep D="WSW" G="33" H="50.2" P="1007" S="22" T="18.5" V="40000" W="3" Pt="R" Dp="8.0">960</Rep>
<Rep D="SW" G="31" H="48.8" P="1007" S="18" T="18.4" V="40000" W="3" Pt="R" Dp="7.5">1020</Rep>
<Rep D="WSW" G="30" H="53.0" P="1008" S="18" T="16.9" V="40000" W="7" Pt="R" Dp="7.3">1080</Rep>
<Rep D="WSW" G="32" H="57.4" P="1008" S="17" T="16.3" V="40000" W="1" Pt="R" Dp="7.9">1140</Rep>
<Rep D="SW" G="30" H="64.9" P="1008" S="14" T="14.9" V="40000" W="1" Pt="R" Dp="8.4">1200</Rep>
<Rep D="SW" G="30" H="67.5" P="1009" S="18" T="14.3" V="40000" W="0" Pt="R" Dp="8.4">1260</Rep>
<Rep D="SW" H="71.2" P="1009" S="15" T="13.6" V="40000" W="0" Pt="R" Dp="8.5">1320</Rep>
<Rep D="SW" H="75.5" P="1009" S="13" T="13.1" V="40000" W="0" Pt="R" Dp="8.9">1380</Rep>
</Period>
<Period type="Day" value="2015-07-18Z">
<Rep D="SW" H="73.4" P="1009" S="10" T="12.6" V="40000" W="0" Pt="R" Dp="8.0">0</Rep>
</Period>
</Location>
</DV>
</SiteRep>
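
For what it’s worth, here is a minimal sketch of the kind of parsing script I mean, using the XML package from the Wikipedia post. It assumes the file has been saved as weather.xml (a made-up name) and relies only on the attribute layout shown above:

library(XML)

# Parse one downloaded observation file (hypothetical filename)
doc <- xmlParse("weather.xml")

# Each <Rep> node holds one hour's observations as attributes
attrs <- lapply(getNodeSet(doc, "//Rep"), xmlAttrs)

# Not every hour has every attribute (e.g. G, the gust speed), so collect
# the full set of names and let the missing ones become NA
cols <- unique(unlist(lapply(attrs, names)))
obs <- as.data.frame(do.call(rbind, lapply(attrs, function(a) a[cols])),
                     stringsAsFactors = FALSE)
names(obs) <- cols

# Everything arrives as text, so convert the temperature before analysing it
obs$T <- as.numeric(obs$T)
head(obs)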