So the second in the series. I should just emphasise that this is not an introductory course in R for Excel users. There are lots of great “Introduction to R” courses. What I’m finding, as I make the transition from “Excel for everything” to using the best tool for the job, is that although I’m quite confident now in basic R, there are lots of things which I can do really quickly in Excel but I’m not sure how to do in R. The danger is that, in order to get something done, I end up doing it in Excel when in fact R would be the better tool. This series documents my findings when I have taken the trouble to do the research and figure out how to do it in R.
VLOOKUP has to be my favourite Excel function. I use it all the time. There are various ways of implementing VLOOKUP in R (see the general resources at the bottom of the first part of this series). I’m going to explain a specific example I used all the time in my previous existence as a teacher. When students do a test they get a numerical score (e.g. a percentage) and this is then converted into a grade (A*, A, B etc.) based on some defined grade boundaries. (A* is a grade above A in the English education system. U stands for unclassified – a politically correct term for “fail”.)
Here are the scores, contained in a data frame called scores.
> scores
     Name Score
1     Ann    23
2 Bethany    67
3 Charlie    42
4   David    73
Here are the grade boundaries in a data frame called grades. So to get a grade C, for example, you need 50 marks up to 59 (assuming we are working in whole numbers). If you get 60, then it’s a B.
> grades
  Score Grade
1     0     U
2    40     D
3    50     C
4    60     B
5    70     A
6    80    A*
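If you want to follow along, the two data frames can be rebuilt from the printed output above. This is only a sketch: the values are copied from the tables, and `stringsAsFactors = FALSE` is my own assumption, there so that the grades are stored as plain text rather than factors in older versions of R.

```r
# Test scores, one row per student (values copied from the printed table)
scores <- data.frame(
  Name  = c("Ann", "Bethany", "Charlie", "David"),
  Score = c(23, 67, 42, 73),
  stringsAsFactors = FALSE  # assumption: keep names as character
)

# Grade boundaries: each Score is the lowest mark that earns that Grade.
# The boundaries must be sorted in increasing order for findInterval().
grades <- data.frame(
  Score = c(0, 40, 50, 60, 70, 80),
  Grade = c("U", "D", "C", "B", "A", "A*"),
  stringsAsFactors = FALSE  # assumption: keep grades as character
)
```

Keeping the boundaries in a data frame, rather than hard-coding them, means you can change a boundary in one place and rerun the lookup, much as you would edit the lookup table in Excel.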
The critical function we need is findInterval (note the capital I: R is case-sensitive). findInterval needs a minimum of two arguments: what we want to find and the intervals we want to find it in. The interval boundaries must be sorted in increasing order. Our grades table contains 6 boundaries, which define a total of 7 intervals with indices 0 to 6. There is one interval for everything below 0 (not relevant in our case – none of my students were that weak!), one for everything from 80 upwards and five between the 6 numbers in our table.
findInterval returns the interval number:
> findInterval(scores$Score,grades$Score)
[1] 1 4 2 5
So, Ann’s score of 23 is between 0 and 40 which is interval number 1, while David’s score of 73 is between 70 and 80, which is interval number 5. We can then use these numbers, in square brackets, to pull out the correct grade letter from our grades data frame.
Grade <- grades$Grade[findInterval(scores$Score,grades$Score)]
Combining this new vector with our scores data frame using cbind gives us our new scores data frame, with the grade as well as the score.
> scores <- cbind(scores,Grade)
> scores
     Name Score Grade
1     Ann    23     U
2 Bethany    67     B
3 Charlie    42     D
4   David    73     A
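If you do this lookup a lot, it may be worth wrapping it in a small function and checking what happens exactly on the boundaries. The sketch below is just one way of doing it: `lookup_grade`, `edge_scores` and `edge_grades` are names I have made up for illustration, and the boundaries are the ones from the table above.

```r
# Grade boundaries from the post: lowest mark that earns each grade
grades <- data.frame(
  Score = c(0, 40, 50, 60, 70, 80),
  Grade = c("U", "D", "C", "B", "A", "A*"),
  stringsAsFactors = FALSE
)

# lookup_grade is a hypothetical helper name, not a built-in R function.
# It returns the grade whose boundary is the largest one <= each score.
lookup_grade <- function(score, boundaries) {
  boundaries$Grade[findInterval(score, boundaries$Score)]
}

# Scores chosen to sit on, and just below, the grade boundaries
edge_scores <- c(0, 39, 40, 59, 60, 79, 80, 100)
edge_grades <- lookup_grade(edge_scores, grades)

print(data.frame(Score = edge_scores, Grade = edge_grades))
```

A score of 59 comes out as a C and 60 as a B, so each boundary belongs to the higher grade, which is the same behaviour as an approximate-match VLOOKUP. One thing to watch: a score below 0 gives interval number 0, and R silently drops a 0 index, so the result would be shorter than the input. The postscript deals with that case.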
Postscript
As an aside, if in your example the 0 interval (i.e. below 0% in my case) is meaningful, you would need a slightly different approach. Here are some ideas to start you off. One way would be to avoid the problem altogether by making sure that the first number in the lookup table really is less than the minimum you could possibly have (-10 million or whatever). A more elegant solution would be to split the grades data frame into separate vectors of numbers and grades, with one more entry in the grades vector than in the numbers vector, and then add one to the result of findInterval.
> Score <- c(0, 40, 50, 60, 70, 80)
> Grade <- c("Nonsense!", "U", "D", "C", "B", "A", "A*")
> Grade[findInterval(-10,Score)+1]
[1] "Nonsense!"
> Grade[findInterval(scores$Score,Score)+1]
[1] "U" "B" "D" "A"
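For comparison, here is a sketch of the first idea from the postscript: padding the lookup table with a first row that is lower than any score you could possibly get. The name `grades_padded`, the sentinel value of -10 million and the test scores are my own choices for illustration.

```r
# Lookup table with an extra first row acting as a sentinel.
# -1e7 (-10 million) is an arbitrary "impossibly low" value, not a special constant.
grades_padded <- data.frame(
  Score = c(-1e7, 0, 40, 50, 60, 70, 80),
  Grade = c("Nonsense!", "U", "D", "C", "B", "A", "A*"),
  stringsAsFactors = FALSE
)

# One impossible score (-10) followed by the four real scores from the post
test_scores <- c(-10, 23, 67, 42, 73)

# No "+1" needed: the sentinel row means every score lands in interval 1 or higher
padded_result <- grades_padded$Grade[findInterval(test_scores, grades_padded$Score)]
print(padded_result)
```

This keeps the lookup in a single data frame, at the cost of a slightly odd-looking first row. Both approaches should give the same answers, so which one to use is mostly a matter of taste.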