R for Excel Users – Part 4 – Pasting into a filtered table

If you have ever tried to paste into a filtered table in Excel you will know that this does not work as expected. Excel ignores the filtering and just pastes in the data starting from the first row selected. All the work arounds I’ve ever found didn’t work for me. The only guaranteed method is to sort the data and then manually find the rows where you want to paste in.

Thankfully, in R, it is very easy to complete this task.

## Create data frame
a <- rep(LETTERS[1:5], times = 5)
b <- rep(NA, each = 25)
df <- data.frame(a, b)
 
## Add values to b (column 2) where a is equal to "B"
df[a == "B",2] <- c(6,7,8,9,0)
df
Advertisements

Mapping in R

I’ve been having great fun getting to grips with drawing maps showing data in R. There is a good tutorial here and extra information here and here.

I decided to use the Royal Society for the Protection of Birds (RSPB)  2015 Big Garden Birdwatch. It took some time to align the data they give for “counties” with something I could find maps for. It therefore won’t be completely accurate especially as the result for a region is the simple mean of the sub regions.

It was a fun learning exercise though. Here are a couple of maps.

GrtTitsLvl2GrtTitsLvl3

R for Excel Users – Part 3 – Merging data

In the last part of this series we looked at an R equivalent to VLOOKUP in Excel. Another use of VLOOKUP is to match data on a one-to-one basis. For example, inserting the price of a stock from a master prices table into a table of stock holdings. I can’t believe how easy this is to do in R!

Here are two data frames. One contains the phonetic alphabet and the other contains a message.

> head(phonetic)
  Letter    Code
1      A   Alpha
2      B   Bravo
3      C Charlie
4      D   Delta
5      E    Echo
6      F Foxtrot
> message
   No Letter
1   1      R
2   2      I
3   3      S
4   4      A
5   5      M
6   6      A
7   7      Z
8   8      I
9   9      N
10 10      G

To find the phonetic pronunciation of each letter in the message, all we need to do is to combine the two data frames using merge, ensuring that there is a column by the same name in each data frame. Unfortunately you can either sort it by the column you are using to combine by (“Letter” in this case) or not at all (ie random, or an anagram for free!) To get round this, I sorted  by an extra ID column (called “No”) and then deleted it afterwards.

> result <- merge(message, phonetic, by = "Letter")
> result <- result[order(result$No),]
> result[,2] = NULL
> result
   Letter     Code
8       R    Romeo
4       I    India
9       S   Sierra
2       A    Alpha
6       M     Mike
1       A    Alpha
10      Z     Zulu
5       I    India
7       N November
3       G     Golf

This post from r-statistics.com helped me with the sorting idea and also describes an alternative approach using join from the {plyr} library. I know I need to get into {plyr} and it’s on my list of “known unknowns” for my journey to being competent in R.

Evernote’s OCR

Ray Crowley reminded me about how amazing Evernote’s OCR capabilities are.

It works so well because for each scanned word, it records all the possible different words it thinks it could be. This would obviously be completely impossible with traditional OCR.

If you want to take a look at the OCR in one of your notes it’s very simple.

  1. Copy the note link. It’s fairly easy to find e.g in the windows client, right click the note in the list view. It will look something like: https://www.evernote.com/shard/s317/nl/41374917/d3e3b6b4-7716-473d-8978-e0000008e7ea
  2. You then need to put the last bit of the address (d3e3…..) after https://www.evernote.com/debug/  to get https://www.evernote.com/debug/d3e3b6b4-7716-473d-8978-e0000008e7ea
  3. Put the address in your browser and scroll through all the metadata for the note including a picture at the bottom

ScreenShot223

The possible “false positives” are a small price to pay for such an awesome system.

R for Excel Users – Part 2 – VLOOKUP

So the second in the series. I should just emphasise that this is not an introductory course in R for Excel users. There are lots of great “Introduction to R” courses. What I’m finding, as I make the transition from “Excel for everything” to using the best tool for the job, is that although I’m quite confident now in basic R, there are lots of things which I can do really quickly in Excel but I’m not sure how to do in R. The danger is that, in order to get something done, I end up doing it in Excel when in fact R would be the better tool. This series documents my findings when I have taken the trouble to do the research and figure out how to do it in R.

VLOOKUP has to be my favourite Excel function. I use it all the time. There are various ways of implementing vlookup in R. (See the general resources at the bottom of the first part of this series). I’m going to explain a specific example I used all the time in my previous existence as a teacher. When students do a test they get a numerical score (e.g a percentage) and this is then converted into a grade (A*, A, B etc) based on some defined grade boundaries. (A* is a grade above A in the English education system. U stands for unclassified – a politically correct term for “fail”).

Here are the scores, contained in a data frame called scores.

> scores
        Name   Score
1     Ann       23
2     Bethany   67
3     Charlie   42
4     David     73

Here are the grade boundaries in a data frame called grades. So to get a grade C, for example, you need 50 marks up to 59 (assuming we are working in whole numbers). If you get 60, then it’s a B.

> grades
  Score Grade
1     0     U
2    40     D
3    50     C
4    60     B
5    70     A
6    80     A*

The critical command we need is findinterval. Findinterval needs a minimum of two arguments; what we want to find and the intervals we want to find it in. Our grades data table contains a total of 7 intervals with indices 0 to 6. There is one interval for everything below 0 (not relevant in our case – none of my students were that weak!), one for everything above 80 and five between the 6 numbers in our table.

Intervals

Findinterval returns the interval number:

> findInterval(scores$Score,grades$Score)
[1] 1 4 2 5

So, Ann’s score of 23 is between 0 and 40 which is interval number 1, while David’s score of 73 is between 70 and 80, which is interval number 5. We can then use these numbers, in square brackets, to pull out the correct grade letter from our grades data frame.

Grade <- grades$Grade[findInterval(scores$Score,grades$Score)]

Combining this new vector with our scores data frame using cbind, gives us our new scores data frame, with the grade as well as the score.

> scores  <- cbind(scores,Grade)
> scores
      Name      Score  Grade
1     Ann       23      U
2     Bethany   67      B
3     Charlie   42      D
4     David     73      A

Postscript

As an aside, if in your example the 0 interval (ie below 0% in my case) is meaningful, you would need a slightly different approach. Here’s some ideas to start you off.  One way would be to avoid the problem altogether by making sure that the first number in the lookup table really is less than the minimum you could possibly have (-10 million or whatever). A more elegant solution would be to separate the grades data frame into separate vectors of numbers and grades, with one more entry in the grades vector than the numbers vector. Then add one to the result of findinterval.

> Score <- c(0, 40, 50 , 60 , 70 , 80)
> Grade <- c("Nonsense!", "U", "D", "C", "B", "A","A*")
> Grade[findInterval(-10,Score)+1]
[1] "Nonsense!"
> Grade[findInterval(scores$Score,Score)+1]
[1] "U" "B" "D" "A"

R for Excel Users – Part 1 – Min and Max

I would describe myself as a competent Excel user but very much a beginner in R. There are lots of parallels between the two programs but lots of differences too. In this series I’m going to describe how to do some routine Excel tasks in R and any major differences between the two.

There are some great general resources available for people making the transition and I’ve listed some useful ones at the end.

So let’s start with min and max. Min and max themselves just return the minimum or maximum value of the arguments – numbers, vectors etc.

> max(c(105, 120, 230, 220) , 100)
[1] 230

To compare the values in two vectors in a pairwise fashion we need to use pmin and pmax.

> ref  <- c(105, 120, 230, 220)
> current <- c(100, 120,235,100) 
> pmin(ref,current) 
[1] 100 120 230 100 

This leads to something that is very easy to do in R but difficult in Excel because you need to overwrite existing data (I guess you would use a macro). I have a vector of reference stock prices called ref and a vector of current prices called current. I want to update the reference price to be the maximum of the reference price and the current price.

> ref <- c(105, 120, 230, 220)
> current <- c(100, 120,235,100)
> ref <- pmax(ref, current)
> ref
[1] 105 120 235 220

Elements 1,2 and 4 stay the same but element 3 has been updated with the higher price.

General resources about R for Excel users:

Importing Wikipedia Tables into R – A Beginners Guide

Sooner or later in your discovery of R, you are going to want to answer a question based on data from Wikipedia. The question that struck me, as I was listening to the Today Programme on Radio 4, was, “how big are parliamentary constituencies and how much variation is there between them?”

The data was easy to find on Wikipedia – now how to get it into R? Now you could of course just cut and paste it into Excel, tidy it up, save it as a CSV and read it into R. As a learning exercise though I wanted to be able to cope with future tables that are always changing, and the Excel approach becomes tedious if you have to do it a lot.

Google gave me quite a bit of information. The first problem is that Wikipedia is an https site so you need some extra gubbins in the url. Then you need to find the position of table on the page in order to read it. Trial and error works fine and “which = 2” did the job in this case. I then tidied up the column names.

library(RCurl)
library(XML)
theurl <- getURL("https://en.wikipedia.org/wiki/List_of_United_Kingdom_Parliament_constituencies",.opts = list(ssl.verifypeer = FALSE))
my.DF <- readHTMLTable(theurl, which= 2, header = TRUE, stringsAsFactors = FALSE)
colnames(my.DF) <- c("Constituency" ,  "Electorate_2000","Electorate_2010", "LargestLocalAuthority", "Country")
head(my.DF)

              Constituency Electorate_2000 Electorate_2010 LargestLocalAuthority Country
1                Aldershot          66,499          71,908             Hampshire England
2      Aldridge-Brownhills          58,695          59,506         West Midlands England
3 Altrincham and Sale West          69,605          72,008    Greater Manchester England
4             Amber Valley          66,406          69,538            Derbyshire England
5  Arundel and South Downs          71,203          76,697           West Sussex England
6                 Ashfield          74,674          77,049       Nottinghamshire England
 

It looks fine. As expected,  it’s a data frame with 650 observations and 5 variables. The problem is that the commas in the numbers has made R interpret them as text so the class of the two numeric variables is “chr” not “num”. A simple as.numeric doesn’t work. You to have remove the comma first, which you do with gsub.

my.DF$Electorate_2000 <- as.numeric(gsub(",","", my.DF$Electorate_2000))
my.DF$Electorate_2010 <- as.numeric(gsub(",","", my.DF$Electorate_2010))

So now I can answer the question.

summary(my.DF$Electorate_2010)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21840   66290   71310   70550   75320  110900 
hist(my.DF$Electorate_2010, breaks = 20, main = "Size of UK Parliamentary Constiuencies", xlab = "Number of Voters",xlim = c(20000,120000) )

Answer: About 70 thousand with half of them between 66 and 75 thousand.Constituencies

I’ve tried this approach on a couple of other tables and it seems to work OK, so good luck with your Wikipedia data scraping.