Sooner or later in your discovery of R, you are going to want to answer a question based on data from Wikipedia. The question that struck me, as I was listening to the Today Programme on Radio 4, was, “how big are parliamentary constituencies and how much variation is there between them?”
The data was easy to find on Wikipedia – but how to get it into R? You could, of course, just cut and paste it into Excel, tidy it up, save it as a CSV and read that into R. As a learning exercise, though, I wanted an approach that could cope with tables that keep changing, and the Excel route becomes tedious if you have to repeat it often.
Google gave me quite a bit of information. The first problem is that Wikipedia is an https site, so you need some extra gubbins in the URL request. Then you need to find the position of the table on the page in order to read it. Trial and error works fine, and which = 2 did the job in this case. I then tidied up the column names.
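If trial and error feels too hit-and-miss, there is a slightly more systematic way to find the right table index: call readHTMLTable with no which argument, which returns a list of every table on the page, and inspect their sizes. A small sketch on a made-up two-table HTML string (so it runs without hitting the network – on the real page you would pass the content fetched by getURL instead):

```r
library(XML)

# Hypothetical two-table page standing in for the Wikipedia HTML
html <- "<html><body>
  <table><tr><th>Note</th></tr><tr><td>infobox</td></tr></table>
  <table><tr><th>Constituency</th></tr><tr><td>Aldershot</td></tr><tr><td>Ashfield</td></tr></table>
</body></html>"

# With no "which" argument, readHTMLTable returns a list of all tables
tables <- readHTMLTable(html, stringsAsFactors = FALSE)

length(tables)        # how many tables the page contains
sapply(tables, nrow)  # row counts: the big one is the table you want
```

On the real constituencies page, the table with ~650 rows stands out immediately, and its position in the list is the value to pass as which.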
library(RCurl)
library(XML)

theurl <- getURL("https://en.wikipedia.org/wiki/List_of_United_Kingdom_Parliament_constituencies",
                 .opts = list(ssl.verifypeer = FALSE))
my.DF <- readHTMLTable(theurl, which = 2, header = TRUE, stringsAsFactors = FALSE)
colnames(my.DF) <- c("Constituency", "Electorate_2000", "Electorate_2010",
                     "LargestLocalAuthority", "Country")
head(my.DF)

              Constituency Electorate_2000 Electorate_2010 LargestLocalAuthority Country
1                Aldershot          66,499          71,908             Hampshire England
2      Aldridge-Brownhills          58,695          59,506         West Midlands England
3 Altrincham and Sale West          69,605          72,008    Greater Manchester England
4             Amber Valley          66,406          69,538            Derbyshire England
5  Arundel and South Downs          71,203          76,697           West Sussex England
6                 Ashfield          74,674          77,049       Nottinghamshire England
It looks fine. As expected, it's a data frame with 650 observations and 5 variables. The problem is that the commas in the numbers have made R interpret them as text, so the class of the two electorate variables is "chr" rather than "num". A simple as.numeric doesn't work – it just produces NAs. You have to remove the commas first, which you can do with gsub.
my.DF$Electorate_2000 <- as.numeric(gsub(",", "", my.DF$Electorate_2000))
my.DF$Electorate_2010 <- as.numeric(gsub(",", "", my.DF$Electorate_2010))
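To see why the straight conversion fails, here is the same idea on a made-up vector (no scraping required):

```r
x <- c("66,499", "71,908", "58,695")

# Commas defeat the parser: every element becomes NA (with a warning)
as.numeric(x)

# Strip the commas first, then convert
as.numeric(gsub(",", "", x))   # 66499 71908 58695
```

A quick sanity check after converting the real columns is to confirm that no NAs crept in, e.g. any(is.na(my.DF$Electorate_2010)) should be FALSE.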
So now I can answer the question.
summary(my.DF$Electorate_2010)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21840   66290   71310   70550   75320  110900

hist(my.DF$Electorate_2010, breaks = 20,
     main = "Size of UK Parliamentary Constituencies",
     xlab = "Number of Voters", xlim = c(20000, 120000))
I’ve tried this approach on a couple of other tables and it seems to work OK, so good luck with your Wikipedia data scraping.