Importing Wikipedia Tables into R – A Beginner's Guide

Sooner or later in your discovery of R, you are going to want to answer a question based on data from Wikipedia. The question that struck me, as I was listening to the Today Programme on Radio 4, was, “how big are parliamentary constituencies and how much variation is there between them?”

The data was easy to find on Wikipedia – now how to get it into R? You could, of course, just cut and paste it into Excel, tidy it up, save it as a CSV and read it into R. As a learning exercise, though, I wanted an approach that would cope with future tables as they change, and the Excel route becomes tedious if you have to do it often.

Google gave me quite a bit of information. The first problem is that Wikipedia is an https site, so you need some extra gubbins in the request. Then you need to find the position of the table on the page in order to read it. Trial and error works fine, and which = 2 did the job in this case. I then tidied up the column names.
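If trial and error feels hit and miss, one option is to read every table on the page first and inspect their shapes before settling on an index. A minimal sketch:

library(RCurl)
library(XML)

# Fetch the page once, then parse every table on it
theurl <- getURL("https://en.wikipedia.org/wiki/List_of_United_Kingdom_Parliament_constituencies",
                 .opts = list(ssl.verifypeer = FALSE))
all.tables <- readHTMLTable(theurl, stringsAsFactors = FALSE)

# How many tables are there, and how many columns does each have?
length(all.tables)
sapply(all.tables, ncol)

With the index settled, the full import looks like this: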

library(RCurl)
library(XML)

# Fetch the page over https (ssl.verifypeer = FALSE skips certificate checking)
theurl <- getURL("https://en.wikipedia.org/wiki/List_of_United_Kingdom_Parliament_constituencies",
                 .opts = list(ssl.verifypeer = FALSE))

# Read the second table on the page, keeping strings as character
my.DF <- readHTMLTable(theurl, which = 2, header = TRUE, stringsAsFactors = FALSE)

# Tidy up the column names
colnames(my.DF) <- c("Constituency", "Electorate_2000", "Electorate_2010",
                     "LargestLocalAuthority", "Country")
head(my.DF)

              Constituency Electorate_2000 Electorate_2010 LargestLocalAuthority Country
1                Aldershot          66,499          71,908             Hampshire England
2      Aldridge-Brownhills          58,695          59,506         West Midlands England
3 Altrincham and Sale West          69,605          72,008    Greater Manchester England
4             Amber Valley          66,406          69,538            Derbyshire England
5  Arundel and South Downs          71,203          76,697           West Sussex England
6                 Ashfield          74,674          77,049       Nottinghamshire England
 

It looks fine. As expected, it's a data frame with 650 observations and 5 variables. The problem is that the commas in the numbers have made R interpret them as text, so the class of the two electorate variables is "chr" rather than "num". A simple as.numeric doesn't work (it just produces NAs). You have to remove the commas first, which you can do with gsub.
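You can confirm this with a quick look at the column classes. A minimal sketch:

# The electorate columns come through as character, not numeric
sapply(my.DF, class)
as.numeric("66,499")   # NA with a warning: the comma defeats the conversion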

# Strip the commas, then convert to numeric
my.DF$Electorate_2000 <- as.numeric(gsub(",", "", my.DF$Electorate_2000))
my.DF$Electorate_2010 <- as.numeric(gsub(",", "", my.DF$Electorate_2010))
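If you had more numeric columns to clean, a small helper avoids repeating yourself. A minimal sketch (to_num and num.cols are names I've made up for illustration):

# Hypothetical helper: strip commas and coerce to numeric
to_num <- function(x) as.numeric(gsub(",", "", x))
num.cols <- c("Electorate_2000", "Electorate_2010")
my.DF[num.cols] <- lapply(my.DF[num.cols], to_num)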

So now I can answer the question.

summary(my.DF$Electorate_2010)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21840   66290   71310   70550   75320  110900 

hist(my.DF$Electorate_2010, breaks = 20,
     main = "Size of UK Parliamentary Constituencies",
     xlab = "Number of Voters", xlim = c(20000, 120000))

Answer: about 70 thousand, with half of them between 66 and 75 thousand.
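And since the question was also about how much variation there is, base R gives the spread directly. A minimal sketch, run after the cleaning step above:

# Measures of spread for the 2010 electorate figures
sd(my.DF$Electorate_2010)      # standard deviation
IQR(my.DF$Electorate_2010)     # interquartile range
range(my.DF$Electorate_2010)   # smallest and largest constituencies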

I’ve tried this approach on a couple of other tables and it seems to work OK, so good luck with your Wikipedia data scraping.
