G12SMM/MATH 2011 Statistical Models and Methods
Linear Models, Assessed Coursework — 2019/2020

Please submit your work on Moodle as a pdf file by 3.00pm on Wednesday 13 May 2020. A link “Assessed Coursework Submission” will appear on Moodle in due course for you to do this.

Your solutions should contain relevant R output needed to justify your answers/arguments, together with appropriate discussion, but please do not include pages of irrelevant plots/output which you do not discuss. The easiest way to include R output is to use R Markdown to produce your solutions, but you do not have to do so. You do not need to include your R code, though you can include it if you wish. If you are using R Markdown, and do not wish to include your R code, then you can suppress the R code using the echo = FALSE argument, i.e. enclose the code in an {r, echo=FALSE} environment in the Markdown file.

There will be a Moodle forum specifically for answering queries about the coursework, so you may post questions and I will answer them there so that everyone receives the same assistance. Please be careful to not inadvertently give away parts of your answer if you do post a question. Note that as this is assessed work, I can only answer queries relating to clarification, and I will only answer queries via the forum so that everyone can see my responses. You can change your settings so that you get email notifications of new posts if you wish (I do not think that this is the default setting). Otherwise, please check the forum to see if your query has already been asked.

Unauthorised late submission will be penalised by 5% of the full mark per day. Work submitted more than one week late will receive zero marks.

You are reminded to familiarise yourself with the guidelines concerning plagiarism in assessed coursework (see the student handbook), and note that this applies equally to computer code as it does to written work.

The work contributes 15% to the overall module mark.
Please contact me if you have concerns about/problems with access to computing resources, including R access and installation. (This does not include actually using R for the analysis, as it is expected that you have developed the necessary skills through the computing classes and unassessed/practice coursework. Questions regarding the actual work should be posted on the forum.) I have made an additional document “R on your own machine” which is on Moodle, covering what you should need to complete this work.

The Data

You are a medical statistician who has been tasked with investigating associations between the birthweight of children and various potential explanatory variables. Data are available regarding the birthweight of 327 children, together with various other measurements. The data (referred to as the training data below) are contained in the file BirthTrain.txt on Moodle. The variables are:

age      Age of mother.
gest     Gestation period.
sex      Sex of child.
smokes   Whether the mother smoked during pregnancy, with levels 'No', 'Light' and 'Heavy'.
weight   Pre-pregnancy weight of mother.
rate     Rate of growth of child in the first trimester.
bwt      Birthweight of child.

You can read the data into R (after saving the file in your working directory) using Births

The variables 'smokes' and 'sex' should be treated as factors, the rest as numerical variables. After reading in the data, you should first check that R is treating each variable as intended, and change this behaviour if necessary.

Interest lies in determining the variables associated with birthweight, which could then be investigated further by medical professionals to understand any possible causal relationships.

Additionally, the file BirthTest.txt contains the same measurements for a further 100 individuals. This is to be used for testing the predictive ability of models, and should not be used in any model development. This is referred to as the test data.
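The variable-type check described above might look something like the following. This is only a minimal sketch, not the handout's own reading command: it assumes a whitespace-delimited text file with a header row, and it writes a tiny made-up stand-in file so that it runs on its own. The object name births, the file demo_file, and every value in it (including the coding of sex as 'Female'/'Male') are hypothetical.

```r
# Made-up stand-in rows (NOT the coursework data), written to a temporary
# file so that the sketch is self-contained. The file layout is an assumption.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "age gest sex smokes weight rate bwt",
  "25 39 Male No 60 1.1 3.4",
  "31 40 Female Light 72 1.0 3.6",
  "28 38 Female Heavy 65 0.9 2.9"
), demo_file)

# Assumed reading command: whitespace-delimited text with a header row.
births <- read.table(demo_file, header = TRUE)

# Inspect how R has stored each column.
str(births)

# Store the two categorical variables as factors. Giving the levels of
# smokes explicitly makes 'No' the reference level rather than relying
# on alphabetical ordering.
births$sex    <- factor(births$sex)
births$smokes <- factor(births$smokes, levels = c("No", "Light", "Heavy"))

# Confirm the result: factor for sex and smokes, integer/numeric otherwise.
sapply(births, class)
```

Setting the level order of smokes by hand is a design choice: with the default alphabetical ordering, 'Heavy' would become the baseline category in any fitted model, which is usually less natural to interpret than comparing against non-smokers.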
The Task

(a) Using only the training data, develop a model, or models, for assessing associations between birthweight (the response variable) and the other variables, and discuss your findings. See the notes below for what your analysis for this part should contain. [35]

(b) Use your chosen “best” model(s) from (a) to predict the birthweight of the 100 individuals in the test set. Use appropriate numerical summaries/plots to evaluate the quality of your predictions. How do the predictions compare to those from the model of the form bwt = intercept + age + gest + sex + smokes + weight + rate? [15]

Notes

• For part (a), please structure your analysis as follows
  – An introduction and exploratory analysis, with appropriate plots and summaries which highlight important/interesting aspects of the data [10 marks].
  – A description of your modelling process, showing how you arrive at your final chosen model(s) which best explain the data in a parsimonious way. Justification should involve use of appropriate tests/numerical measures. There may well be more than one good model [20 marks].
  – A non-technical summary of your findings and conclusions, in a manner suitable for reporting back to medical professionals [5 marks].
  Whilst the overall merit of the analysis will also be considered as a whole, around half the marks will be for doing technically correct and relevant things, and half for discussion and interpretation of the output.
• You do not need to (and should not) submit all the output corresponding to everything you do or try. For example, in the exploratory analysis, you may look at quite a number of different plots, and you might do quite a bit of experimentation in the model development stage. You only need to report the important plots/output which justify your decisions and conclusions, and whilst there is no word or page limit, an overly-verbose analysis with unnecessary output will detract from the analysis.
• For the model fitting/selection, you can use any of the techniques we have covered this semester to investigate potential models, including the automated methods of Chapter 6/Case Study 9 and/or manual hypothesis testing.
• Please make use of the help files for R commands. Many functions have optional arguments which might be useful. (This is a good general habit to get into for future R use as well.)
• You do not have to use the methods of Chapter 5, i.e. you do not have to do any transformations/diagnostic plots or assumption checking. However, you may do this if you wish and they could assist in model improvement, but you will not be penalised for not doing so.
• For part (b), you should not be doing any additional model fitting. You are simply using your final model(s) from part (a) to make predictions of birthweight for the individuals in the test set, then comparing the predictions with the true known values.
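
As a rough illustration of the workflow in the notes above (fit on the training data only, then predict the test set and compare with the known values), here is a minimal sketch. It runs on simulated stand-in data, not the coursework files, and everything in it is an assumption made for illustration: the helper names make_data and rmse, the simulated effect sizes, the use of step() as one possible automated selection method (which may or may not match what Chapter 6 covers), and root mean squared error as the numerical summary.

```r
set.seed(1)

# Simulated stand-in data (NOT the coursework data). Column names follow
# the handout; all values and effect sizes are invented.
make_data <- function(n) {
  d <- data.frame(
    age    = round(runif(n, 18, 40)),
    gest   = round(runif(n, 34, 42)),
    sex    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
    smokes = factor(sample(c("No", "Light", "Heavy"), n, replace = TRUE),
                    levels = c("No", "Light", "Heavy")),
    weight = runif(n, 45, 90),
    rate   = runif(n, 0.5, 1.5)
  )
  d$bwt <- 0.15 * d$gest + rnorm(n, sd = 0.4)
  d
}
train <- make_data(327)
test  <- make_data(100)

# Comparison model from part (b): all six variables as main effects.
full_fit <- lm(bwt ~ age + gest + sex + smokes + weight + rate, data = train)

# One possible automated route: AIC-based stepwise selection, using the
# training data only. This stands in for whatever model part (a) produces.
chosen_fit <- step(full_fit, trace = 0)

# Root mean squared prediction error on new data (hypothetical helper).
rmse <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata)
  sqrt(mean((newdata$bwt - pred)^2))
}

# No refitting here: the test set is used for prediction only.
rmse_full   <- rmse(full_fit, test)
rmse_chosen <- rmse(chosen_fit, test)
c(full = rmse_full, chosen = rmse_chosen)

# Predicted against observed birthweight; points near the line y = x
# indicate good predictions.
plot(predict(chosen_fit, newdata = test), test$bwt,
     xlab = "Predicted birthweight", ylab = "Observed birthweight")
abline(0, 1)
```

Other summaries, such as mean absolute error or a plot of prediction errors against each explanatory variable, would follow the same pattern of calling predict() with newdata and comparing against the observed bwt.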