Initial data analysis in the example of Pokémon species
CSTAT Data visualization seminar series
Dr. Lara Lusa, Natural Sciences and Information Technologies, University of Primorska and Institute for Biostatistics and Medical Informatics, University of Ljubljana, Slovenia
Initial Data Analysis (IDA) consists of all steps performed on the data of a study between the end of data collection and the start of the statistical analyses that address the research questions. The value of an effective IDA strategy for data analysts lies in ensuring that data are of sufficient quality, that model assumptions made in the analysis strategy are satisfied and adequately documented, and in supporting decisions for the statistical analyses. Here we focus on the data screening step of IDA, where data properties are examined and effective visualizations are a fundamental tool. The objective of our work is to present recommendations on how to implement an IDA plan, how to create visualizations that are effective for IDA, and how to make use of the IDA findings.
We present tutorial examples on how to conduct IDA data screening based on Pokémon data. More than 1000 different Pokémon species exist, which can be grouped into evolution chains. Several statistics and other information describing each species are available, including their weight, height, (proportion of) gender, along with numerous statistics that describe the Pokémon’s ability in battle.
We define two research questions: (i) what are the predictors of a Pokémon’s height? (ii) some Pokémon species have unknown gender; what are the predictors of unknown gender? We present a brief statistical analysis plan (SAP) and develop an IDA data screening plan for the two research questions. We then present the IDA report, which is implemented in the R language using a reproducible markup document that includes numerous visualizations. We discuss the interpretation of the IDA results and their consequences, which indicate possible changes to the suggested SAP. We end by briefly discussing the use of IDA in the context of high-dimensional data, where the number of variables is extremely large.