Before building any models, we always look at the descriptive statistics of the data first. Usually there are many variables, and if we print or plot them all in one file, it is hard for people to search and read. A simple R Shiny app handles this easily. I’m currently working on the Kaggle competition “Allstate Claims Severity” (https://www.kaggle.com/c/allstate-claims-severity), whose dataset has more than 100 variables. Here I will show how I built an R Shiny app that lets you select the variable and the plot you want. You can adapt it to any data set and any plot.

Preprocessing data

After downloading the data, we have two files: the training set and the test set. First we combine the two sets. Since the test set doesn’t have the target variable “loss”, we add a column called “loss” filled with NA. We also add a variable to both sets that indicates whether a row comes from the training set or the test set. Then we row-bind the two sets into one data frame. Since the CSV files are large, we save the combined data as an RDS file, which is smaller and faster for the Shiny app to read.

test <- read.csv("test.csv")
training <- read.csv("training.csv")

# There is no loss variable in test file.
test$loss <- NA

test$type <- "test"
training$type <- "training"

alldata <- rbind(training, test)
# saveRDS() needs a file path; the name "alldata.rds" is an assumption
saveRDS(alldata, "alldata.rds")
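
To see the combine-and-save step in isolation, here is a minimal sketch on two tiny made-up data frames. The column values and the temporary file path are assumptions for illustration only; they are not taken from the Allstate data.

```r
# Toy stand-ins for the real files (all values are made up)
training <- data.frame(id = 1:2, cat1 = c("A", "B"), cont1 = c(0.1, 0.7),
                       loss = c(100, 250))
test <- data.frame(id = 3:4, cat1 = c("B", "A"), cont1 = c(0.3, 0.5))

# The test set has no target, so fill it with NA to match the columns
test$loss <- NA

# Mark where each row came from
training$type <- "training"
test$type <- "test"

# rbind() matches data frame columns by name
alldata <- rbind(training, test)

# Write to a temporary file here; a real app would use a fixed path
rds_path <- tempfile(fileext = ".rds")
saveRDS(alldata, rds_path)

# readRDS() restores the saved object
restored <- readRDS(rds_path)

# Cross-tabulate the source of each row against whether loss is missing
table(restored$type, is.na(restored$loss))
```

Reading the file back with readRDS() is the step the Shiny app would perform at startup.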

shinyUI

Now we begin to build our Shiny app. A Shiny app is made of two parts, the “shinyUI” part and the “shinyServer” part. The “shinyUI” part defines the page layout, including the input controls and the placeholders for the output; the “shinyServer” part computes the output.

We use the “selectInput” function three times: to select the data set, the variable, and the type of plot. The selected values are then available on the server side in the list-like object called “input”.

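library(shiny)
library(ggplot2)

# Load the combined data saved in the preprocessing step
# (the file name is an assumption; use the path that was given to saveRDS)
alldata <- readRDS("alldata.rds")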

ui <- shinyUI(fluidPage(
   
   # Application title
   titlePanel("Allstate data Descriptive statistics"),
   
   sidebarLayout(
      sidebarPanel(
        
        # select dataset 
        selectInput("Dataset", "Training/Test", 
                    choices = unique(alldata$type)),
        
        # select variable ([-1] drops the first column, the row id)
        selectInput("Variable", "Column", 
                    choices = colnames(alldata)[-1]),
        
        # select plot type
        selectInput("Plot", "Plot Type", 
                    choices = c("Histogram", "QQplot"))
      ),
      
      mainPanel(
        
        # The plot is called Descriptive and will be created in ShinyServer part
        plotOutput("Descriptive")
      )
   )
))
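
The link between the two parts is easiest to see in a stripped-down app. The sketch below is not part of the Allstate app: the output ID “Chosen” and the use of the built-in iris data are assumptions chosen for illustration. It shows that the first argument of selectInput is the ID under which the selected value appears in “input”.

```r
library(shiny)

# UI: one input control and one output placeholder
ui <- fluidPage(
  selectInput("Variable", "Column", choices = colnames(iris)),
  textOutput("Chosen")
)

# Server: the value picked in the UI arrives as input$Variable
server <- function(input, output) {
  output$Chosen <- renderText({
    paste("You selected:", input$Variable)
  })
}

# Build the app object; it launches when printed at the console
# or when passed to runApp()
app <- shinyApp(ui = ui, server = server)
```

Each time the user changes the selection, Shiny reruns the renderText block because it reads input$Variable.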

ShinyServer

The “ShinyServer” part controls the output of the Shiny app. First we subset the data to the rows and the variable we want. There are two types of variables in this data set, continuous and categorical, so we need different code to plot each. A QQ plot is drawn only for continuous variables.

server <- shinyServer(function(input, output) {
  
  output$Descriptive <- renderPlot({
    
    # subset of data
    plotdata <- alldata[ alldata$type == input$Dataset, input$Variable]
    
    # decide once whether the variable is continuous; is.numeric() also
    # covers "loss", which does not carry the "cont" name prefix
    is_continuous <- is.numeric(plotdata)

    # choose the type of plot
    if (input$Plot == "Histogram") {

      if (is_continuous) {
        # histogram for continuous variable
        ggplot(data.frame(plotdata), aes(x = plotdata)) + geom_histogram()
      } else {
        # barplot for categorical variable
        ggplot(data.frame(plotdata), aes(x = plotdata)) + geom_bar()
      }

    } else if (input$Plot == "QQplot") {

      # a QQ plot needs a continuous variable; show a message instead of
      # an empty panel when a categorical variable is selected
      validate(need(is_continuous,
                    "QQ plots are only available for continuous variables."))

      # QQplot for continuous variables
      ggplot(data.frame(plotdata), aes(sample = plotdata)) + stat_qq()
    }
  }) 
})

# Run the application 
shinyApp(ui = ui, server = server)
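
Because the plotting logic does not depend on Shiny itself, it can also be pulled out into a plain function and tried at the console before it is wired into renderPlot. The following is one possible sketch, not the code used in the app above. The helper name describe_plot, the use of is.numeric() to tell continuous from categorical columns, the bins = 30 setting, and the iris example data are all assumptions.

```r
library(ggplot2)

# Hypothetical helper: returns a ggplot object for one column of values
describe_plot <- function(values, plot_type = c("Histogram", "QQplot")) {
  plot_type <- match.arg(plot_type)
  df <- data.frame(value = values)

  if (plot_type == "Histogram") {
    if (is.numeric(values)) {
      # histogram for a continuous variable
      ggplot(df, aes(x = value)) + geom_histogram(bins = 30)
    } else {
      # bar plot of counts for a categorical variable
      ggplot(df, aes(x = value)) + geom_bar()
    }
  } else {
    # a QQ plot is only defined here for continuous variables
    if (!is.numeric(values)) {
      stop("A QQ plot needs a continuous variable.")
    }
    ggplot(df, aes(sample = value)) + stat_qq()
  }
}

# Try the helper on a built-in data set
p_hist <- describe_plot(iris$Sepal.Length, "Histogram")
p_bar  <- describe_plot(iris$Species, "Histogram")
p_qq   <- describe_plot(iris$Sepal.Length, "QQplot")
```

Inside the server, renderPlot could then call such a helper with the subset of data and input$Plot, which keeps the reactive code short and makes it easier to add further plot types.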