---
title: "Grupo Bimbo Test"
author: "ChenHsi Shen & Thomas"
date: "2017-05-08"
output:
  html_document:
    theme: cosmo
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Please write down your working directory below.
knitr::opts_knit$set(root.dir = normalizePath("~/Documents/taipei_uni/grupo/data/"))
```

***

## Easy Manipulation and Visualization

***

Reference: https://www.kaggle.com/fabienvs/grupo-bimbo-data-analysis

### Step 1: Load all the libraries you need

```{r, include=FALSE}
library(dplyr)
library(readr)
library(ggplot2)
library(scales)
library(treemap)
library(data.table)
library(tidyverse)
library(maptools)
```

***

### Step 2: Read the data

```{r}
train   <- fread('raw/adjusted_data.csv')
town    <- fread('raw/town_state.csv')
client  <- fread('raw/cliente_tabla.csv')
Product <- fread('raw/producto_tabla.csv')
```

### Step 3: Start the Exploratory Data Analysis

---

### Part 1: The Present Variables

#### Week

First, we check whether demand differs between weeks.

***

```{r}
Week <- train %>%
  group_by(Semana) %>%
  summarise(Net = sum(Demanda_uni_equil))

ggplot(Week) +
  geom_bar(aes(x = as.factor(Semana), y = Net),
           stat = 'identity', col = '#317778', fill = '#317778') +
  labs(title = 'Demand by Week', x = 'Week', y = 'Demand')
```

We can see there is no obvious difference among weeks.

***

#### Agencias

According to the introduction on Kaggle, an Agencia is the sales depot where inventory is stored before delivery. We use a treemap to see how demand is distributed across Agencias.

***

```{r}
Agencias <- train %>%
  group_by(Agencia_ID) %>%
  summarise(Sales = sum(Demanda_uni_equil))

treemap(Agencias,
        index = colnames(Agencias)[1],
        vSize = colnames(Agencias)[2],
        vColor = '#317778',
        title = 'Demand by Agencias')
```

However, the color channel carries no information here, so we can encode a further variable on this treemap. We choose `Venta_hoy`, the sales value, as the color of the chart, and keep only the 60 Agencias with the highest demand.

```{r}
Agencias <- train %>%
  group_by(Agencia_ID) %>%
  summarise(Sales = sum(Demanda_uni_equil),
            Price = mean(Venta_hoy)) %>%
  arrange(desc(Sales))

treemap(Agencias[1:60,],
        index = colnames(Agencias)[1],
        vSize = colnames(Agencias)[2],
        vColor = colnames(Agencias)[3],
        palette = c('#FFFFFF', '#FFFFFF', '#025055'),
        type = 'value',
        title.legend = 'Average Price',
        title = 'Demand by Agencias')
```
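As a quick sanity check on that top-60 cut, we can compute how much of the total demand those Agencias actually cover. This is a minimal sketch, not part of the original analysis; it only reuses the `Agencias` table built in the previous chunk.

```{r}
# Agencias is already sorted by demand (descending) in the chunk above,
# so the first 60 rows are exactly the depots shown in the treemap
sum(Agencias$Sales[1:60]) / sum(Agencias$Sales)
```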
#### State

We can merge state information onto the Agencias table.

***

```{r}
Agencias <- left_join(Agencias, town, by = 'Agencia_ID')

AgenciasState <- Agencias %>%
  group_by(State) %>%
  summarise(Sales = sum(Sales))

state_map = c("Tabasco", "Tlaxcala", "Baja California Sur", "Yucatan", "Campeche",
              "Baja California", "Queretaro", "Tamaulipas", "Sinaloa", "San Luis Potosi",
              "Chihuahua", "None", "None", "Quintana Roo", "Veracruz", "Colima",
              "Coahuila", "Mexico", "Guerrero", "Michoacan", "Chiapas", "Guanajuato",
              "Zacatecas", "Durango", "Puebla", "Sonora", "Oaxaca", "Aguascalientes",
              "Jalisco", "Morelos", "Hidalgo", "Nayarit", "Nuevo Leon")
names(state_map) = c("TABASCO", "TLAXCALA", "BAJA CALIFORNIA SUR", "YUCATÁN", "CAMPECHE",
                     "BAJA CALIFORNIA NORTE", "QUERETARO", "TAMAULIPAS", "SINALOA", "SAN LUIS POTOSÍ",
                     "CHIHUAHUA", "ESTADO DE MÉXICO", "Queretaro de Arteaga", "QUINTANA ROO", "VERACRUZ",
                     "COLIMA", "COAHUILA", "MÉXICO, D.F.", "GUERRERO", "MICHOACÁN", "CHIAPAS",
                     "GUANAJUATO", "ZACATECAS", "DURANGO", "PUEBLA", "SONORA", "OAXACA",
                     "AGUASCALIENTES", "JALISCO", "MORELOS", "HIDALGO", "NAYARIT", "NUEVO LEÓN")

AgenciasState$State <- sapply(AgenciasState$State, function(x) state_map[x])
setorderv(AgenciasState, 'Sales', order = -1)

colors = c(rep("#3366FF", 10), rep("#6699FF", 10), rep("#66CCFF", 20))

plot = ggplot()
area = readShapePoly("mexstates/mexstates.shp")
for (i in 1:32) {
  fill = colors[match(area$ADMIN_NAME[i], AgenciasState$State)]
  plot = plot + geom_polygon(data = area[i, 3], aes(long, lat, group = group),
                             colour = alpha("darkred", 1/2), size = 0.7,
                             fill = fill, alpha = .5)
}
print(plot)
```

#### Product

We can also take a look at how products contribute to demand. Before we start, keep the 80/20 rule in mind: a small share of the products is likely to account for most of the demand.

***

```{r}
ProductTrain <- train %>%
  group_by(Producto_ID) %>%
  summarise(Sales = sum(Demanda_uni_equil)) %>%
  mutate(P_Character = factor(Producto_ID, levels = as.character(Producto_ID))) %>%
  arrange(desc(Sales))

ggplot(ProductTrain) +
  geom_bar(aes(x = P_Character, y = Sales),
           stat = 'identity', col = '#317778', fill = '#317778') +
  scale_x_discrete(limits = ProductTrain$P_Character) +
  labs(title = 'Demand by Product', x = 'Product', y = 'Demand')
```

So we keep only the top 250 products, roughly a quarter of them, as the target for our treemap.

```{r}
ProductTrain <- train %>%
  group_by(Producto_ID) %>%
  summarise(Sales = sum(Demanda_uni_equil),
            Price = mean(Venta_hoy)) %>%
  arrange(desc(Sales))

treemap(ProductTrain[1:250,],
        index = colnames(ProductTrain)[1],
        vSize = colnames(ProductTrain)[2],
        vColor = colnames(ProductTrain)[3],
        palette = c('#FFFFFF', '#FFFFFF', '#025055'),
        type = 'value',
        title.legend = 'Average Price',
        title = 'Demand by Product')
```

### Part 2: The Generated Variables

Grupo Bimbo has offered us further information about the products. By manipulating and extracting features from `producto_tabla.csv`, we will find something interesting.

First, we join the two tables by `Producto_ID`.

```{r}
ProductInfo <- train %>% inner_join(Product, by = 'Producto_ID')
```

We can see that `NombreProducto` consists of three parts:

* The full name of the product
* The weight, in either g or Kg
* The packing style, i.e. how many pieces the product contains

In order to extract this information, we use the `stringr` package and some regular expressions.
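Before running the extraction over the whole table, it may help to see the idea on a couple of raw names. This is a minimal illustrative sketch: the two example strings below are made up to mimic the `producto_tabla.csv` layout and are not actual rows of the data.

```{r}
library(stringr)

# Hypothetical product names in the usual "name weight packing brand id" layout
examples <- c("Pan Blanco 567g 1p BIM 1234",
              "Tortillinas 1Kg 10p TR 5678")

# Weight: digits immediately followed by g or Kg
str_match(examples, "(\\d+)(Kg|g)")

# Packing: digits immediately followed by p
str_match(examples, "(\\d+)p")
```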
#### Extraction of Flavor

```{r}
FlavorName <- c('Choco', 'Va(i)?nilla', 'Multigrano', 'Sandwich', 'Clas(s)?ic')

# Flag each row whose product name matches the flavor pattern
FlavorTable <- data.table(
  ifelse(ProductInfo$NombreProducto %in%
           Product$NombreProducto[grep(FlavorName[1], Product$NombreProducto)], 1, 0)
)
colnames(FlavorTable) <- FlavorName[1]

for (i in 2:length(FlavorName)) {
  FlavorTable[, (FlavorName[i]) :=
    ifelse(ProductInfo$NombreProducto %in%
             Product$NombreProducto[grep(FlavorName[i], Product$NombreProducto)], 1, 0)]
}

FlavorTable[1:20,]
```

***

#### Extraction of Weight and Packing

##### Weight

The unit of weight is either g or Kg, so we can extract the exact number by matching that pattern. Since the data.table is fairly large, this may take a while.

```{r}
library(stringr)

weight <- function(x){
  StringList <- strsplit(x, " ")[[1]]
  Index <- grep("\\d+(Kg|g)", StringList)
  # Names without a weight token get NA
  if (length(Index) == 0) return(NA_real_)
  m <- str_match(StringList[Index[1]], "(\\d+)(Kg|g)")
  WeightFinal <- as.numeric(m[2])
  # Convert kilograms to grams so all weights share one unit
  if (identical(m[3], "Kg")) WeightFinal <- WeightFinal * 1000
  return(WeightFinal)
}

w <- unlist(lapply(ProductInfo$NombreProducto, function(x) weight(x)))
w[1:100]
```

##### Packing

The unit of packing is `p`, which refers to pieces. Again we extract the exact number by matching the pattern, and again this may take a while.

```{r, message=FALSE, warning=FALSE}
packing <- function(x){
  StringList <- strsplit(x, " ")[[1]]
  Index <- grep("\\d+p", StringList)
  # Products without a piece count get a packing of 0
  if (length(Index) == 0) return(0)
  Packing <- str_match(StringList[Index[1]], "(\\d+)(p)")[2]
  PackingFinal <- as.numeric(Packing)
  return(PackingFinal)
}

p <- unlist(lapply(ProductInfo$NombreProducto, function(x) packing(x)))
p[1:100]
```

### Step 4: Put it all together

```{r}
NewData <- train %>%
  bind_cols(FlavorTable) %>%
  bind_cols(data.table(w)) %>%
  bind_cols(data.table(p))

fwrite(NewData, file = "train_merged.csv")
```
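As a quick check that everything lined up, a minimal sketch (assuming the chunk above ran without a row-count mismatch between `train` and the generated columns) is to compare dimensions and peek at the exported file.

```{r}
# NewData should have the same number of rows as train, plus the five
# flavor flags and the w (weight) and p (packing) columns
dim(train)
dim(NewData)

# Peek at the first few rows of the file we just wrote
head(fread("train_merged.csv", nrows = 5))
```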