The purpose of this case is
• to complete a first case study as part of the course
• to answer three questions
The stakeholders: This analysis is meant as a contribution to the work of the members of the Manhattan Chocolate Society. We assume they are interested in a summary of their research on the quality of chocolates around the world.
The first level of questions to be answered is
• Where are the best cocoa beans grown?
• Which countries produce the highest-rated bars?
• What’s the relationship between cocoa solids percentage and rating?
Out of personal interest, we will also check whether Belgium, my country and a supposedly top-class chocolate producer, gets better results than the average of the other countries.
The dataset: This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown. The dataset has a rating of 7.65 on Kaggle. I have no information about how the dataset was actually created. Qualitative information such as the ratings is subjective. For the exercise I am doing here, this is not an issue.
The dataset is small, so the analysis could easily be completed in Google Sheets. We will nevertheless try to import the .csv into RStudio in order to practice R and to generate ggplot2 visualizations.
Cleaning of the data
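The ranking output further down shows labels such as "Domincan Republic" and "Grenada.," in BroadBeanOrigin, so a cleaning step along the following lines may help. This is only a sketch on a made-up three-row table: the CocoaPercent column stored as text like "70%" and the clean_origin() helper are assumptions, not the steps actually applied to the file.

```r
library(dplyr)
library(stringr)

# Toy rows imitating the kinds of issues visible in the ranking output (made-up data)
raw <- tibble(
  BroadBeanOrigin = c("Domincan Republic, Madagascar", "Grenada., Papua New Guinea", " Peru "),
  CocoaPercent    = c("70%", "75%", "63.5%"),
  Rating          = c(4, 3.5, 3.25)
)

# Hypothetical helper: fix a known misspelling, drop stray dots, squeeze spaces
clean_origin <- function(x) {
  x %>%
    str_replace_all("Domincan", "Dominican") %>%
    str_replace_all("\\.", "") %>%
    str_squish()
}

cleaned <- raw %>%
  mutate(
    BroadBeanOrigin = clean_origin(BroadBeanOrigin),
    # "70%" -> 70 (numeric), assuming percentages are stored as text
    CocoaPercent = as.numeric(str_remove(CocoaPercent, "%"))
  )

print(cleaned)
```

Removing every dot also shortens abbreviations such as "Haw." to "Haw", which may or may not be wanted.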
Output to be generated to answer the questions
• Where are the best cocoa beans grown? Ranking of the bean origin countries by their rating.
• Which countries produce the highest-rated bars? Company and Rating, sorted by rating in descending order. Top 10.
• What’s the relationship between cocoa solids percentage and rating? Average rating grouped by CocoaPercent.
Note: setting up my environment by loading the ‘tidyverse’, ‘readxl’ and ‘kableExtra’ packages
install.packages("kableExtra")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readxl)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Choc <- read_excel("Chocolat_Bean_bg_v1 .xlsx")
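# Average rating per bean origin, highest first
# (a descriptive name avoids masking base::c)
origin_ratings <- Choc %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating)) %>%
  arrange(desc(Avg_rat))
head(origin_ratings, 20)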
## # A tibble: 20 × 2
## BroadBeanOrigin Avg_rat
## <chr> <dbl>
## 1 Domincan Republic, Madagascar 4
## 2 Grenada., Papua New Guinea, Haw., Haiti, Madagascar 4
## 3 Guatemala., Dominican Republic, Peru, Madagascar., Papua New Guinea 4
## 4 Peru, Dominican Republic 4
## 5 Venezuela, Bolivia, Dominican Republic 4
## 6 Venezuela, Java 4
## 7 Domincan Republic, Bali 3.75
## 8 Dominican Republic, Ecuador, Peru 3.75
## 9 Papua New Guinea, Vanuatu, Madagascar 3.75
## 10 Peru, Belize 3.75
## 11 Venezuela, Africa, Brasil, Peru, Mexico 3.75
## 12 Venezuela, Ecuador, Peru, Nicaragua 3.75
## 13 South America 3.67
## 14 Tobago 3.62
## 15 Indonesia, Ghana 3.5
## 16 Madagascar, Java, Papua New Guinea 3.5
## 17 Peru, Madagascar, Dominican Republic 3.5
## 18 Trinidad, Ecuador 3.5
## 19 Venezuela., Indonesia, Ecuador 3.5
## 20 Solomon Islands 3.44
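An average of exactly 4 for several origins suggests that some groups may contain very few bars. The following is a minimal sketch, on made-up data, of adding the number of bars per origin and flagging blends (here assumed to be any origin label containing a comma), so that thinly populated groups can be filtered out. The min_bars threshold is an arbitrary illustration.

```r
library(dplyr)
library(stringr)

# Made-up ratings; only the column names follow the report
toy <- tibble(
  BroadBeanOrigin = c("Peru, Belize", "Venezuela", "Venezuela", "Venezuela",
                      "Madagascar", "Madagascar", "Venezuela, Java"),
  Rating          = c(3.75, 3.5, 3.0, 3.25, 3.5, 3.25, 4.0)
)

min_bars <- 2  # arbitrary threshold, for illustration only

origin_summary <- toy %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating), n_bars = n()) %>%
  # Assumption: a comma in the label means the bar blends several origins
  mutate(is_blend = str_detect(BroadBeanOrigin, ",")) %>%
  arrange(desc(Avg_rat))

# Keep only origins backed by at least min_bars bars
reliable <- filter(origin_summary, n_bars >= min_bars)

print(origin_summary)
print(reliable)
```

In this toy table the two blends lead the unfiltered ranking but each rests on a single bar, so they drop out once the threshold is applied.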
[Three plots followed here; the images are not available.]
The best-rated chocolates are made from blended beans: origin labels listing several countries hold the first 12 places and 17 of the top 20.
The best chocolate producers are located all over the world. Most Belgians and Swiss believe they produce the best chocolates, yet they appear only at ranks 11 and 20 in the ranking.
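One way such a country ranking could be produced, and Belgium's position compared with the overall average, is sketched below. The data are invented and the CompanyLocation column name is an assumption; only Rating appears in the code shown earlier.

```r
library(dplyr)

# Invented ratings; CompanyLocation is an assumed column name
bars <- tibble(
  CompanyLocation = c("Belgium", "Belgium", "France", "France", "U.S.A.", "Chile"),
  Rating          = c(3.0, 3.25, 3.5, 3.0, 3.25, 3.75)
)

country_rank <- bars %>%
  group_by(CompanyLocation) %>%
  summarise(Avg_rat = mean(Rating), n_bars = n()) %>%
  arrange(desc(Avg_rat)) %>%
  # Tied averages share the same rank
  mutate(rank = min_rank(desc(Avg_rat)))

overall_avg <- mean(bars$Rating)

# Belgium's row, with its gap to the average over all bars
belgium <- country_rank %>%
  filter(CompanyLocation == "Belgium") %>%
  mutate(diff_vs_overall = Avg_rat - overall_avg)

print(head(country_rank, 10))
print(belgium)
```

A negative diff_vs_overall would mean Belgian bars score below the average bar in the table; the n_bars column helps judge how much weight each country's average deserves.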
Surprise: the less cocoa, the better the rating. One possible explanation: people are increasingly used to sweet, sugar-rich products, so current tastes reward sweeter chocolates (with less cocoa and more sugar) over highly concentrated cocoa products (which are often the more expensive ones). In other words, if you want to make money, use less cocoa and more sugar and you will get a better return.
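The relationship between cocoa percentage and rating can be checked with a grouped average, a correlation and a scatter plot. The sketch below uses invented values and assumes CocoaPercent is a numeric column; it shows an association only and says nothing about sugar content or price.

```r
library(dplyr)
library(ggplot2)

# Invented values; CocoaPercent as a numeric column is an assumption
bars <- tibble(
  CocoaPercent = c(60, 60, 70, 70, 70, 85, 85, 100),
  Rating       = c(3.5, 3.75, 3.25, 3.5, 3.0, 3.0, 2.75, 2.0)
)

# Average rating and number of bars per cocoa percentage
by_cocoa <- bars %>%
  group_by(CocoaPercent) %>%
  summarise(Avg_rat = mean(Rating), n_bars = n())

# Pearson correlation on the individual bars (negative in this toy data)
cocoa_cor <- cor(bars$CocoaPercent, bars$Rating)

# Scatter plot with a fitted straight line; print(p) draws it
p <- ggplot(bars, aes(x = CocoaPercent, y = Rating)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Cocoa solids (%)", y = "Rating")

print(by_cocoa)
print(cocoa_cor)
```

Computing the correlation on individual bars rather than on the grouped averages keeps percentages with a single bar from weighing as much as common ones.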
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.