The purpose of this case study is:
  • to complete a first case study within the framework of the course
  • to answer three questions

1. The questions

  1. The stakeholders: This analysis is a contribution to the work done by the members of the Manhattan Chocolate Society. We assume they are interested in a summary of their research on the quality of chocolates around the world.

  2. The first-level questions to be answered are:
     • Where are the best cocoa beans grown?
     • Which countries produce the highest-rated bars?
     • What’s the relationship between cocoa solids percentage and rating?

  3. Out of personal interest, we will check whether Belgium, my country, which is reputed to be a top-class chocolate producer, gets better results than the average of the other countries.

2. Preparation

  1. The dataset: This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown. The dataset has a rating of 7.65 on Kaggle. I have no information about how the dataset was actually created. Qualitative information (like ratings) is subjective. For the purposes of this exercise, this is not an issue.

  2. The dataset is small, so Google Sheets would be enough to complete the analysis. Instead, we will try to import the .csv into RStudio in order to practice R and to generate ggplot2 visualizations.

  3. Cleaning of the data
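The import and cleaning steps could look roughly like the sketch below. It is only an illustration under stated assumptions: the CSV content is made up, the column names `Company` and `CocoaPercent` are taken from the output plan rather than from the real file, and the idea that cocoa percentages are stored as text such as "70%" is a guess about the raw data.

```r
library(readr)
library(dplyr)

# Write a tiny stand-in CSV so the sketch runs on its own.
# These rows are invented for illustration, not taken from the real dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Company,BroadBeanOrigin,CocoaPercent,Rating",
  "Maker A,Peru,70%,3.50",
  "Maker B,Madagascar,75%,3.75",
  "Maker C,,64%,NA"
), csv_path)

# Read every column with an explicit type so nothing is guessed.
choc_raw <- read_csv(
  csv_path,
  col_types = cols(
    Company = col_character(),
    BroadBeanOrigin = col_character(),
    CocoaPercent = col_character(),
    Rating = col_double()
  )
)

# Cleaning: turn "70%" into the number 70 and drop rows that
# have no rating or no bean origin.
choc_clean <- choc_raw %>%
  mutate(CocoaPercent = parse_number(CocoaPercent)) %>%
  filter(!is.na(Rating), !is.na(BroadBeanOrigin))

print(choc_clean)
```

Declaring the column types up front avoids surprises if a column is guessed differently from what the later steps expect.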

3. Processing of the dataset

Output to be generated to answer the questions:
• Where are the best cocoa beans grown? Ranking of the bean origin countries by rating.
• Which countries produce the highest-rated bars? Company and Rating, sorted by rating in descending order (top 10).
• What’s the relationship between cocoa solids percentage and rating? Average rating, group_by CocoaPercent.
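The first output is computed in the analysis of Question 1. The second and third outputs could be sketched as follows. This is a minimal illustration on invented rows: `toy` stands in for the real `Choc` data, and the column names `Company` and `CocoaPercent` are assumed from the plan above.

```r
library(dplyr)

# Made-up rows for illustration only; the real data lives in `Choc`.
toy <- tibble(
  Company = c("Maker A", "Maker B", "Maker C", "Maker D"),
  CocoaPercent = c(70, 75, 70, 85),
  Rating = c(3.5, 3.0, 4.0, 2.75)
)

# Output 2: company and rating, sorted by rating descending, top 10.
top_bars <- toy %>%
  select(Company, Rating) %>%
  arrange(desc(Rating)) %>%
  head(10)

# Output 3: average rating for each cocoa percentage.
by_cocoa <- toy %>%
  group_by(CocoaPercent) %>%
  summarise(Avg_rat = mean(Rating))

print(top_bars)
print(by_cocoa)
```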

Details of the study can be found on the sagacity blog.


Setting up my environment

Note: the environment is set up by loading the ‘tidyverse’ and ‘choc’ packages.

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##     group_rows
library(readxl)  # read_excel() comes from readxl, which library(tidyverse) does not attach
Choc <- read_excel("Chocolat_Bean_bg_v1 .xlsx")

Analysis of the results and visualization

Question 1: Top 20 cocoa bean origins

# Average rating per bean origin, sorted from highest to lowest
top_origins <- Choc %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating)) %>%
  arrange(desc(Avg_rat))
head(top_origins, 20)
## # A tibble: 20 × 2
##    BroadBeanOrigin                                                      Avg_rat
##    <chr>                                                                  <dbl>
##  1 Domincan Republic, Madagascar                                           4   
##  2 Grenada., Papua New Guinea, Haw., Haiti, Madagascar                     4   
##  3 Guatemala., Dominican Republic,  Peru, Madagascar., Papua New Guinea    4   
##  4 Peru, Dominican Republic                                                4   
##  5 Venezuela, Bolivia, Dominican Republic                                  4   
##  6 Venezuela, Java                                                         4   
##  7 Domincan Republic, Bali                                                 3.75
##  8 Dominican Republic, Ecuador, Peru                                       3.75
##  9 Papua New Guinea, Vanuatu, Madagascar                                   3.75
## 10 Peru, Belize                                                            3.75
## 11 Venezuela, Africa, Brasil, Peru, Mexico                                 3.75
## 12 Venezuela, Ecuador, Peru, Nicaragua                                     3.75
## 13 South America                                                           3.67
## 14 Tobago                                                                  3.62
## 15 Indonesia, Ghana                                                        3.5 
## 16 Madagascar, Java, Papua New Guinea                                      3.5 
## 17 Peru, Madagascar, Dominican Republic                                    3.5 
## 18 Trinidad, Ecuador                                                       3.5 
## 19 Venezuela., Indonesia, Ecuador                                          3.5 
## 20 Solomon Islands                                                         3.44
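Several of the averages above are round values such as 4 or 3.75, which may mean that they rest on very few bars. One way to check this would be to count the bars behind each average and, optionally, rank only the origins with enough ratings. The sketch below uses invented rows, and the threshold `min_bars` is an arbitrary illustrative choice, not something taken from the study.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- tibble(
  BroadBeanOrigin = c("Peru, Belize", "Peru", "Peru", "Peru", "Ecuador", "Ecuador"),
  Rating = c(3.75, 3.5, 3.0, 3.25, 3.0, 3.0)
)

# Same ranking as before, with the number of bars behind each average.
origin_summary <- toy %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating), n_bars = n()) %>%
  arrange(desc(Avg_rat))

# Keep only origins backed by at least `min_bars` ratings.
min_bars <- 2
robust_rank <- filter(origin_summary, n_bars >= min_bars)

print(origin_summary)
print(robust_rank)
```

In this toy example the blend leads the raw ranking on the strength of a single bar and drops out once a minimum count is required.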

World map of the best cocoa bean origins:

Plot this way

Question 2: The best chocolate producers:

Plot this way
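The ranking of producing countries, and the comparison between Belgium and the other countries, could be computed along these lines. This is a sketch on invented rows: `CompanyLocation` is a hypothetical column name for the country where the producer is based, and the ratings are made up.

```r
library(dplyr)

# Made-up rows; `CompanyLocation` is a hypothetical column name.
toy <- tibble(
  CompanyLocation = c("Belgium", "Belgium", "France", "U.S.A.", "Switzerland"),
  Rating = c(3.0, 3.5, 3.75, 3.0, 3.5)
)

# Average rating per producing country, with an explicit rank.
by_country <- toy %>%
  group_by(CompanyLocation) %>%
  summarise(Avg_rat = mean(Rating), n_bars = n()) %>%
  arrange(desc(Avg_rat)) %>%
  mutate(Rank = row_number())

# Belgium's average versus the average of all bars made elsewhere.
belgium_avg <- by_country %>%
  filter(CompanyLocation == "Belgium") %>%
  pull(Avg_rat)

others_avg <- toy %>%
  filter(CompanyLocation != "Belgium") %>%
  summarise(m = mean(Rating)) %>%
  pull(m)

print(by_country)
print(c(Belgium = belgium_avg, Others = others_avg))
```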


  • The best-rated chocolates are made from blended beans. Blends occupy the first 12 places and 17 of the top 20.

  • The best chocolate producers are located all over the world. Most Belgians and Swiss believe they produce the best chocolates, yet their countries appear only at ranks 11 and 20.

  • Surprise: the less cocoa, the better the rating. One possible explanation: people are increasingly used to consuming sweet products, full of sugar. Current tastes give more points to sweet chocolates (with less cocoa and more sugar) than to products with a high cocoa concentration (which are often the more expensive ones). In other words, if you want to make money, use less cocoa and more sugar, and you will get a better return.
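The cocoa observation could be checked with a simple correlation and a scatter plot, as in the sketch below. The rows are invented for illustration and `CocoaPercent` is assumed to be numeric. A negative correlation would only show that the trend exists in the data; it would not by itself confirm the sugar explanation.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only.
toy <- tibble(
  CocoaPercent = c(60, 65, 70, 70, 75, 85, 100),
  Rating = c(3.5, 3.75, 3.5, 3.25, 3.0, 2.75, 2.0)
)

# Pearson correlation: a negative value means ratings tend to fall
# as the cocoa percentage rises.
cocoa_cor <- cor(toy$CocoaPercent, toy$Rating)
print(cocoa_cor)

# Scatter plot with a fitted straight line to visualise the trend.
p <- ggplot(toy, aes(x = CocoaPercent, y = Rating)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Cocoa percentage", y = "Rating")
```

Calling `print(p)` draws the plot in an interactive session.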

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.