The purpose of this case study is twofold: to complete a first case study as part of the course, and to answer three questions:

• Where are the best cocoa beans grown?
• Which countries produce the highest-rated bars?
• What’s the relationship between cocoa solids percentage and rating?

## 1. The questions

1. The stakeholders: This analysis is a contribution to the work carried out by the members of the Manhattan Chocolate Society. We assume they are interested in a summary of their research on the quality of chocolates around the world.

2. The first level of questions to be answered is:
   • Where are the best cocoa beans grown?
   • Which countries produce the highest-rated bars?
   • What’s the relationship between cocoa solids percentage and rating?

3. Out of personal interest, we will check whether Belgium, my country, which is reputed to be a top-class chocolate producer, gets better results than the average of the other countries.
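The Belgium comparison in point 3 could be sketched as follows. This is only an illustration: the rows are made-up toy values, and the column names `CompanyLocation` and `Rating` are assumptions about how the cleaned data is laid out.

```r
library(dplyr)

# Toy rows for illustration only; not taken from the real dataset.
bars <- data.frame(
  CompanyLocation = c("Belgium", "Belgium", "France", "U.S.A.", "Switzerland"),
  Rating          = c(3.5, 3.0, 3.75, 3.25, 3.0),
  stringsAsFactors = FALSE
)

# Label each bar as Belgian or not, then compare the two averages.
belgium_vs_rest <- bars %>%
  mutate(Group = if_else(CompanyLocation == "Belgium",
                         "Belgium", "Other countries")) %>%
  group_by(Group) %>%
  summarise(Avg_rat = mean(Rating), Bars = n())

belgium_vs_rest
```

Keeping the number of bars next to each average makes it easier to judge how much weight the comparison can bear.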

## 2. Preparation

1. The dataset: This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown. The dataset has a rating of 7.65 on Kaggle. I have no information about how the dataset was actually created. Qualitative information such as the ratings is subjective, but that is not an issue for the exercise I am doing here.

2. The dataset is small, so the analysis could easily be completed in Google Sheets. We will instead try to import the .csv into RStudio in order to practice R and to generate ggplot2 visualizations.

3. Cleaning of the data

• The column “REF” is described as “A value linked to when the review was entered in the database. Higher = more recent.” I don’t really see how to make use of this information.
• Two columns contain origin information: “Specific Bean Origin or Bar name” and “Broad Bean Origin”. Only “Broad Bean Origin” clearly indicates the country of origin, so it is the one that will be used.
• “Bean type” is empty in 888 rows out of 1,796.
• “Amsterdam” as a Company Location has been replaced by “Netherlands”.
• “Rating” has been set as “Numeric”.
• “Cocoa Percent” has been set as “Percent”.
• Several “Company location” values have been corrected.
• “Broad Bean origin” is missing in 74 rows.
• “Broad Bean origin” has been corrected in many cases.
• “Broad Bean origin” has been split with the Split function, using “,” as the delimiter, into 4 rows.
• “Broad Bean origin” is a mix of countries and world regions.
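Some of the cleaning steps listed above could also be done in R rather than in a spreadsheet. The sketch below is a minimal illustration on made-up rows; the column names and the text format of the raw values (for example "70%") are assumptions, not a description of the actual file.

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for the raw file; names and values are assumptions.
raw <- data.frame(
  CompanyLocation = c("Amsterdam", "Belgium", "France"),
  CocoaPercent    = c("70%", "72%", "85%"),
  Rating          = c("3.75", "3.5", "2.75"),
  BroadBeanOrigin = c("Peru, Belize", "Madagascar", "Venezuela"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    # A city was recorded where a country is expected.
    CompanyLocation = if_else(CompanyLocation == "Amsterdam",
                              "Netherlands", CompanyLocation),
    # Turn "70%" into the fraction 0.70.
    CocoaPercent = as.numeric(sub("%", "", CocoaPercent, fixed = TRUE)) / 100,
    # Store the rating as a number instead of text.
    Rating = as.numeric(Rating)
  ) %>%
  # Give every origin of a blend its own row, splitting on the comma.
  separate_rows(BroadBeanOrigin, sep = ",\\s*")

clean
```

With one row per origin, a blended bar contributes its rating to each of its origins, which is a modelling choice rather than a neutral one.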

## 3. Processing of the dataset

Output to be generated to answer the questions:

• Where are the best cocoa beans grown? Rank of the country of bean origin according to rating.
• Which countries produce the highest-rated bars? Company and rating, sorted by rating in descending order; top 10.
• What’s the relationship between cocoa solids percentage and rating? Average rating grouped by CocoaPercent (group_by).
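The second and third outputs described above might look like the following in dplyr. The rows are invented for illustration, and grouping by a `CompanyLocation` column to represent the producing country is an assumption.

```r
library(dplyr)

# Toy rows for illustration only; not taken from the real dataset.
bars <- data.frame(
  CompanyLocation = c("France", "France", "Belgium", "U.S.A.", "U.S.A.", "Italy"),
  CocoaPercent    = c(0.70, 0.75, 0.70, 0.85, 0.70, 0.75),
  Rating          = c(3.75, 3.5, 3.25, 2.75, 3.5, 4.0),
  stringsAsFactors = FALSE
)

# Question 2: average rating per producing country, best first, top 10.
top_producers <- bars %>%
  group_by(CompanyLocation) %>%
  summarise(Avg_rat = mean(Rating)) %>%
  arrange(desc(Avg_rat)) %>%
  head(10)

# Question 3: average rating for each cocoa percentage.
rating_by_cocoa <- bars %>%
  group_by(CocoaPercent) %>%
  summarise(Avg_rat = mean(Rating)) %>%
  arrange(CocoaPercent)

top_producers
rating_by_cocoa
```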

## Setting up my environment

Note: setting up my environment by loading the ‘tidyverse’, ‘readxl’ and ‘kableExtra’ packages and reading the chocolate data into ‘Choc’

install.packages("kableExtra")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::lag()    masks stats::lag()
library(readxl)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
##     group_rows
Choc <- read_excel("Chocolat_Bean_bg_v1 .xlsx")

# Question 1: Top 20 cocoa bean origins by average rating

# Average rating per bean origin, sorted from highest to lowest
top_origins <- Choc %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating)) %>%
  arrange(desc(Avg_rat))
head(top_origins, 20)
## # A tibble: 20 × 2
##    BroadBeanOrigin                                                      Avg_rat
##    <chr>                                                                  <dbl>
##  1 Domincan Republic, Madagascar                                           4
##  3 Guatemala., Dominican Republic,  Peru, Madagascar., Papua New Guinea    4
##  4 Peru, Dominican Republic                                                4
##  5 Venezuela, Bolivia, Dominican Republic                                  4
##  6 Venezuela, Java                                                         4
##  7 Domincan Republic, Bali                                                 3.75
##  8 Dominican Republic, Ecuador, Peru                                       3.75
##  9 Papua New Guinea, Vanuatu, Madagascar                                   3.75
## 10 Peru, Belize                                                            3.75
## 11 Venezuela, Africa, Brasil, Peru, Mexico                                 3.75
## 12 Venezuela, Ecuador, Peru, Nicaragua                                     3.75
## 13 South America                                                           3.67
## 14 Tobago                                                                  3.62
## 15 Indonesia, Ghana                                                        3.5
## 16 Madagascar, Java, Papua New Guinea                                      3.5
## 17 Peru, Madagascar, Dominican Republic                                    3.5
## 19 Venezuela., Indonesia, Ecuador                                          3.5
## 20 Solomon Islands                                                         3.44
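An average computed over very few bars can look better than it deserves, and the original analysis did not check how many bars stand behind each origin. A possible refinement, sketched here on made-up rows, is to keep the number of bars next to the average and to apply a minimum count; the `min_bars` threshold is a hypothetical choice.

```r
library(dplyr)

# Toy rows for illustration only; not taken from the real dataset.
bars <- data.frame(
  BroadBeanOrigin = c("Venezuela, Java", "Peru", "Peru", "Peru",
                      "Madagascar", "Madagascar"),
  Rating          = c(4.0, 3.5, 3.0, 3.25, 3.75, 3.25),
  stringsAsFactors = FALSE
)

# Average rating and number of rated bars per origin, best first.
origin_summary <- bars %>%
  group_by(BroadBeanOrigin) %>%
  summarise(Avg_rat = mean(Rating), Bars = n()) %>%
  arrange(desc(Avg_rat))

# Keep only origins backed by at least `min_bars` bars (arbitrary threshold).
min_bars <- 2
reliable_origins <- filter(origin_summary, Bars >= min_bars)

origin_summary
reliable_origins
```

In this toy example the single-bar blend leads the unfiltered ranking and drops out once the threshold is applied.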

## Conclusions

• The best-rated chocolates are made from blended beans, which occupy most of the top 20 places.

• The best chocolate producers are located all over the world. Most Belgian and Swiss people believe their countries produce the best chocolate, yet they appear only at ranks 11 and 20 in the ranking.

• Surprise: the less cocoa, the better the rating. One possible explanation is that people are increasingly used to sweet, sugar-rich products. Current tastes give more points to sweet chocolates (with less cocoa and more sugar) than to highly concentrated cocoa products (which are often the more expensive ones). In other words, if you want to make money, use less cocoa and more sugar and you will get a better return.
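One way to put a number on the relationship between cocoa percentage and rating is a correlation coefficient together with a simple linear fit. The sketch below uses base R on invented values, so its output says nothing about the real data; the column names are assumptions.

```r
# Toy rows for illustration only; not taken from the real dataset.
bars <- data.frame(
  CocoaPercent = c(0.60, 0.65, 0.70, 0.75, 0.85, 1.00),
  Rating       = c(3.5, 3.5, 3.75, 3.25, 3.0, 2.5)
)

# Pearson correlation: negative means ratings fall as cocoa content rises.
r <- cor(bars$CocoaPercent, bars$Rating)

# Linear fit: the slope is the change in rating per unit of CocoaPercent
# (a unit here is the full range from 0 to 1, i.e. 100 percentage points).
fit   <- lm(Rating ~ CocoaPercent, data = bars)
slope <- coef(fit)[["CocoaPercent"]]

r
slope
```

A straight line may hide a peak at intermediate cocoa percentages, so plotting the points with ggplot2 before trusting the slope would be a sensible next step.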
