This dataset contains 3808 observations (businesses) and 225 variables (measures of economic scale), so we first look at some simple statistics. I chose to analyze the first 17 variables because the full set is too large to present in a graph, and because these variables are easier to interpret than the others.

Descriptive statistics

Simple statistics

First, we should look at the simple statistics for each variable. Because the dataset is large, the summary table is placed in the appendix; it is worth reading before continuing.

The distribution of a single variable

Next, we examine the dataset variable by variable, using the ggplot function to plot the distribution of each one.
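The plotting code is not reproduced in this report. As a minimal sketch of how such a distribution plot could be drawn with ggplot2, assuming a hypothetical data frame `firms` filled with simulated values (the bin count and labels are also arbitrary choices):

```r
library(ggplot2)

# Hypothetical stand-in for the real data: right-skewed values,
# two negative entries and one missing value.
set.seed(1)
firms <- data.frame(
  Revenue = c(rlnorm(500, meanlog = 19, sdlog = 2), -1e8, -5e7, NA)
)

# Histogram of a single variable; na.rm = TRUE drops the missing value quietly.
p <- ggplot(firms, aes(x = Revenue)) +
  geom_histogram(bins = 50, na.rm = TRUE) +
  labs(title = "Distribution of Revenue", x = "Revenue", y = "Number of firms")
print(p)
```

The same call can be reused for the other variables by changing the column named in `aes()`.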

Revenue

The plot shows that revenue is highly right-skewed and that some businesses even report negative revenue.

Revenue Growth amount

Next, we analyze the distribution of the revenue growth amount:

The distribution of the revenue growth amount is fairly symmetric about its center (i.e. 0), and a high percentage of the growth rates are concentrated around 0. We can check this by observing that only about 5% of the observations have an absolute Revenue.Growth greater than 1.
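A check of this kind can be written in one line. The sketch below uses a short made-up vector in place of the real Revenue.Growth column, so the resulting share is not the 5% reported above:

```r
# Hypothetical stand-in for the Revenue.Growth column (NA marks a missing value).
growth <- c(-1.77, -0.20, 0.00, 0.03, 0.06, 0.10, 0.19, 0.45, 2.50, NA)

# The mean of a logical vector is the share of TRUE values;
# na.rm = TRUE excludes the missing entry from the count.
share_large <- mean(abs(growth) > 1, na.rm = TRUE)
share_large
```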

Cost of Revenue

Next is the cost of revenue of each firm:

Similarly, the distribution of the cost of revenue is highly skewed to the right: about 76% of the values lie below roughly 1e9, far under the mean of roughly 3e9.
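To illustrate why the mean sits so far above most of the values in a right-skewed variable, here is a small sketch with a made-up vector standing in for Cost.of.Revenue (the numbers are illustrative only):

```r
# Hypothetical right-skewed stand-in for Cost.of.Revenue.
cost <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)

avg <- mean(cost)                      # pulled upward by the single large value
share_below_mean <- mean(cost < avg)   # share of observations below the mean
q76 <- quantile(cost, probs = 0.76)    # value below which 76% of observations fall
```

Comparing `q76` with `avg` gives the same kind of contrast as the one described above.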

The gross profit

The last distribution we present is that of the gross profit:

We see that the distribution of the gross profit is strongly concentrated around 0, an interesting phenomenon that runs against common sense. It suggests that most of the businesses did not earn much money in 2014, and the pattern may well persist over the next few years.

EDA

sector vs. grossProfit

Next, we examine the index plot of the gross profit. In addition, we label each observation by its sector so that we can inspect the distribution of gross profit in each industry directly.

We can see that:
1. Gross profit is quite low in the ‘real estate’ industry.
2. Gross profit in ‘communication services’ and ‘financial services’ is generally higher than in the other industries.
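A sketch of an index plot coloured by sector is given below. The data frame `firms`, the column name `Sector`, and all values are assumptions made for illustration; only `Gross.Profit` follows a column name from the summary table. The per-sector median at the end is an added numeric counterpart to the visual comparison.

```r
library(ggplot2)

# Hypothetical stand-in data; Sector is an assumed column name.
set.seed(2)
firms <- data.frame(
  Sector = rep(c("Real Estate", "Financial Services", "Communication Services"),
               each = 40),
  Gross.Profit = c(rnorm(40, 1e8, 5e7), rnorm(40, 8e8, 3e8), rnorm(40, 9e8, 4e8))
)
firms$Index <- seq_len(nrow(firms))

# Index plot: one point per firm, coloured by its sector.
p <- ggplot(firms, aes(x = Index, y = Gross.Profit, colour = Sector)) +
  geom_point() +
  labs(x = "Observation index", y = "Gross profit")
print(p)

# Median gross profit per sector, to back up what the plot suggests.
med <- tapply(firms$Gross.Profit, firms$Sector, median)
```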

Missing value handling method description

Since this dataset is huge, it naturally contains a considerable number of missing values to cope with. We first deleted the two categorical variables, since they cannot be used directly in the PCA and heatmap analyses. Next, we deleted two empty variables. We then replaced each missing value with the mean of its column.

We omit the handling code here.
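For readers who want to reproduce the idea, the three steps described above could look roughly like the following. This is a sketch on a tiny made-up data frame, not the code used for the report; `firms`, `Sector` and `Empty.Col` are hypothetical names.

```r
# Hypothetical stand-in for the financial data.
firms <- data.frame(
  Sector       = c("Tech", "Energy", "Tech", "Health"),        # categorical
  Revenue      = c(10, NA, 30, 20),
  Gross.Profit = c(4, 6, NA, 5),
  Empty.Col    = c(NA_real_, NA_real_, NA_real_, NA_real_)     # entirely missing
)

# 1. Keep only the numeric columns (this drops the categorical ones).
num <- firms[sapply(firms, is.numeric)]

# 2. Drop columns that contain no observed value at all.
num <- num[colSums(!is.na(num)) > 0]

# 3. Replace each remaining NA with the mean of its column.
num[] <- lapply(num, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, which is worth keeping in mind when reading the PCA results.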

PCA

At this point, I should admit that the dataset is too large to analyze in full, and that I was quite unfamiliar with some of the variables. Hence I chose some common variables that I had learned about in a financial accounting course. The variables to be analyzed include:

Revenue, Gross Profit, Operating Expenses, Net Income, Cash and cash equivalents, Receivables, Total current assets, Total current liabilities

## [1] "The loading of PCA"

## 
## Loadings:
##                             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Revenue                      0.659         0.221         0.331              
## Cost.of.Revenue              0.498  0.270  0.449 -0.155        -0.107       
## Gross.Profit                 0.161 -0.266 -0.230  0.180  0.395  0.113  0.155
## SG.A.Expense                       -0.155                0.247              
## Operating.Expenses                 -0.232 -0.162  0.169  0.468              
## Net.Income                                        0.197 -0.210 -0.344  0.535
## Consolidated.Income                               0.197 -0.207 -0.344  0.536
## Cash.and.cash.equivalents          -0.761  0.126 -0.600 -0.179              
## Receivables                        -0.181         0.152        -0.772 -0.562
## Total.current.assets         0.420        -0.754        -0.298              
## Total.current.liabilities    0.274  0.110 -0.158 -0.136 -0.282              
## Retained.earnings..deficit.  0.158 -0.371  0.221  0.655 -0.403  0.370 -0.235
##                             Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Revenue                             0.236  0.195   0.551         
## Cost.of.Revenue              0.182 -0.225 -0.205  -0.553         
## Gross.Profit                -0.164  0.369  0.276  -0.620         
## SG.A.Expense                       -0.817  0.478                 
## Operating.Expenses                 -0.232 -0.775                 
## Net.Income                                                -0.707 
## Consolidated.Income                                        0.707 
## Cash.and.cash.equivalents                                        
## Receivables                 -0.104  0.119                        
## Total.current.assets         0.381                               
## Total.current.liabilities   -0.876 -0.124                        
## Retained.earnings..deficit.                                      
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083
## Cumulative Var  0.083  0.167  0.250  0.333  0.417  0.500  0.583  0.667  0.750
##                Comp.10 Comp.11 Comp.12
## SS loadings      1.000   1.000   1.000
## Proportion Var   0.083   0.083   0.083
## Cumulative Var   0.833   0.917   1.000

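We can see an interesting phenomenon here: the sums of squared loadings are all 1. This happens because each column of the loadings matrix is a unit-length eigenvector, not because the princomp function applies a special transformation to the variables. For the same reason, the "Proportion Var" rows above are simply 1/12 ≈ 0.083 for every component; they do not report the variance explained by each component, which is given by summary() on the princomp object.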
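The relation between the loadings printout and the variance explained can be checked on any dataset. The sketch below uses the built-in USArrests data purely as a stand-in for the financial variables, and `cor = TRUE` is a choice made for this example, not necessarily the setting used in the report.

```r
# USArrests ships with R and is used here only as a stand-in dataset.
pca <- princomp(USArrests, cor = TRUE)

# Each column of the loadings matrix is a unit-length eigenvector,
# so the column sums of squares equal 1 by construction.
ss <- colSums(unclass(pca$loadings)^2)

# The variance explained by each component comes from the component
# standard deviations, not from the loadings printout.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop_var)
```

Calling `summary(pca)` prints the same proportions alongside the standard deviations.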

Heatmap

Since this dataset contains roughly 3,800 observations, it is hard to display every firm at once. We delete the observations that contain missing values and select a random sample of 50 firms to plot the heatmap.

The data were standardized before plotting the heatmap, so each column is centered at 0. However, the heatmap shows that some variables deviate from this pattern. For example, “Cost.of.Revenue” contains more observations with high standard scores, suggesting that its distribution contains some extreme outliers.
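The preparation steps for such a heatmap (drop incomplete rows, sample 50 firms, standardize the columns) can be sketched as follows. The data are simulated, `firms` is a hypothetical name, and base R's `heatmap()` is used here as one possible plotting function; the report does not state which one was used.

```r
# Hypothetical stand-in: 200 firms, three right-skewed variables.
set.seed(3)
firms <- data.frame(
  Revenue         = rlnorm(200, 19, 2),
  Cost.of.Revenue = rlnorm(200, 18, 2),
  Gross.Profit    = rlnorm(200, 17, 2)
)
firms$Revenue[c(5, 17)] <- NA   # inject two missing values

complete <- na.omit(firms)                          # drop rows with missing values
sampled  <- complete[sample(nrow(complete), 50), ]  # random sample of 50 firms
z <- scale(sampled)                                 # column-wise z-scores

# The columns are already standardized, so heatmap() should not rescale them.
heatmap(z, scale = "none", Colv = NA)
```

With skewed variables, a few very large z-scores dominate the colour scale, which is how extreme outliers show up in the plot.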

Appendix

Simple statistics

##     Revenue           Revenue.Growth     Cost.of.Revenue     
##  Min.   :-6.276e+08   Min.   :   -1.77   Min.   :-5.456e+08  
##  1st Qu.: 5.789e+07   1st Qu.:    0.00   1st Qu.: 3.136e+06  
##  Median : 4.349e+08   Median :    0.06   Median : 1.414e+08  
##  Mean   : 5.879e+09   Mean   :   12.95   Mean   : 3.701e+09  
##  3rd Qu.: 2.394e+09   3rd Qu.:    0.19   3rd Qu.: 1.200e+09  
##  Max.   : 1.825e+12   Max.   :42138.66   Max.   : 1.537e+12  
##  NA's   :44           NA's   :236        NA's   :74          
##   Gross.Profit         R.D.Expenses         SG.A.Expense      
##  Min.   :-1.105e+09   Min.   :-1.500e+05   Min.   :0.000e+00  
##  1st Qu.: 3.093e+07   1st Qu.: 0.000e+00   1st Qu.:1.549e+07  
##  Median : 1.909e+08   Median : 0.000e+00   Median :7.382e+07  
##  Mean   : 2.188e+09   Mean   : 9.402e+07   Mean   :9.307e+08  
##  3rd Qu.: 8.923e+08   3rd Qu.: 9.911e+06   3rd Qu.:3.510e+08  
##  Max.   : 4.622e+11   Max.   : 1.154e+10   Max.   :1.857e+11  
##  NA's   :52           NA's   :136          NA's   :59         
##  Operating.Expenses   Operating.Income     Interest.Expense    
##  Min.   :-1.088e+09   Min.   :-6.786e+09   Min.   :-2.250e+08  
##  1st Qu.: 3.107e+07   1st Qu.:-1.308e+06   1st Qu.: 0.000e+00  
##  Median : 1.387e+08   Median : 4.104e+07   Median : 2.563e+06  
##  Mean   : 1.438e+09   Mean   : 6.748e+08   Mean   : 1.002e+08  
##  3rd Qu.: 5.885e+08   3rd Qu.: 2.713e+08   3rd Qu.: 4.300e+07  
##  Max.   : 3.056e+11   Max.   : 1.566e+11   Max.   : 3.152e+10  
##  NA's   :63           NA's   :55           NA's   :63          
##  Earnings.before.Tax  Income.Tax.Expense   Net.Income...Non.Controlling.int
##  Min.   :-8.878e+09   Min.   :-2.081e+09   Min.   :-1.587e+09              
##  1st Qu.:-3.733e+06   1st Qu.: 0.000e+00   1st Qu.: 0.000e+00              
##  Median : 2.843e+07   Median : 5.335e+06   Median : 0.000e+00              
##  Mean   : 5.726e+08   Mean   : 1.763e+08   Mean   : 1.558e+07              
##  3rd Qu.: 2.194e+08   3rd Qu.: 5.753e+07   3rd Qu.: 0.000e+00              
##  Max.   : 8.720e+10   Max.   : 3.971e+10   Max.   : 4.917e+09              
##  NA's   :80           NA's   :66           NA's   :149                     
##  Net.Income...Discontinued.ops   Net.Income         Preferred.Dividends 
##  Min.   :-7.015e+09            Min.   :-8.360e+09   Min.   : -12628000  
##  1st Qu.: 0.000e+00            1st Qu.:-3.693e+06   1st Qu.:         0  
##  Median : 0.000e+00            Median : 2.223e+07   Median :         0  
##  Mean   :-3.454e+06            Mean   : 4.894e+08   Mean   :   5343177  
##  3rd Qu.: 0.000e+00            3rd Qu.: 1.658e+08   3rd Qu.:         0  
##  Max.   : 8.368e+09            Max.   : 2.340e+11   Max.   :2741588000  
##  NA's   :149                   NA's   :23           NA's   :149         
##  Net.Income.Com            EPS            
##  Min.   :-8.360e+09   Min.   :-101870898  
##  1st Qu.:-4.369e+06   1st Qu.:         0  
##  Median : 2.101e+07   Median :         1  
##  Mean   : 4.839e+08   Mean   :    -26074  
##  3rd Qu.: 1.631e+08   3rd Qu.:         2  
##  Max.   : 2.340e+11   Max.   :   8028004  
##  NA's   :15           NA's   :72

Midterm- Report

William

21/12/2020
