This dataset contains 3808 observations (businesses) and 225 variables (economic scale measurements). We first take a look at some simple statistics. I chose to analyze the first 17 variables because the whole set of variables is too large to present in a graph, and because these variables are easier to interpret than the others.
First, we should look at the simple statistics for each variable. Because the output is large, the simple statistics are placed in the appendix. Please look at them there before reading on.
Next, we want to examine the dataset carefully, variable by variable, so we apply the ggplot function to check the distribution of each variable.
Take a look at this plot. We see that revenue is highly right-skewed, and that the data even contain some businesses with negative revenue.
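As a rough illustration of the kind of ggplot call meant here, the sketch below draws a histogram of one variable. The data frame `firms` and its simulated `Revenue` column are hypothetical stand-ins, not the real dataset, and the plot is only drawn when the ggplot2 package is installed.

```r
# Hypothetical stand-in for the real data: a right-skewed revenue column.
set.seed(1)
firms <- data.frame(Revenue = rlnorm(500, meanlog = 19, sdlog = 2))

# Draw the histogram only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(firms, ggplot2::aes(x = Revenue)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::labs(x = "Revenue", y = "Number of firms")
  print(p)
}
```

The same call can be repeated for any other numeric column by changing the `x` aesthetic.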
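The two observations above (right skew and negative revenue) can also be checked numerically. The following is a minimal sketch on simulated values, assuming a numeric vector `revenue`; the numbers are made up for illustration and are not the real data.

```r
set.seed(42)
# Hypothetical revenue values: mostly positive and right-skewed,
# plus five negative entries.
revenue <- c(rlnorm(995, meanlog = 19, sdlog = 2),
             -c(1e6, 5e6, 2e7, 1e8, 6e8))

rev_mean   <- mean(revenue)
rev_median <- median(revenue)
n_negative <- sum(revenue < 0)   # number of firms with negative revenue

# A mean far above the median is a simple sign of right skew.
cat("mean:", rev_mean, " median:", rev_median,
    " negative firms:", n_negative, "\n")
```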
Next we analyze the distribution of revenue growth amount:
The revenue growth plot is fairly symmetric about its center (i.e. 0), and a high percentage of the growth rates concentrate around 0. We can check this by observing that the percentage of firms with |Revenue.Growth| > 1 is only about 5%.
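A percentage of this kind can be computed in one line. The sketch below uses a small made-up vector `growth` in place of the real Revenue.Growth column, so the resulting share is not the 5% quoted above.

```r
# Hypothetical growth rates, including one missing value.
growth <- c(-1.5, -0.2, 0, 0.06, 0.19, 2.4, NA)

# Share of non-missing firms whose absolute growth exceeds 1.
share_large <- mean(abs(growth) > 1, na.rm = TRUE)
print(share_large)
```

Taking the mean of a logical vector gives the proportion of TRUE values, and `na.rm = TRUE` leaves firms with missing growth out of the denominator.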
Next is the cost of revenue of each firm:
Similarly, the distribution of the cost of revenue is highly skewed to the right: roughly the lower 76% of the values lie far below the mean (about 1e9 vs. 3e9).
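One way to quantify such a comparison is to compute the share of values below the mean, together with a chosen quantile. This is a sketch on a made-up vector `cost` with a single extreme value, not the real Cost.of.Revenue column.

```r
# Hypothetical costs: nine small values and one extreme outlier.
cost <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)

cost_mean  <- mean(cost)                 # pulled upward by the outlier
below_mean <- mean(cost < cost_mean)     # share of values under the mean
q76        <- quantile(cost, probs = 0.76)

cat("mean:", cost_mean, " share below mean:", below_mean,
    " 76% quantile:", q76, "\n")
```

When most of the values sit well under the mean, as here, the mean is being driven by a few very large firms.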
The last item we should present is the distribution of the gross profit:
We see that the distribution of the gross profit is strongly concentrated around 0. This is an interesting phenomenon that goes against common sense. It suggests that the majority of the businesses did not earn much money in 2014, and we can even imagine that the same will hold over the next few years.
Next we examine the index plot of the gross profit. In addition, we label each observation by its sector so that we can compare the distribution of gross profit across industries directly.
We can see that:
1. Gross profit is generally low in the 'real estate' industry.
2. Gross profit in 'communication services' and 'financial services' is generally higher than in the other industries.
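A sector comparison like this can be backed up with a per-sector summary. The sketch below assumes a data frame with a `Sector` column and a `Gross.Profit` column; the column name `Sector` and all values are hypothetical.

```r
# Hypothetical firms with a sector label and a gross profit figure.
firms <- data.frame(
  Sector = c("Real Estate", "Real Estate",
             "Financial Services", "Financial Services",
             "Communication Services", "Communication Services"),
  Gross.Profit = c(1e7, 3e7, 4e8, 6e8, 5e8, 9e8)
)

# Median gross profit within each sector, sorted from low to high.
median_by_sector <- tapply(firms$Gross.Profit, firms$Sector, median)
print(sort(median_by_sector))
```

The median is used here rather than the mean because the earlier plots suggest heavy right skew.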
Since this dataset is huge, it naturally has a considerable number of missing values to cope with. We first deleted the two categorical variables, since they are not convenient to include in the PCA and heatmap analysis. Next we deleted two empty variables. We then replaced each missing value with the mean of its column.
We omit the handling code here.
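For readers who want to reproduce these steps, a minimal sketch could look like the following. It is not the original handling code: the data frame `raw` and its column names are made up, and base R is assumed.

```r
# Hypothetical raw data: two categorical columns, two numeric columns
# with gaps, and one column that is entirely missing.
raw <- data.frame(
  Sector     = c("Energy", "Technology", "Energy", "Healthcare"),
  Class      = c("a", "b", "a", "b"),
  Revenue    = c(10, NA, 30, 20),
  Net.Income = c(1, 2, NA, 3),
  Empty.1    = NA_real_,
  stringsAsFactors = FALSE
)

# 1. Drop the non-numeric (categorical) columns.
num <- raw[, sapply(raw, is.numeric), drop = FALSE]

# 2. Drop columns that contain no observed values at all.
num <- num[, colSums(!is.na(num)) > 0, drop = FALSE]

# 3. Replace each remaining NA with the mean of its column.
imputed <- as.data.frame(lapply(num, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))

print(imputed)
```

Mean imputation keeps every firm in the analysis, but it shrinks the variance of the imputed columns, which is worth keeping in mind when reading the PCA results.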
At this point, I should confess that the dataset is too large to analyze in full, and I was quite unfamiliar with some of the variables. Hence I chose some common variables that I had learned about in a financial accounting course. The variables to be analyzed are:
Revenue, Cost of Revenue, Gross Profit, SG&A Expense, Operating Expenses, Net Income, Consolidated Income, Cash and cash equivalents, Receivables, Total current assets, Total current liabilities, Retained earnings (deficit)
## [1] "The loading of PCA"
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Revenue 0.659 0.221 0.331
## Cost.of.Revenue 0.498 0.270 0.449 -0.155 -0.107
## Gross.Profit 0.161 -0.266 -0.230 0.180 0.395 0.113 0.155
## SG.A.Expense -0.155 0.247
## Operating.Expenses -0.232 -0.162 0.169 0.468
## Net.Income 0.197 -0.210 -0.344 0.535
## Consolidated.Income 0.197 -0.207 -0.344 0.536
## Cash.and.cash.equivalents -0.761 0.126 -0.600 -0.179
## Receivables -0.181 0.152 -0.772 -0.562
## Total.current.assets 0.420 -0.754 -0.298
## Total.current.liabilities 0.274 0.110 -0.158 -0.136 -0.282
## Retained.earnings..deficit. 0.158 -0.371 0.221 0.655 -0.403 0.370 -0.235
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Revenue 0.236 0.195 0.551
## Cost.of.Revenue 0.182 -0.225 -0.205 -0.553
## Gross.Profit -0.164 0.369 0.276 -0.620
## SG.A.Expense -0.817 0.478
## Operating.Expenses -0.232 -0.775
## Net.Income -0.707
## Consolidated.Income 0.707
## Cash.and.cash.equivalents
## Receivables -0.104 0.119
## Total.current.assets 0.381
## Total.current.liabilities -0.876 -0.124
## Retained.earnings..deficit.
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
## Cumulative Var 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667 0.750
## Comp.10 Comp.11 Comp.12
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083
## Cumulative Var 0.833 0.917 1.000
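We can see an interesting phenomenon here: the sums of squared loadings are all 1. This is expected rather than a special transformation of the variables: each column of the loadings matrix returned by princomp is an eigenvector normalized to unit length. As a result, the "Proportion Var" rows in this printout are always 1/12 ≈ 0.083 and do not tell us how much variance each component explains; that information comes from the component standard deviations reported by summary().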
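The unit sums of squared loadings can be reproduced on any dataset. The sketch below uses the built-in USArrests data as a stand-in for the financial variables, and `cor = TRUE` is an assumption made for the illustration, not necessarily the setting used in this report.

```r
# USArrests ships with base R; it stands in for the financial data here.
pca <- princomp(USArrests, cor = TRUE)

# Each column of the loadings matrix is a unit-length eigenvector,
# so the column sums of squares are all 1.
ss_loadings <- colSums(unclass(pca$loadings)^2)
print(round(ss_loadings, 3))

# The variance explained by each component comes from sdev instead.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

Comparing the two printed vectors shows that equal "SS loadings" say nothing about how unequal the components are in the variance they explain.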
Since this dataset contains roughly 3,800 observations, it is quite hard to display the whole set of firms at once. We delete the observations containing missing values and select a random sample of 50 firms to plot the heatmap.
The dataset was standardized before plotting this heatmap, so each column is centered at 0. However, we can still see departures from this in some variables. For example, "Cost.of.Revenue" contains more observations with high standard scores, suggesting that its distribution contains some extreme outliers.
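The sampling and standardization steps described above can be sketched as follows. The data frame `firms` is simulated and the seed is arbitrary, so this only illustrates the procedure and does not reproduce the heatmap in this report.

```r
set.seed(123)
# Hypothetical stand-in: 200 firms and 4 variables, with a few gaps.
firms <- data.frame(
  Revenue         = rlnorm(200, 19, 2),
  Cost.of.Revenue = rlnorm(200, 18, 2),
  Gross.Profit    = rnorm(200, 2e8, 5e8),
  Net.Income      = rnorm(200, 5e7, 2e8)
)
firms$Revenue[c(3, 50)] <- NA

complete <- na.omit(firms)                          # drop rows with missing values
sampled  <- complete[sample(nrow(complete), 50), ]  # random sample of 50 firms
z        <- scale(sampled)                          # column-wise standard scores

# scale = "none" because the columns are already standardized.
heatmap(z, scale = "none", Colv = NA, margins = c(10, 4))
```

After `scale()`, every column has mean 0 and standard deviation 1, so unusually dark or light cells point to firms that are extreme relative to the sample.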
## Revenue Revenue.Growth Cost.of.Revenue
## Min. :-6.276e+08 Min. : -1.77 Min. :-5.456e+08
## 1st Qu.: 5.789e+07 1st Qu.: 0.00 1st Qu.: 3.136e+06
## Median : 4.349e+08 Median : 0.06 Median : 1.414e+08
## Mean : 5.879e+09 Mean : 12.95 Mean : 3.701e+09
## 3rd Qu.: 2.394e+09 3rd Qu.: 0.19 3rd Qu.: 1.200e+09
## Max. : 1.825e+12 Max. :42138.66 Max. : 1.537e+12
## NA's :44 NA's :236 NA's :74
## Gross.Profit R.D.Expenses SG.A.Expense
## Min. :-1.105e+09 Min. :-1.500e+05 Min. :0.000e+00
## 1st Qu.: 3.093e+07 1st Qu.: 0.000e+00 1st Qu.:1.549e+07
## Median : 1.909e+08 Median : 0.000e+00 Median :7.382e+07
## Mean : 2.188e+09 Mean : 9.402e+07 Mean :9.307e+08
## 3rd Qu.: 8.923e+08 3rd Qu.: 9.911e+06 3rd Qu.:3.510e+08
## Max. : 4.622e+11 Max. : 1.154e+10 Max. :1.857e+11
## NA's :52 NA's :136 NA's :59
## Operating.Expenses Operating.Income Interest.Expense
## Min. :-1.088e+09 Min. :-6.786e+09 Min. :-2.250e+08
## 1st Qu.: 3.107e+07 1st Qu.:-1.308e+06 1st Qu.: 0.000e+00
## Median : 1.387e+08 Median : 4.104e+07 Median : 2.563e+06
## Mean : 1.438e+09 Mean : 6.748e+08 Mean : 1.002e+08
## 3rd Qu.: 5.885e+08 3rd Qu.: 2.713e+08 3rd Qu.: 4.300e+07
## Max. : 3.056e+11 Max. : 1.566e+11 Max. : 3.152e+10
## NA's :63 NA's :55 NA's :63
## Earnings.before.Tax Income.Tax.Expense Net.Income...Non.Controlling.int
## Min. :-8.878e+09 Min. :-2.081e+09 Min. :-1.587e+09
## 1st Qu.:-3.733e+06 1st Qu.: 0.000e+00 1st Qu.: 0.000e+00
## Median : 2.843e+07 Median : 5.335e+06 Median : 0.000e+00
## Mean : 5.726e+08 Mean : 1.763e+08 Mean : 1.558e+07
## 3rd Qu.: 2.194e+08 3rd Qu.: 5.753e+07 3rd Qu.: 0.000e+00
## Max. : 8.720e+10 Max. : 3.971e+10 Max. : 4.917e+09
## NA's :80 NA's :66 NA's :149
## Net.Income...Discontinued.ops Net.Income Preferred.Dividends
## Min. :-7.015e+09 Min. :-8.360e+09 Min. : -12628000
## 1st Qu.: 0.000e+00 1st Qu.:-3.693e+06 1st Qu.: 0
## Median : 0.000e+00 Median : 2.223e+07 Median : 0
## Mean :-3.454e+06 Mean : 4.894e+08 Mean : 5343177
## 3rd Qu.: 0.000e+00 3rd Qu.: 1.658e+08 3rd Qu.: 0
## Max. : 8.368e+09 Max. : 2.340e+11 Max. :2741588000
## NA's :149 NA's :23 NA's :149
## Net.Income.Com EPS
## Min. :-8.360e+09 Min. :-101870898
## 1st Qu.:-4.369e+06 1st Qu.: 0
## Median : 2.101e+07 Median : 1
## Mean : 4.839e+08 Mean : -26074
## 3rd Qu.: 1.631e+08 3rd Qu.: 2
## Max. : 2.340e+11 Max. : 8028004
## NA's :15 NA's :72