Why does the VaR method in the PerformanceAnalytics R package return the error "VaR calculation produces unreliable result"

25-02-10 17

This article discusses why the VaR method in the PerformanceAnalytics R package returns the error "VaR calculation produces unreliable result". It also covers the following related material: (Reposted) Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance; Upgrade testing with 11g SPA (SQL Performance Analyzer); 5 Ways to Use Log Data to Analyze System Performance--reference; and A small instance of visual analytics basing Spark(Python).

Contents:

Why does the VaR method in the PerformanceAnalytics R package return the error "VaR calculation produces unreliable result"
(Reposted) Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance
Upgrade testing with 11g SPA (SQL Performance Analyzer)
5 Ways to Use Log Data to Analyze System Performance--reference
A small instance of visual analytics basing Spark(Python)

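Why does the VaR method in the PerformanceAnalytics R package return the error "VaR calculation produces unreliable result"

How can I resolve this: why does the VaR method in the PerformanceAnalytics R package return the error "VaR calculation produces unreliable result"?

My R code: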

library(PerformanceAnalytics)

prices <- c(10.4,11,10.11,9.19,10.63,9.68,12.89,9.8,12.57,8.23,9.27,9.51,10.51,9.66,9.52,10.78,9.47,11.87,11.33,11.38,11.16,8.94)

returns <- diff(prices)

print(returns)

VaR(returns,p=.95,method="historical")
VaR(returns,method="gaussian")
VaR(returns,method="modified")

Output:

print(returns)
 [1]  0.60 -0.89 -0.92  1.44 -0.95  3.21 -3.09  2.77 -4.34  1.04  0.24  1.00 -0.85 -0.14  1.26
[16] -1.31  2.40 -0.54  0.05 -0.22 -2.22

VaR(returns,method="historical")
VaR calculation produces unreliable result (risk over 100%) for column: 1 : 3.09
    [,1]
VaR   -1

VaR(returns,method="gaussian")
VaR calculation produces unreliable result (risk over 100%) for column: 1 : 3.03916083148501
    [,1]
VaR   -1

VaR(returns,method="modified")
VaR calculation produces unreliable result (risk over 100%) for column: 1 : 3.1926697487747
    [,1]
VaR   -1

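The help page describes the first argument as "an xts, vector, matrix, data frame, timeSeries or zoo object of asset returns".

What is the problem? Where is the error?

Solution

diff(prices) yields absolute price changes, not returns. VaR() expects returns expressed as fractions, so a change of -3.09 is read as a loss of 309%, which triggers the "risk over 100%" message and the reported value of -1. Try computing log returns first, then compute VaR: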

returns <- diff(log(prices))
VaR(returns,p = 0.95,method = "historical")

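Also take a look at the ES (Expected Shortfall) function in the PerformanceAnalytics package. Personally, I think it is much better than VaR.

The calculation also works with percentage (simple) returns: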

returns <- rep(NA,length(prices) - 1)
for (i in 1:length(returns))
  returns[i] <- (prices[i+1]-prices[i])/prices[i]
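
The effect of the input scale can be reproduced outside R. The following Python sketch is an illustration only, not the PerformanceAnalytics implementation: it assumes that the historical method is the empirical (1 - p) quantile of the returns, computed with linear interpolation, and the helper names `quantile` and `historical_var` are hypothetical.

```python
import math

prices = [10.4, 11, 10.11, 9.19, 10.63, 9.68, 12.89, 9.8, 12.57, 8.23, 9.27,
          9.51, 10.51, 9.66, 9.52, 10.78, 9.47, 11.87, 11.33, 11.38, 11.16, 8.94]


def quantile(values, q):
    """Linear-interpolation quantile (assumed to match R's default, type 7)."""
    xs = sorted(values)
    h = (len(xs) - 1) * q
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def historical_var(returns, p=0.95):
    """Empirical (1 - p) quantile of the returns; losses come out negative."""
    return quantile(returns, 1 - p)


pairs = list(zip(prices[:-1], prices[1:]))
price_changes = [b - a for a, b in pairs]           # what diff(prices) produces
simple_returns = [(b - a) / a for a, b in pairs]    # percentage returns
log_returns = [math.log(b / a) for a, b in pairs]   # what diff(log(prices)) produces

var_changes = historical_var(price_changes)
var_simple = historical_var(simple_returns)
var_log = historical_var(log_returns)

for label, value in [("price changes", var_changes),
                     ("simple returns", var_simple),
                     ("log returns", var_log)]:
    # A value below -1 means a loss of more than 100%, which is not a usable VaR.
    status = "unreliable (loss over 100%)" if value < -1 else "ok"
    print(f"{label}: {value:.4f} {status}")
```

Only the series of raw price changes lands below -1; both return series stay well inside the range that can be read as a fractional loss.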

(Reposted) Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance

Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance

2018-12-19 13:02:45

This blog is copied from: https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/

Deep learning neural networks are nonlinear methods.

They offer increased flexibility and can scale in proportion to the amount of training data available. A downside of this flexibility is that they learn via a stochastic training algorithm which means that they are sensitive to the specifics of the training data and may find a different set of weights each time they are trained, which in turn produce different predictions.

Generally, this is referred to as neural networks having a high variance and it can be frustrating when trying to develop a final model to use for making predictions.

A successful approach to reducing the variance of neural network models is to train multiple models instead of a single model and to combine the predictions from these models. This is called ensemble learning and not only reduces the variance of predictions but also can result in predictions that are better than any single model.

In this post, you will discover methods for deep learning neural networks to reduce variance and improve prediction performance.

After reading this post, you will know:

- Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
- Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
- Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

Let’s get started.

Ensemble Methods to Reduce Variance and Improve Performance of Deep Learning Neural Networks
Photo by University of San Francisco’s Performing Arts, some rights reserved.

Overview

This tutorial is divided into four parts; they are:

1. High Variance of Neural Network Models
2. Reduce Variance Using an Ensemble of Models
3. How to Ensemble Neural Network Models
4. Summary of Ensemble Techniques

High Variance of Neural Network Models

Training deep neural networks can be very computationally expensive.

Very deep networks trained on millions of examples may take days, weeks, and sometimes months to train.

Google’s baseline model […] was a deep convolutional neural network […] that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.

— Distilling the Knowledge in a Neural Network, 2015.

After the investment of so much time and resources, there is no guarantee that the final model will have low generalization error, performing well on examples not seen during training.

… train many different candidate networks and then to select the best, […] and to discard the rest. There are two disadvantages with such an approach. First, all of the effort involved in training the remaining networks is wasted. Second, […] the network which had best performance on the validation set might not be the one with the best performance on new test data.

— Pages 364-365, Neural Networks for Pattern Recognition, 1995.

Neural network models are a nonlinear method. This means that they can learn complex nonlinear relationships in the data. A downside of this flexibility is that they are sensitive to initial conditions, both in terms of the initial random weights and in terms of the statistical noise in the training dataset.

This stochastic nature of the learning algorithm means that each time a neural network model is trained, it may learn a slightly (or dramatically) different version of the mapping function from inputs to outputs, that in turn will have different performance on the training and holdout datasets.

As such, we can think of a neural network as a method that has a low bias and high variance. Even when trained on large datasets to satisfy the high variance, having any variance in a final model that is intended to be used to make predictions can be frustrating.

Reduce Variance Using an Ensemble of Models

A solution to the high variance of neural networks is to train multiple models and combine their predictions.

The idea is to combine the predictions from multiple good but different models.

A good model has skill, meaning that its predictions are better than random chance. Importantly, the models must be good in different ways; they must make different prediction errors.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.

— Page 256, Deep Learning, 2016.

Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model. The results are predictions that are less sensitive to the specifics of the training data, choice of training scheme, and the serendipity of a single training run.

In addition to reducing the variance in the prediction, the ensemble can also result in better predictions than any single best model.

… the performance of a committee can be better than the performance of the best single network used in isolation.

— Page 365, Neural Networks for Pattern Recognition, 1995.

This approach belongs to a general class of methods called “ensemble learning” that describes methods that attempt to make the best use of the predictions from multiple models prepared for the same problem.

Generally, ensemble learning involves training more than one network on the same dataset, then using each of the trained models to make a prediction before combining the predictions in some way to make a final outcome or prediction.

In fact, ensembling of models is a standard approach in applied machine learning to ensure that the most stable and best possible prediction is made.

For example, Alex Krizhevsky, et al. in their famous 2012 paper titled “Imagenet classification with deep convolutional neural networks” that introduced very deep convolutional neural networks for photo classification (i.e. AlexNet) used model averaging across multiple well-performing CNN models to achieve state-of-the-art results at the time. Performance of one model was compared to ensemble predictions averaged over two, five, and seven different models.

Averaging the predictions of five similar CNNs gives an error rate of 16.4%. […] Averaging the predictions of two CNNs that were pre-trained […] with the aforementioned five CNNs gives an error rate of 15.3%.

Ensembling is also the approach used by winners in machine learning competitions.

Another powerful technique for obtaining the best possible results on a task is model ensembling. […] If you look at machine-learning competitions, in particular on Kaggle, you’ll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.

— Page 264, Deep Learning With Python, 2017.

How to Ensemble Neural Network Models

Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a “committee of networks.”

A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction and the actual prediction is calculated as the average of the predictions.
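
The variance argument behind a committee can be illustrated with a small simulation. This is a sketch under strong assumptions: each "network" is replaced by a hypothetical stand-in (`member_prediction`) that returns the true value plus independent Gaussian noise, so no real training takes place. Real ensemble members have correlated errors, so the reduction in practice is smaller than what this toy shows.

```python
import random
import statistics


def member_prediction(true_value, rng, noise_sd=1.0):
    """Hypothetical stand-in for one trained network: truth plus run-specific noise."""
    return true_value + rng.gauss(0.0, noise_sd)


def committee_prediction(true_value, rng, n_members):
    """Average the predictions of n_members independently 'trained' members."""
    predictions = [member_prediction(true_value, rng) for _ in range(n_members)]
    return statistics.fmean(predictions)


rng = random.Random(0)
true_value = 3.0
n_trials = 2000

single = [member_prediction(true_value, rng) for _ in range(n_trials)]
committee = [committee_prediction(true_value, rng, n_members=5) for _ in range(n_trials)]

var_single = statistics.pvariance(single)
var_committee = statistics.pvariance(committee)

print(f"variance of a single member:     {var_single:.3f}")
print(f"variance of a 5-member committee: {var_committee:.3f}")
```

With independent errors, the variance of the average falls roughly in proportion to the number of members, which is also why adding members gives diminishing returns.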

The number of models in the ensemble is often kept small both because of the computational expense in training models and because of the diminishing returns in performance from adding more ensemble members. Ensembles may be as small as three, five, or 10 trained models.

The field of ensemble learning is well studied and there are many variations on this simple theme.

It can be helpful to think of varying each of the three major elements of the ensemble method; for example:

- Training Data: Vary the choice of data used to train each model in the ensemble.
- Ensemble Models: Vary the choice of the models used in the ensemble.
- Combinations: Vary the choice of the way that outcomes from ensemble members are combined.

Let’s take a closer look at each element in turn.

Varying Training Data

The data used to train each member of the ensemble can be varied.

The simplest approach would be to use k-fold cross-validation to estimate the generalization error of the chosen model configuration. In this procedure, k different models are trained on k different subsets of the training data. These k models can then be saved and used as members of an ensemble.

Another popular approach involves resampling the training dataset with replacement, then training a network using the resampled dataset. The resampling procedure means that the composition of each training dataset is different with the possibility of duplicated examples allowing the model trained on the dataset to have a slightly different expectation of the density of the samples, and in turn different generalization error.

This approach is called bootstrap aggregation, or bagging for short, and was designed for use with unpruned decision trees that have high variance and low bias. Typically a large number of decision trees are used, such as hundreds or thousands, given that they are fast to prepare.

… a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. […] Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

— Pages 316-317, An Introduction to Statistical Learning with Applications in R, 2013.
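
The resampling step of bagging can be sketched in a few lines. In this minimal illustration the "model" is a deliberately trivial, hypothetical one (`fit_mean_model` just predicts the mean target of its sample); in practice each bootstrap sample would be used to train a neural network.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples with replacement, so duplicates are possible."""
    return [rng.choice(dataset) for _ in dataset]


def fit_mean_model(sample):
    """Toy stand-in for training: the 'model' predicts the mean target of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


def bagged_prediction(models, x):
    """Combine the ensemble members by averaging their predictions."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


rng = random.Random(42)
dataset = [(x, 2.0 * x + 1.0) for x in range(10)]  # hypothetical (input, target) pairs

samples = [bootstrap_sample(dataset, rng) for _ in range(25)]
models = [fit_mean_model(sample) for sample in samples]
prediction = bagged_prediction(models, x=4)

print(f"distinct examples in first sample: {len(set(samples[0]))} of {len(dataset)}")
print(f"bagged prediction: {prediction:.3f}")
```

Each resampled dataset has the same size as the original but a different composition, which is what gives every member a slightly different view of the data.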

An equivalent approach might be to use a smaller subset of the training dataset without regularization to allow faster training and some overfitting.

The desire for slightly under-optimized models applies to the selection of ensemble members more generally.

… the members of the committee should not individually be chosen to have optimal trade-off between bias and variance, but should have relatively smaller bias, since the extra variance can be removed by averaging.

— Page 366, Neural Networks for Pattern Recognition, 1995.

Other approaches may involve selecting a random subspace of the input space to allocate to each model, such as a subset of the hyper-volume in the input space or a subset of input features.

Varying Models

Training the same under-constrained model on the same data with different initial conditions will result in different models given the difficulty of the problem, and the stochastic nature of the learning algorithm.

This is because the optimization problem that the network is trying to solve is so challenging that there are many “good” and “different” solutions to map inputs to outputs.

Most neural network algorithms achieve sub-optimal performance specifically due to the existence of an overwhelming number of sub-optimal local minima. If we take a set of neural networks which have converged to local minima and apply averaging we can construct an improved estimate. One way to understand this fact is to consider that, in general, networks which have fallen into different local minima will perform poorly in different regions of feature space and thus their error terms will not be strongly correlated.

— When networks disagree: Ensemble methods for hybrid neural networks, 1995.

This may result in a reduced variance, but may not dramatically improve generalization error. The errors made by the models may still be too highly correlated because the models all have learned similar mapping functions.

An alternative approach might be to vary the configuration of each ensemble model, such as using networks with different capacity (e.g. number of layers or nodes) or models trained under different conditions (e.g. learning rate or regularization).

The result may be an ensemble of models that have learned a more heterogeneous collection of mapping functions and in turn have a lower correlation in their predictions and prediction errors.

Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

— Pages 257-258, Deep Learning, 2016.

Such an ensemble of differently configured models can be achieved through the normal process of developing the network and tuning its hyperparameters. Each model could be saved during this process and a subset of better models chosen to comprise the ensemble.

Slightly inferiorly trained networks are a free by-product of most tuning algorithms; it is desirable to use such extra copies even when their performance is significantly worse than the best performance found. Better performance yet can be achieved through careful planning for an ensemble classification by using the best available parameters and training different copies on different subsets of the available database.

— Neural Network Ensembles, 1990.

In cases where a single model may take weeks or months to train, another alternative may be to periodically save the best model during the training process, called snapshot or checkpoint models, then select ensemble members among the saved models. This provides the benefits of having multiple models trained on the same data, although collected during a single training run.

Snapshot Ensembling produces an ensemble of accurate and diverse models from a single training process. At the heart of Snapshot Ensembling is an optimization process which visits several local minima before converging to a final solution. We take model snapshots at these various minima, and average their predictions at test time.

— Snapshot Ensembles: Train 1, get M for free, 2017.

A variation on the Snapshot ensemble is to save models from a range of epochs, perhaps identified by reviewing learning curves of model performance on the train and validation datasets during training. Ensembles from such contiguous sequences of models are referred to as horizontal ensembles.

First, networks trained for a relatively stable range of epoch are selected. The predictions of the probability of each label are produced by standard classifiers [over] the selected epoch[s], and then averaged.

— Horizontal and vertical ensemble with deep representation for classification, 2013.

A further enhancement of the snapshot ensemble is to systematically vary the optimization procedure during training to force different solutions (i.e. sets of weights), the best of which can be saved to checkpoints. This might involve injecting an oscillating amount of noise over training epochs or oscillating the learning rate during training epochs. A variation of this approach called Stochastic Gradient Descent with Warm Restarts (SGDR) demonstrated faster learning and state-of-the-art results for standard photo classification tasks.

Our SGDR simulates warm restarts by scheduling the learning rate to achieve competitive results […] roughly two to four times faster. We also achieved new state-of-the-art results with SGDR, mainly by using even wider [models] and ensembles of snapshots from SGDR’s trajectory.

— SGDR: Stochastic Gradient Descent with Warm Restarts, 2016.
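
The oscillating learning rate can be made concrete with the cosine schedule used for warm restarts. The sketch below assumes a fixed cycle length (the SGDR paper also allows cycles that grow over time) and uses hypothetical values for `lr_max`, `lr_min`, and `cycle_length`. Snapshots are taken at the last epoch of each cycle, where the learning rate is lowest.

```python
import math


def sgdr_learning_rate(epoch, lr_max=0.1, lr_min=0.0, cycle_length=10):
    """Cosine-annealed learning rate that restarts at lr_max every cycle_length epochs."""
    t_cur = epoch % cycle_length  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * t_cur / cycle_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


cycle_length = 10
n_epochs = 30

schedule = [sgdr_learning_rate(e, cycle_length=cycle_length) for e in range(n_epochs)]

# Save a snapshot at the end of each cycle, just before the rate jumps back up.
snapshot_epochs = [e for e in range(n_epochs) if (e + 1) % cycle_length == 0]

for epoch in snapshot_epochs:
    print(f"snapshot after epoch {epoch}: lr = {schedule[epoch]:.5f}")
```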

A benefit of very deep neural networks is that the intermediate hidden layers provide a learned representation of the low-resolution input data. The hidden layers can output their internal representations directly, and the output from one or more hidden layers from one very deep network can be used as input to a new classification model. This is perhaps most effective when the deep model is trained using an autoencoder model. This type of ensemble is referred to as a vertical ensemble.

This method ensembles a series of classifiers whose inputs are the representation of intermediate layers. A lower error rate is expected because these features seem diverse.

— Horizontal and vertical ensemble with deep representation for classification, 2013.

Varying Combinations

The simplest way to combine the predictions is to calculate the average of the predictions from the ensemble members.

This can be improved slightly by weighting the predictions from each model, where the weights are optimized using a hold-out validation dataset. This provides a weighted average ensemble that is sometimes called model blending.

… we might expect that some members of the committee will typically make better predictions than other members. We would therefore expect to be able to reduce the error still further if we give greater weight to some committee members than to others. Thus, we consider a generalized committee prediction given by a weighted combination of the predictions of the members …

— Page 367, Neural Networks for Pattern Recognition, 1995.
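
A weighted average ensemble can be sketched with a coarse grid search over the weights, scored on a hold-out set. The member predictions and targets below are made-up illustrative numbers, and the grid search is one simple choice among many ways to fit the weights.

```python
import itertools


def weighted_average(predictions, weights):
    """predictions holds one list per member; returns the per-example weighted mean."""
    return [sum(w * p for w, p in zip(weights, column)) for column in zip(*predictions)]


def mse(predicted, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


def grid_search_weights(predictions, targets, steps=10):
    """Try every weight vector on a grid that sums to 1 and keep the best one."""
    best_weights, best_error = None, float("inf")
    for combo in itertools.product(range(steps + 1), repeat=len(predictions)):
        if sum(combo) != steps:
            continue
        weights = [c / steps for c in combo]
        error = mse(weighted_average(predictions, weights), targets)
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights, best_error


# Hypothetical hold-out targets and the predictions of three ensemble members.
val_targets = [1.0, 2.0, 3.0, 4.0]
member_predictions = [
    [1.1, 2.1, 2.9, 4.2],
    [0.5, 2.6, 3.5, 3.4],
    [1.4, 1.7, 3.3, 3.9],
]

best_weights, best_error = grid_search_weights(member_predictions, val_targets)
single_errors = [mse(p, val_targets) for p in member_predictions]

print("best weights:", best_weights)
print(f"ensemble MSE: {best_error:.4f}, best single MSE: {min(single_errors):.4f}")
```

Because the grid includes the corner cases that put all weight on one member, the selected combination is never worse on the hold-out set than the best single member.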

One further step in complexity involves using a new model to learn how to best combine the predictions from each ensemble member.

The model could be a simple linear model (e.g. much like the weighted average), but could be a sophisticated nonlinear method that also considers the specific input sample in addition to the predictions provided by each member. This general approach of learning a new model is called model stacking, or stacked generalization.

Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. […] When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question.

— Stacked generalization, 1992.

There are more sophisticated methods for stacking models, such as boosting where ensemble members are added one at a time in order to correct the mistakes of prior models. The added complexity means this approach is less often used with large neural network models.

Another combination that is a little bit different is to combine the weights of multiple neural networks with the same structure. The weights of multiple networks can be averaged, to hopefully result in a new single model that has better overall performance than any original model. This approach is called model weight averaging.

… suggests it is promising to average these points in weight space, and use a network with these averaged weights, instead of forming an ensemble by averaging the outputs of networks in model space

— Averaging Weights Leads to Wider Optima and Better Generalization, 2018.
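
Averaging in weight space, as opposed to averaging outputs, can be sketched as an element-wise mean over networks that share the same structure. The representation here (a dict mapping a layer name to a flat list of floats) is an assumption made to keep the example dependency-free; a deep learning framework would store weights as tensors.

```python
def average_weights(weight_sets):
    """Element-wise mean of several weight collections with identical structure."""
    averaged = {}
    for name in weight_sets[0]:
        columns = zip(*(weights[name] for weights in weight_sets))
        averaged[name] = [sum(column) / len(weight_sets) for column in columns]
    return averaged


# Hypothetical weights from two checkpoints of the same architecture.
checkpoint_a = {"dense_1": [1.0, 2.0], "dense_2": [0.0]}
checkpoint_b = {"dense_1": [3.0, 4.0], "dense_2": [1.0]}

averaged = average_weights([checkpoint_a, checkpoint_b])
print(averaged)
```

The result is a single set of weights, so prediction costs the same as for one model, unlike an ensemble that has to evaluate every member.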

Summary of Ensemble Techniques

In summary, we can list some of the more common and interesting ensemble methods for neural networks organized by each element of the method that can be varied, as follows:

Varying Training Data
- k-fold Cross-Validation Ensemble
- Bootstrap Aggregation (bagging) Ensemble
- Random Training Subset Ensemble
Varying Models
- Multiple Training Run Ensemble
- Hyperparameter Tuning Ensemble
- Snapshot Ensemble
- Horizontal Epochs Ensemble
- Vertical Representational Ensemble
Varying Combinations
- Model Averaging Ensemble
- Weighted Average Ensemble
- Stacked Generalization (stacking) Ensemble
- Boosting Ensemble
- Model Weight Averaging Ensemble

There is no single best ensemble method; perhaps experiment with a few approaches or let the constraints of your project guide you.

Summary

In this post, you discovered ensemble methods for deep learning neural networks to reduce variance and improve prediction performance.

Specifically, you learned:

- Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
- Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
- Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

Upgrade testing with 11g SPA (SQL Performance Analyzer)

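SPA (SQL Performance Analyzer), a new feature in 11g, is now widely used in upgrade and migration scenarios. It can also be used in other situations, such as parameter changes or changes to the I/O subsystem, but its main purpose is to detect SQL statements whose performance regresses after an upgrade, so that regressions do not make the system unusable afterwards.

The main implementation steps for SPA are as follows:

1. Capture the SQL workload on the production system and generate a SQL Tuning Set;
2. Create a staging table, pack the SQL Tuning Set into it, export the staging table and transfer it to the test database;
3. Import the staging table and unpack its data into a SQL Tuning Set;
4. Create the SPA task: first generate the 10g trial, then generate the 11g trial on 11g;
5. Run the comparison tasks, then generate the SPA reports;
6. Analyze the SQL statements whose performance has regressed.

Before using SPA, be sure to read the document Using Real Application Testing Functionality in Earlier Releases (Doc ID 560977.1), in particular Table 3: SQL Performance Analyzer Availability Information. This table shows which patches must be installed to run SPA from a given source version to a given target version.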

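1. Capture the SQL workload on the production system and generate a SQL Tuning Set

This step is not complicated; I described the capture process in an earlier article. There are several capture sources, mainly: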

cursor cache
awr snapshots
awr baseline
another sql set
10046 trace file(11g+)

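We usually capture from the cursor cache and from the AWR repository. Cursor-cache capture collects the widest range of SQL statements. To capture as many statements as possible, the capture has to run over a long period, several times a day; in one production environment we captured four times a day. The AWR repository, in turn, gives us the top SQL statements; in our production project we captured one month of AWR data. Together, these two sets form a fairly complete list of the SQL in the system.

[Note] Literal SQL can make the SQL tuning set very large during capture. Because the underlying tables contain CLOB columns, an oversized set makes the conversion to the staging table very slow, and the conversion can fail partway through with ORA-01555 if UNDO is too small. To avoid this, I recommend filtering during capture. For details, see my document: SPA Cursor Capture: Removing Duplicates.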

-------------- Create the spa user and grant privileges
SQL> create user spa identified by spa default tablespace spa;
User created.
SQL> grant connect, resource to spa;
Grant succeeded.
SQL> grant ADMINISTER SQL TUNING SET to spa;
Grant succeeded.
SQL> grant execute on dbms_sqltune to spa;
Grant succeeded.
SQL> grant select any dictionary to spa;
Grant succeeded.
------------- Create the SQL tuning set
SQL> exec dbms_sqltune.create_sqlset('sql_test');
PL/SQL procedure successfully completed.
SQL> select name,OWNER,CREATED,STATEMENT_COUNT from dba_sqlset;
NAME                           OWNER                          CREATED      STATEMENT_COUNT
------------------------------ ------------------------------ ------------ ---------------
sql_test                       SPA                            18-APR-14                0
-------------- Capture SQL from the cursor cache
DECLARE
  mycur DBMS_SQLTUNE.SQLSET_CURSOR;
BEGIN
  OPEN mycur FOR
    SELECT value(P)
      FROM TABLE(dbms_sqltune.select_cursor_cache('parsing_schema_name in (''ORAADMIN'')',
                                                  NULL,
                                                  NULL,
                                                  NULL,
                                                  NULL,
                                                  1,
                                                  NULL,
                                                  'ALL')) p;
  dbms_sqltune.load_sqlset(sqlset_name     => 'sql_test',
                           populate_cursor => mycur,
                           load_option     => 'MERGE');
  CLOSE mycur;
END;
/

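For more on capture, see: How to Load Queries into a SQL Tuning Set (STS) (Doc ID 1271343.1)

2. Create a staging table, pack the SQL Tuning Set into it, export the staging table and transfer it to the test database

This step is fairly simple, but note one thing: if you have a large number of cursors, the conversion can easily hit ORA-01555. It is best to set undo retention to a larger value.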

------------- Do not create the staging table as the sys user
BEGIN
DBMS_SQLTUNE.create_stgtab_sqlset(table_name => 'SQLSET_TAB',
schema_name => 'SPA',
tablespace_name => 'SYSAUX');
END;
/
------------- Pack the tuning set into the staging table
BEGIN
DBMS_SQLTUNE.pack_stgtab_sqlset(sqlset_name => 'sql_test',
sqlset_owner => 'SPA',
staging_table_name => 'SQLSET_TAB',
staging_schema_owner => 'SPA');
END;
/

After converting to the staging table, we can remove duplicates once more. You can also delete unneeded cursors by module.

delete from SPA.SQLSET_TAB a where rowid !=(select max(rowid) from SPA.SQLSET_TAB b where a.FORCE_MATCHING_SIGNATURE=b.FORCE_MATCHING_SIGNATURE and a.FORCE_MATCHING_SIGNATURE<>0);
delete from SPA.SQLSET_TAB where MODULE='PL/SQL Developer';

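3. Import the staging table and unpack its data into a SQL Tuning Set

In this step we move the exported staging table data to the test platform, import it, and convert it back into a SQL Tuning Set on 11g.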

------------- Import the data into the test system
export NLS_LANG=American_America.zhs16gbk
imp spa/spa fromuser=spa touser=spa file=/home/oracle/spa/SQLSET_TAB.dmp feedback=100
------------- Create the SQL tuning set
SQL> connect spa/spa
Connected.
SQL> exec DBMS_SQLTUNE.create_sqlset(sqlset_name => 'sql_test');
PL/SQL procedure successfully completed.
------------- Unpack the staging table into the SQL tuning set
SQL> BEGIN
  2  DBMS_SQLTUNE.unpack_stgtab_sqlset(sqlset_name => 'sql_test',
  3  sqlset_owner => 'SPA',
  4  replace => TRUE,
  5  staging_table_name => 'SQLSET_TAB',
  6  staging_schema_owner => 'SPA');
  7  END;
  8  /
PL/SQL procedure successfully completed.

If the name or owner of the SQL tuning set differs between the source and the target, use the remap_stgtab_sqlset procedure to remap the name and owner.

exec dbms_sqltune.remap_stgtab_sqlset(old_sqlset_name => 'sql_test_aaa', old_sqlset_owner => 'aaa', new_sqlset_name => 'sql_test', new_sqlset_owner => 'SPA', staging_table_name => 'SQLSET_TAB', staging_schema_owner => 'SPA');

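For importing and exporting a SQL tuning set, see: How to Move a SQL Tuning Set from One Database to Another (Doc ID 751068.1)

4. Create the SPA task: first generate the 10g trial, then generate the 11g trial on 11g

Pay attention to one point in this step: first check whether the test database has any database links, and drop them if it does, so that the test run does not connect to other databases and perform unwanted operations. In addition, generating the 11g trial on 11g can be slow, so it is best to put it in a script and run it in the background.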

------------- Create the SPA task
var tname varchar2(30);
var sname varchar2(30);
exec :sname := 'sql_test';
exec :tname := 'SPA_TEST';
exec :tname := DBMS_SQLPA.CREATE_ANALYSIS_TASK(sqlset_name => :sname, task_name => :tname);
------------- Generate the 10g trial
begin
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
task_name => 'SPA_TEST',
execution_type => 'CONVERT SQLSET',
execution_name => 'CONVERT_10G');
end;
/
------------- Flush the shared pool and the buffer cache
alter system flush shared_pool;
alter system flush BUFFER_CACHE;
------------- Generate the 11g trial
begin
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
task_name => 'SPA_TEST',
execution_type => 'TEST EXECUTE',
execution_name => 'EXEC_11G');
end;
/

5. Run the comparison tasks, then generate the SPA reports

We can compare along three dimensions: elapsed time, CPU_TIME, and BUFFER_GETS.

------------- Compare by elapsed_time
begin
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
task_name => 'SPA_TEST',
execution_type => 'COMPARE PERFORMANCE',
execution_name => 'Compare_elapsed_time',
execution_params => dbms_advisor.arglist('execution_name1', 'CONVERT_10G', 'execution_name2', 'EXEC_11G', 'comparison_metric', 'elapsed_time') );
end;
/
------------- Compare by cpu_time
begin
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
task_name => 'SPA_TEST',
execution_type => 'COMPARE PERFORMANCE',
execution_name => 'Compare_CPU_time',
execution_params => dbms_advisor.arglist('execution_name1', 'CONVERT_10G', 'execution_name2', 'EXEC_11G', 'comparison_metric', 'CPU_TIME') );
end;
/
------------- Compare by buffer_gets
begin
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
task_name => 'SPA_TEST',
execution_type => 'COMPARE PERFORMANCE',
execution_name => 'Compare_BUFFER_GETS_time',
execution_params => dbms_advisor.arglist('execution_name1', 'CONVERT_10G', 'execution_name2', 'EXEC_11G', 'comparison_metric', 'BUFFER_GETS') );
end;
/
------------- Generate the SPA reports
set trimspool on
set trim on
set pages 0
set long 999999999
set linesize 1000
spool spa_report_elapsed_time.html
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'ALL', 'ALL', top_sql=>300, execution_name=>'Compare_elapsed_time') FROM dual;
spool off;
spool spa_report_CPU_time.html
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'ALL', 'ALL', top_sql=>300, execution_name=>'Compare_CPU_time') FROM dual;
spool off;
spool spa_report_buffer_time.html
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'ALL', 'ALL', top_sql=>300, execution_name=>'Compare_BUFFER_GETS_time') FROM dual;
spool off;
spool spa_report_errors.html
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'errors', 'summary') FROM dual;
spool off;
spool spa_report_unsupport.html
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'unsupported', 'all') FROM dual;
spool off;

6. Analyze the SQL statements whose performance has regressed

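Five reports are generated in total, and each needs to be analyzed. The ELAPSED_TIME, CPU_TIME and BUFFER_GETS reports show the SQL statements whose performance has regressed. Some statements show higher CPU time, some show higher buffer gets, and some are worse on all three metrics; all of these need checking. A regression may come with a changed execution plan or with an unchanged one. Finding out why a plan changed requires a solid understanding of SQL tuning, the optimizer, and statistics.

The other two reports list statements that raised errors or are unsupported. These are still worth a look. Typically some fail because the data differs between environments, producing errors such as invalid ROWID. They do not need much attention, because not every statement can be analyzed precisely, and some INSERT statements are unsupported; it is enough to analyze the problems in the majority of statements.

References:

How to Load Queries into a SQL Tuning Set (STS) (Doc ID 1271343.1)

How to Move a SQL Tuning Set from One Database to Another (Doc ID 751068.1)

Oracle® Database Real Application Testing User’s Guide 11g Release 2 (11.2)

Original source: Upgrade testing with 11g SPA (SQL Performance Analyzer). Thanks to the original author for sharing.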

5 Ways to Use Log Data to Analyze System Performance--reference

Recently we looked across some of the most common behaviors that our community of 25,000 users looked for in their logs, with a particular focus on web server logs. Our research identified the top 15 web server tags and alerts created by our customers. You can read more about these in our section, and you can easily create tags or alerts based on the patterns to identify these behaviors in your own systems.

This week we are focusing on performance analysis using log data. Again we looked across our community of over 25,000 users and identified 5 ways in which people use log data to analyze system performance. As always, customer data was anonymized and privacy protected. Over the course of the next week we will dive into each of these areas in more detail and feature customers' first-hand accounts of how they use logs to identify and resolve such issues in their systems.

Our research looked at more than 200k patterns that users across our community created to identify important events in their log data. With a particular focus on performance-related issues, we identified the following 5 areas as trending and common across our user base:

1. Slow Response Times: Response times are one of the most common and useful performance measures available from your log data. They give you an immediate understanding of how long a request takes to be returned. For example, web server logs can show how long a request takes to return a response to a client device. This can include the time taken by the different components behind your web server (application servers, DBs) to process the request, so it gives an immediate view of how well your application is performing. Recording response times from the client device or browser gives an even more complete picture, since it also captures page load time in the app or browser as well as network latency.

A good rule of thumb when measuring response times is the set of three limits outlined by Jakob Nielsen in his 1993 book 'Usability Engineering', which is still relevant today. In short, 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, and 10 seconds is about the limit for keeping the user's attention focused on the dialogue.
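
As a rough illustration, the three limits can be turned into a small classifier. This is only a sketch: the function name `classify_response_time` and the bucket labels are made up for this example and do not come from any logging product.

```python
def classify_response_time(seconds):
    """Map a response time in seconds onto Nielsen's three limits."""
    if seconds <= 0.1:
        return "instantaneous"   # the system appears to react at once
    if seconds <= 1.0:
        return "uninterrupted"   # the user's flow of thought stays intact
    if seconds <= 10.0:
        return "attention held"  # the user is still focused on the dialogue
    return "attention lost"      # beyond the 10 second limit


for t in (0.05, 0.8, 4.0, 12.0):
    print(t, classify_response_time(t))
```

Bucketing logged response times this way makes it easier to see what share of requests falls past each limit, rather than looking only at an average.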

Slow response time patterns almost always take the form below:

response_time>X

Here response_time is the field value representing the server's or client's response time and 'X' is a threshold. When the threshold is exceeded, you want the event to be highlighted or a notification to be sent, so that you and your team know that somebody is having a poor user experience.
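
The threshold check can be sketched in a few lines of Python. The log layout here (space-separated key=value pairs with a `response_time` field in seconds) and the sample lines are assumptions made for illustration; adapt the regular expression to whatever your server actually writes.

```python
import re

# Assumed log format: "key=value key=value ... response_time=<seconds>"
FIELD_RE = re.compile(r"response_time=(\d+(?:\.\d+)?)")


def slow_requests(lines, threshold):
    """Return (seconds, line) pairs whose response_time exceeds threshold."""
    hits = []
    for line in lines:
        match = FIELD_RE.search(line)
        if match and float(match.group(1)) > threshold:
            hits.append((float(match.group(1)), line))
    return hits


# Made-up sample lines
sample = [
    "path=/home status=200 response_time=0.12",
    "path=/search status=200 response_time=3.40",
    "path=/login status=500 response_time=11.02",
]

for seconds, line in slow_requests(sample, threshold=1.0):
    print(seconds, line)
```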

2. Memory Issues and Garbage Collection: Out-of-memory errors can be catastrophic, as they often crash the application for lack of resources. You want to know about them when they occur, so creating tags and generating notifications via alerts for these events is always recommended.

Garbage collection behavior can be a leading indicator of out-of-memory issues. Tracking it, and getting notified when used heap versus free heap space crosses a particular threshold or when garbage collection takes a long time, can be particularly useful and often points you toward memory leaks. Identifying a memory leak before an out-of-memory exception occurs can be the difference between a major system outage and a simple server restart until the issue is patched.

Furthermore, slow or long garbage collection can be one of the reasons users experience slow application behavior: during garbage collection your system can slow down, or in some situations block until garbage collection is complete (e.g. with 'stop the world' garbage collection).

Below are some examples of common patterns used to identify some of the memory related issues outlined above:

Out of memory
exceeds memory limit
memory leak detected
java.lang.OutOfMemoryError
System.OutOfMemoryException
memwatch:leak: Ended heapDiff
GC AND stats
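
A minimal sketch of how the plain-text patterns above could be applied to a stream of log lines. Only the literal patterns are covered; `GC AND stats` is a query expression rather than a substring, so it is left out. The sample log lines are invented for the example.

```python
import re

MEMORY_PATTERNS = [
    "Out of memory",
    "exceeds memory limit",
    "memory leak detected",
    "java.lang.OutOfMemoryError",
    "System.OutOfMemoryException",
]

# re.escape keeps the dots in the class names literal.
MEMORY_RE = re.compile(
    "|".join(re.escape(p) for p in MEMORY_PATTERNS), re.IGNORECASE
)


def memory_events(lines):
    """Return the log lines that match any memory-related pattern."""
    return [line for line in lines if MEMORY_RE.search(line)]


# Made-up sample lines
sample = [
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "INFO request served in 120 ms",
    "WARN worker killed: out of memory",
]

for line in memory_events(sample):
    print(line)
```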

3. Deadlocks and Threading Issues

Deadlocks come in many shapes and sizes and can have serious effects, anything from bringing your system to a complete halt to simply slowing it down. In short, a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does. For example, a set of threads is deadlocked when each thread is waiting for an event that only another thread in the set can cause.
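
The definition can be made concrete with a wait-for graph: if following "is waiting on" links from some thread leads back to a thread already visited, the threads on that loop are deadlocked. The sketch below assumes each blocked thread waits on exactly one other thread; the function name and the thread labels are hypothetical.

```python
def find_deadlock(wait_for):
    """Return the threads forming a cycle in the wait-for graph, or None.

    wait_for maps each blocked thread to the single thread it waits on.
    """
    for start in wait_for:
        seen = []
        current = start
        # Follow the chain until it ends or revisits a thread.
        while current in wait_for and current not in seen:
            seen.append(current)
            current = wait_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return None


print(find_deadlock({"A": "B", "B": "A"}))  # A and B wait on each other
print(find_deadlock({"A": "B", "B": "C"}))  # C is not blocked, so no cycle
```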

Not surprisingly, deadlocks feature among the top 5 performance-related issues that our users write patterns to detect in their systems.

Most deadlock patterns simply contain the keyword 'deadlock', but some of the common patterns have the following structure:

‘deadlock’
‘Deadlock found when trying to get lock’
‘Unexpected error while processing request: deadlock;’

4. High Resource Usage (CPU/Disk/Network)

In many cases a slowdown in system performance is not the result of any major software flaw. It can simply be that the load on your system has increased without additional resources being available to deal with it. Tracking resource usage lets you see when you need additional capacity, so that you can, for example, start more server instances.

Example patterns used when analyzing resource usage:

metric=/CPUUtilization/ AND minimum>X
cpu>X
disk>X
disk is at or near capacity
not enough space on the disk
java.io.IOException: No space left on device
insufficient bandwidth

5. Database Issues and Slow Queries

Knowing when a query failed is useful because it lets you identify situations in which a request may have returned without the relevant data, and so helps you see when users are not getting the data they need. A more subtle issue arises when a user gets the correct results but they take a long time to return. Technically the system may be fine and bug free, yet a slow user experience may be hurting your top line.

Tracking slow queries lets you see how your DB queries are performing. Setting acceptable thresholds for query time and reporting on anything that exceeds them helps you quickly identify when your users' experience is being affected.

Example patterns:

SqlException
SQL Timeout
Long query
Slow query
WARNING: Query took longer than X
Query_time > X
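
One way to report on slow queries, sketched under stated assumptions: query durations have already been extracted from the logs as (query, seconds) pairs, and a query counts as slow when its mean duration exceeds the threshold. Both the input shape and the choice of the mean are illustrative choices, not something the patterns above prescribe.

```python
from collections import defaultdict


def slow_query_report(samples, threshold):
    """Return {query: (count, mean_seconds)} for queries whose mean exceeds threshold."""
    totals = defaultdict(lambda: [0, 0.0])  # query -> [count, total seconds]
    for query, seconds in samples:
        totals[query][0] += 1
        totals[query][1] += seconds
    return {
        query: (count, total / count)
        for query, (count, total) in totals.items()
        if total / count > threshold
    }


# Made-up measurements
samples = [
    ("SELECT * FROM orders", 2.0),
    ("SELECT * FROM orders", 4.0),
    ("SELECT 1", 0.01),
]

print(slow_query_report(samples, threshold=1.0))
```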

As always, let us know if you think we have left out any important issues that you like to track in your logs. To start tracking your own system performance, include the patterns listed above to automatically create tags and alerts relevant to your system.

A small instance of visual analytics basing Spark(Python)

The total delay time of the major airlines in a certain month

1.Preparation

1.1.Data

This data set was downloaded from the U.S. Department of Transportation, Office of the Secretary of Research on November 30, 2014 and represents flight data for the domestic United States in April of 2014.

The following CSV files are lookup tables:

- airlines.csv

- airports.csv

These files provide detailed information about the reference codes in the main data set and have header rows to identify their fields.

The flights.csv contains flight statistics for April 2014 with the following fields:

- flight date (yyyy-mm-dd)

- airline id (lookup in airlines.csv)

- flight num

- origin (lookup in airports.csv)

- destination (lookup in airports.csv)

- departure time (HHMM)

- departure delay (minutes)

- arrival time (HHMM)

- arrival delay (minutes)

- air time (minutes)

- distance (miles)

1.2.Codes—a basic template

A basic template for writing a Spark application in Python is as follows:

## Spark Application - execute with spark-submit

## Imports
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "My Spark Application"

## Closure Functions

## Main functionality
def main(sc):
    pass

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc   = SparkContext(conf=conf)

    # Execute Main functionality
    main(sc)

1.3.Codes: the entire app

The entire app is as follows:

## Spark Application - execute with spark-submit (Python 3)

## Imports
import csv
import matplotlib.pyplot as plt
from io import StringIO
from datetime import datetime
from collections import namedtuple
from operator import add, itemgetter
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "Flight Delay Analysis"
DATE_FMT = "%Y-%m-%d"
TIME_FMT = "%H%M"
fields   = ('date', 'airline', 'flightnum', 'origin', 'dest', 'dep',
            'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')
Flight   = namedtuple('Flight', fields)

## Closure Functions
def parse(row):
    """
    Parses a row and returns a named tuple.
    """
    row[0]  = datetime.strptime(row[0], DATE_FMT).date()
    row[5]  = datetime.strptime(row[5], TIME_FMT).time()
    row[6]  = float(row[6])
    row[7]  = datetime.strptime(row[7], TIME_FMT).time()
    row[8]  = float(row[8])
    row[9]  = float(row[9])
    row[10] = float(row[10])
    return Flight(*row[:11])

def split(line):
    """
    Operator function for splitting a line with csv module
    """
    reader = csv.reader(StringIO(line))
    return next(reader)

def plot(delays):
    """
    Show a bar chart of the total delay per airline
    """
    airlines = [d[0] for d in delays]
    minutes  = [d[1] for d in delays]
    index    = list(range(len(airlines)))
    fig, axe = plt.subplots()
    # align='edge' starts each bar at idx, which is what the
    # idx + 0.5 label offsets below assume
    bars = axe.barh(index, minutes, align='edge')
    # Add the total minutes to the right
    # (named "mins" so the built-in min() is not shadowed)
    for idx, air, mins in zip(index, airlines, minutes):
        if mins > 0:
            bars[idx].set_color('#d9230f')
            axe.annotate(" %0.0f min" % mins, xy=(mins+1, idx+0.5), va='center')
        else:
            bars[idx].set_color('#469408')
            axe.annotate(" %0.0f min" % mins, xy=(10, idx+0.5), va='center')
    # Set the ticks
    ticks = plt.yticks([idx+ 0.5 for idx in index], airlines)
    xt = plt.xticks()[0]
    plt.xticks(xt, [' '] * len(xt))
    # minimize chart junk
    plt.grid(axis = 'x', color ='white', linestyle='-')
    plt.title('Total Minutes Delayed per Airline')
    plt.show()

## Main functionality
def main(sc):
    # Load the airlines lookup dictionary
    airlines = dict(sc.textFile("ontime/airlines.csv").map(split).collect())
    # broadcast the lookup dictionary to the cluster
    airline_lookup = sc.broadcast(airlines)
    # Read the CSV Data into an RDD
    flights = sc.textFile("ontime/flights.csv").map(split).map(parse)
    # Map the total delay to the airline (joined using the broadcast value)
    delays  = flights.map(lambda f: (airline_lookup.value[f.airline],
                                     add(f.dep_delay, f.arv_delay)))
    # Reduce the total delay for the month to the airline
    delays  = delays.reduceByKey(add).collect()
    delays  = sorted(delays, key=itemgetter(1))
    # Provide output from the driver
    for d in delays:
        print("%0.0f minutes delayed\t%s" % (d[1], d[0]))
    # Show a bar chart of the delays
    plot(delays)

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName(APP_NAME)
    sc   = SparkContext(conf=conf)
    # Execute Main functionality
    main(sc)

1.4.Equipment

An Ubuntu computer with Spark, the JDK, and Scala installed.

2.Steps

2.1.Overview

data

codes

2.2.Use the spark-submit command as follows:

2.3.Result

2.4.Analysis

What is this code doing? Let's look in particular at the main function, which does the work most directly related to Spark. First, we load a CSV file into an RDD, then map the split function over it. The split function parses each line of text using the csv module and returns a list that represents the row. Finally we call the collect action on the RDD, which brings the data back from the RDD to the driver as a Python list. In this case, airlines.csv is a small lookup table that lets us join airline codes with the airline's full name. We store this lookup table as a Python dictionary and then broadcast it to every node in the cluster using sc.broadcast.

Next, the main function loads the much larger flights.csv. After splitting the CSV rows, we map the parse function over each row, which converts dates and times to Python dates and times and casts the numeric fields to floats. It also stores the row as a namedtuple called Flight for efficient, convenient access.
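
The split and parse steps can be tried on a single line without a Spark cluster. The sketch below is a Python 3 rendering of those two functions applied to one invented row; the field values are made up to match the layout described in section 1.1 and are not taken from the real data set.

```python
import csv
from collections import namedtuple
from datetime import datetime
from io import StringIO

fields = ('date', 'airline', 'flightnum', 'origin', 'dest', 'dep',
          'dep_delay', 'arv', 'arv_delay', 'airtime', 'distance')
Flight = namedtuple('Flight', fields)


def split(line):
    """Split one CSV line into a list of strings."""
    return next(csv.reader(StringIO(line)))


def parse(row):
    """Convert the string fields of a row and wrap them in a Flight."""
    row[0] = datetime.strptime(row[0], "%Y-%m-%d").date()
    row[5] = datetime.strptime(row[5], "%H%M").time()
    row[7] = datetime.strptime(row[7], "%H%M").time()
    for i in (6, 8, 9, 10):  # delays, air time, distance
        row[i] = float(row[i])
    return Flight(*row[:11])


# Invented sample row, for illustration only
line = "2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00"
flight = parse(split(line))
print(flight.airline, flight.dep_delay + flight.arv_delay)
```

Because Flight is a namedtuple, later stages can refer to `f.airline` or `f.dep_delay` by name instead of by column index.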

With an RDD of Flight objects in hand, we map an anonymous function that transforms the RDD into a series of key-value pairs, where the key is the name of the airline and the value is the sum of the arrival and departure delays. The delays of each airline are summed using the reduceByKey transformation and the add operator, and the resulting RDD is collected back to the driver (again, the number of airlines in the data is relatively small). Finally the delays are sorted in ascending order, and the output is printed to the console as well as visualized using matplotlib.
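
To see what the map, reduceByKey, and sort steps compute, here is a single-machine sketch in plain Python. The helper `reduce_by_key` only imitates the result of Spark's reduceByKey on a local list; it is not how Spark implements it, and the lookup entries and flight tuples are illustrative values.

```python
from operator import add, itemgetter


def reduce_by_key(pairs, func):
    """Combine the values of each key with func, on a local list of pairs."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())


# Illustrative lookup table and (airline code, dep_delay, arv_delay) tuples
airline_lookup = {"19805": "American Airlines", "19790": "Delta Air Lines"}
flights = [
    ("19805", -6.0, 2.0),
    ("19790", 10.0, 15.0),
    ("19805", 30.0, 25.0),
]

# Map: (airline name, total delay of the flight)
delays = [(airline_lookup[code], add(dep, arv)) for code, dep, arv in flights]
# Reduce per airline, then sort ascending by total delay
delays = sorted(reduce_by_key(delays, add), key=itemgetter(1))

for name, minutes in delays:
    print("%0.0f minutes delayed\t%s" % (minutes, name))
```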

3.Q&A

3.1.ImportError: No module named matplotlib.pyplot

http://www.codeweblog.com/importerror-no-module-named-matplotlib-pyplot/

Note:

This demo originally came from the website

https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python

I also found a Chinese version:

http://blog.jobbole.com/86232/

The data come from GitHub:

https://github.com/bbengfort/hadoop-fundamentals/blob/master/data/ontime.zip
