| Title | Created | Published | Last modified | Notes |
| --- | --- | --- | --- | --- |
| Model FAQ quick reference | 2020.08.21 | 2020.12.21 | 2022.01.03 | / |
XGBoost feature importance
import numpy as np
## sort features by importance, descending
sorted_idx = np.argsort(model.feature_importances_)[::-1]
for index in sorted_idx:
    # print([train[feature].columns[index], model.feature_importances_[index]])
    print([features[index], model.feature_importances_[index]])
## round the importances to 3 decimals
[round(elem, 3) for elem in model.feature_importances_]
## plot the importances
from xgboost import plot_importance
plot_importance(model, max_num_features=15)
## PS: add a random column as a noise baseline for the importances
df1['randNumCol'] = np.random.randint(1, 6, df1.shape[0])
Saving and loading models
import pickle
with open("model.pickle.dat", "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    model = pickle.load(f)
Recall, precision, and threshold exploration
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve
# predicted probabilities for the positive class
predict_pro = model.predict_proba(test[feature])[:, 1:]
# build an exploration dataframe
# (thresholds is one element shorter than precision/recall, so the last row holds a NaN threshold)
pre_recall_curve = pd.DataFrame(precision_recall_curve(test[label], predict_pro)).T
pre_recall_curve.columns = ['precision', 'recall', 'thresholds']
# e.g. thresholds that keep precision above 0.95
pre_recall_curve[pre_recall_curve['precision'] > 0.95]
# plot the precision-recall curve
# (plot_precision_recall_curve was removed in scikit-learn 1.2; use PrecisionRecallDisplay.from_estimator there)
plot_precision_recall_curve(model, test[feature], test[label])
Custom formatting of model outputs
# output probabilities (extract one column from a 2-D array)
# https://stackoverflow.com/questions/903853/how-do-you-extract-a-column-from-a-multi-dimensional-array
# round to x decimal places
# https://stackoverflow.com/questions/2762058/format-all-elements-of-a-list/2762087
# binning https://www.cnblogs.com/wzdly/p/9853209.html
# equal-frequency bins
train["age_bin"] = pd.qcut(train["age"], 10)
# equal-width bins
train["age_bin"] = pd.cut(train["age"], 10)
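The first two links above cover column extraction and rounding; a minimal sketch of both, using a made-up 2-column array in place of real predict_proba output:

```python
import numpy as np

# hypothetical predict_proba-style output: one row per sample,
# columns = [P(class 0), P(class 1)]
proba = np.array([[0.91, 0.09],
                  [0.23, 0.77],
                  [0.50, 0.50]])

# extract the positive-class column (the "extract a column" link)
positive = proba[:, 1]

# round every element to 3 decimal places (the "format all elements" link)
rounded = [round(p, 3) for p in positive]
print(rounded)  # [0.09, 0.77, 0.5]
```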
Model attribution with SHAP
# shap workflow
## work around a compatibility issue between shap and xgboost 1.1
## refer: https://github.com/slundberg/shap/issues/1215#issuecomment-641102855
mybooster = model.get_booster()
model_bytearray = mybooster.save_raw()[4:]
def myfun(self=None):
    return model_bytearray
mybooster.save_raw = myfun
## shap analysis
import shap
explainer = shap.TreeExplainer(mybooster)
X = data_df[features]
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Simhash similarity matching
# source: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
from simhash import Simhash, SimhashIndex
data = {
1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: 'How are you i am fine. blar blar blar blar blar than',
3: 'This is simhash test.',
4: 'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs, k=3)
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
index.get_near_dups(s1)
cat2vec feature embedding
# source: https://github.com/jaume-ferrarons/paratus
# https://github.com/jaume-ferrarons/paratus/blob/master/tests/cat2vec_test.py
# note: features to be embedded must be label-encoded first; bare numeric values will not work
# other references
# https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0
# https://towardsdatascience.com/categorical-embedding-and-transfer-learning-dd3c4af6345d
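The note above says categorical features must be label-encoded before embedding; a minimal sketch of that preprocessing step with scikit-learn (the paratus Cat2Vec call itself is omitted here, since its exact API is not shown in these notes; the toy dataframe is an assumption):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"]})

# map each category to a contiguous integer id, as the embedding layer expects
le = LabelEncoder()
df["city_encoded"] = le.fit_transform(df["city"])
print(df["city_encoded"].tolist())  # [0, 1, 0, 2]
```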
Regex: match the number after a specific token
# refer: https://www.jb51.net/article/166939.htm
import re
logs = 'tensorflow:Final best valid 0 loss=0.20478513836860657 norm_loss=0.767241849151384 roc=0.8262403011322021 pr=0.39401692152023315 calibration=0.9863265752792358 rate=0.0'
# match the number right after "calibration="
pattern = re.compile(r'(?<=calibration=)\d+\.?\d*')
result = pattern.findall(logs)
# convert the matches to float
list(map(float, result))
Model tuning
XGBoost on imbalanced data
# https://stackoverflow.com/questions/67868420/xgboost-for-multiclassification-and-imbalanced-data
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target name
)
xgb_classifier.fit(X, y, sample_weight=sample_weights)
Elbow method for clustering
# https://www.cxybb.com/article/Totoro1745/117884008
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def plot_elbow(X):
    """
    Pick the number of clusters with the elbow method
    :param X: data
    :return: None (shows the plot)
    """
    distortions = []
    for i in range(2, 30):
        ks = KMeans(n_clusters=i, random_state=42).fit(X)
        distortions.append(ks.inertia_)
    plt.plot(range(2, 30), distortions, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Distortion')
    plt.show()
Feature engineering
One-hot encoding
# refer: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python
import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return res
GBDT-generated features
# https://towardsdatascience.com/feature-generation-with-gradient-boosted-decision-trees-21d4946d6ab5
from sktools import GradientBoostingFeatureGenerator
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; substitute another regression dataset there
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X = load_boston()["data"]
y = load_boston()["target"]
gbf = GradientBoostingFeatureGenerator(regression=True)
lr = LinearRegression()
pipe = Pipeline([("gb_features", gbf), ("linear", lr)])
pipe.fit(X, y)
mean_absolute_error(pipe.predict(X), y)
Feature selection
1. Remove collinear features
2. Rank features by importance
3. Drop the feature and check how model accuracy changes
4. Regenerate the feature from the same distribution and assign the values at random
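Step 1 above can be sketched with a simple pairwise-correlation filter; the 0.9 threshold and the toy dataframe are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with f1
    "f3": rng.normal(size=200),                      # independent
})

# keep only the upper triangle so each pair is checked once,
# then drop one feature from every pair whose |correlation| exceeds the threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['f2']
```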
Competition-oriented feature selection
1. Train-test adversarial validation (called "train-test GAN" in some write-ups)
   - Drop the original labels from train and test, and assign new labels by origin (train vs test)
   - Train a classifier on those origin labels, watch its AUC, and set an AUC threshold
   - Repeat while AUC > threshold:
     - remove the feature ranked first by importance and retrain
   - The surviving feature set is distributed similarly in train and test, so it is less likely to cause overfitting
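The loop above can be sketched as follows; the toy data (one feature shifted between train and test, one shared), the 0.6 AUC threshold, and the random-forest discriminator are all illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
# "leaky" is distributed differently in train vs test; "stable" is shared
train = pd.DataFrame({"leaky": rng.normal(0, 1, n), "stable": rng.normal(0, 1, n)})
test = pd.DataFrame({"leaky": rng.normal(3, 1, n), "stable": rng.normal(0, 1, n)})

# drop the original labels and relabel rows by origin: 0 = train, 1 = test
X = pd.concat([train, test], ignore_index=True)
y = np.array([0] * n + [1] * n)

features = list(X.columns)
threshold = 0.6  # assumed AUC threshold
while features:
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[features], y)
    auc = roc_auc_score(y, clf.predict_proba(X[features])[:, 1])
    if auc <= threshold or len(features) == 1:
        break
    # remove the most important (most train/test-discriminative) feature and retrain
    top = features[int(np.argmax(clf.feature_importances_))]
    features.remove(top)
print(features)  # the surviving, train/test-consistent feature set
```

Here the shifted "leaky" feature is removed first, leaving only the feature whose distribution matches across the two sets.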
Model packages
catboost
# https://zhuanlan.zhihu.com/p/37916954
# https://catboost.ai/en/docs/concepts/spark-quickstart-python
# https://stackoverflow.com/questions/64988694/how-can-i-get-the-feature-importance-of-a-catboost-in-a-pandas-dataframe
Metrics
Confusion matrix
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
auc
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
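A minimal usage sketch for the two metrics linked above (note that plot_confusion_matrix was removed in scikit-learn 1.2 in favor of ConfusionMatrixDisplay); the toy labels and scores are assumptions:

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.1, 0.6, 0.35, 0.8]  # predicted probabilities

# rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [0 2]]

# auc() integrates an arbitrary curve; feed it the ROC points
fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # 0.75
```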
Cluster center distances
# https://stackoverflow.com/questions/40871186/python-dataframe-matrix-of-euclidean-distance
from scipy.spatial.distance import pdist, squareform
dist = pdist(df[['x1', 'x2']], 'euclidean')
df_dist = pd.DataFrame(squareform(dist))
About the author