| Title | Created | Published | Last modified | Notes |
| --- | --- | --- | --- | --- |
| Model FAQ Quick Reference | 2020.08.21 | 2020.12.21 | 2021.01.21 | / |

XGBoost feature importance

import numpy as np

# Indices of features sorted by importance, descending
sorted_idx = np.argsort(model.feature_importances_)[::-1]

## Print features in order of importance
for index in sorted_idx:
    print([train[feature].columns[index], model.feature_importances_[index]])

## Plot the result
from xgboost import plot_importance
plot_importance(model, max_num_features=15)

Saving and loading a model

import pickle

# Save the trained model, then load it back from the same path
pickle.dump(model, open("model.pickle.dat", "wb"))
model = pickle.load(open("model.pickle.dat", "rb"))

Precision, recall, and threshold probing

import pandas as pd
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve

# Predicted probabilities for the positive class (1-D array)
predict_pro = model.predict_proba(test[feature])[:, 1]

# Build a probing dataframe; thresholds has one fewer entry than
# precision/recall, so the last row of that column is NaN
pre_recall_curve = pd.DataFrame(precision_recall_curve(test[label], predict_pro)).T
pre_recall_curve.columns = ['precision', 'recall', 'thresholds']
pre_recall_curve[pre_recall_curve['precision'] > 0.95]

# Plot the precision-recall curve
# (plot_precision_recall_curve was removed in scikit-learn 1.2;
#  use PrecisionRecallDisplay.from_estimator there instead)
plot_precision_recall_curve(model, test[feature], test[label])

Model attribution with SHAP

# SHAP workaround
## Works around a TreeExplainer compatibility issue with XGBoost 1.1.0
## refer: https://github.com/slundberg/shap/issues/1215#issuecomment-641102855
mybooster = model.get_booster()
model_bytearray = mybooster.save_raw()[4:]  # drop the 4-byte format header
def myfun(self=None):
    return model_bytearray
mybooster.save_raw = myfun

## SHAP analysis
import shap

explainer = shap.TreeExplainer(mybooster)
X = data_df[features]
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

Simhash similarity matching

# source: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

from simhash import Simhash, SimhashIndex

data = {
        1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
        2: 'How are you i am fine. blar blar blar blar blar than',
        3: 'This is simhash test.',
        4: 'How are you i am fine. blar blar blar blar blar thank1',
    }

objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs, k=3)

s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
index.get_near_dups(s1)  # IDs of indexed texts within Hamming distance k of s1

cat2vec feature embedding

# source: https://github.com/jaume-ferrarons/paratus
# https://github.com/jaume-ferrarons/paratus/blob/master/tests/cat2vec_test.py
# Note: features to be embedded must be label-encoded first;
# raw standalone numbers cannot be fed in directly

# Other references
# https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0
# https://towardsdatascience.com/categorical-embedding-and-transfer-learning-dd3c4af6345d
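The label-encoding note above matters because the encoded integers are used directly as row indices into an embedding matrix. A minimal numpy sketch of that lookup (the column values, table size, and embedding dimension are made up for illustration; a real cat2vec model learns the table weights during training):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A toy categorical column (illustrative data, not from these notes)
colors = ['red', 'blue', 'green', 'blue', 'red']

# Label-encode: map each category to an integer in 0..n_classes-1
le = LabelEncoder()
codes = le.fit_transform(colors)  # classes_ sorted: blue=0, green=1, red=2

# One embedding row per category; randomly initialized here,
# whereas cat2vec would learn these weights during training
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(le.classes_), embedding_dim))

# Look up each row by its integer code -- this indexing is why
# un-encoded raw values cannot be used
vectors = embedding_table[codes]  # shape (5, 4)
```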

Regex: matching the number after a specific string

# refer: https://www.jb51.net/article/166939.htm
import re

logs = 'tensorflow:Final best valid   0 loss=0.20478513836860657 norm_loss=0.767241849151384 roc=0.8262403011322021 pr=0.39401692152023315 calibration=0.9863265752792358 rate=0.0'

# Match the number following "calibration="
pattern = re.compile(r'(?<=calibration=)\d+\.?\d*')
result = pattern.findall(logs)

# Convert the matches to float
list(map(float, result))

Feature selection

1. Remove collinear features
2. Rank features by importance
3. Drop the feature and check how model accuracy changes
4. Regenerate the feature from the same distribution and assign values at random
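Step 4 above (randomly reassigning a feature's values while keeping its distribution) is essentially permutation importance. A sketch using scikit-learn's `permutation_importance`; the dataset, model, and parameters are illustrative, not from these notes:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy;
# a large drop means the model relies on that feature
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
sorted_idx = np.argsort(result.importances_mean)[::-1]  # most important first
```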

About the author