Source https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/discussion

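Takeaway 1: Keep the validation data distributed like the test data

Choose the validation set so that its distribution is as close as possible to that of the test set.

  • Method 1: build the validation set with adversarial validation.
  • Method 2: take the most recent period that spans the same number of days as the test set.
  • Method 3: match by analogy. In this competition, “golden week” is analogous to “new year”, so the “new year” period is used as the validation set.

Tip: k-fold is a poor fit for time series because shuffled folds let future information leak into training. The correct approach is a sliding window.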
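The sliding-window idea can be sketched as follows. This is a minimal illustration rather than any competitor's actual code: the function name `sliding_window_splits`, the date range, and the 39-day window are assumptions made for the example. Each validation window has the same length as the forecast horizon (Method 2), and training always stops the day before the window starts.

```python
from datetime import date, timedelta


def sliding_window_splits(first_day, last_day, valid_days, n_splits, step_days=None):
    """Yield (train_start, train_end, valid_start, valid_end), newest window first.

    Each validation window covers `valid_days` consecutive days, and the
    training range ends the day before that window starts, so no future
    information reaches the model.
    """
    step = timedelta(days=step_days if step_days is not None else valid_days)
    valid_end = last_day
    for _ in range(n_splits):
        valid_start = valid_end - timedelta(days=valid_days - 1)
        train_end = valid_start - timedelta(days=1)
        if train_end < first_day:
            break  # not enough history left for another fold
        yield first_day, train_end, valid_start, valid_end
        valid_end -= step  # slide the window back in time


# Hypothetical date range and horizon, chosen only for illustration.
splits = list(sliding_window_splits(date(2016, 1, 1), date(2017, 4, 22),
                                    valid_days=39, n_splits=3))
for train_start, train_end, valid_start, valid_end in splits:
    print(f"train {train_start}..{train_end} | valid {valid_start}..{valid_end}")
```

Rows are then selected by date for each fold, instead of being shuffled as in k-fold.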

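Takeaway 2: Handle outliers explicitly

Special dates (outliers, in effect) deserve special treatment, for example “golden week” in this competition. They should be transformed explicitly rather than left to the model's raw predictions.

  • Method 1: equivalence (treat the special day as an ordinary day type)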

The rules:

Treat a holiday as a Saturday.

If the day before a holiday is a weekday, treat it as a Friday.

If the day after a holiday is a weekday, treat it as a Monday.

This works not only for golden week but also for many other holidays.

So the trick comes from careful EDA and CV rather than luck.
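The rules above can be sketched as a small function. This is an illustration under stated assumptions, not the original author's code: the name `effective_weekday` is hypothetical, and when a weekday sits between two holidays the "before a holiday" rule is applied first, a precedence the source does not specify.

```python
from datetime import date, timedelta


def effective_weekday(day, holidays):
    """Return the weekday (0=Mon .. 6=Sun) to feed the model for `day`."""
    if day in holidays:
        return 5  # a holiday is treated as a Saturday
    is_weekday = day.weekday() < 5
    # Assumption: the "day before" rule takes precedence over the "day after" rule.
    if is_weekday and (day + timedelta(days=1)) in holidays:
        return 4  # weekday before a holiday is treated as a Friday
    if is_weekday and (day - timedelta(days=1)) in holidays:
        return 0  # weekday after a holiday is treated as a Monday
    return day.weekday()


# Small illustrative holiday set: three consecutive Golden Week days in 2017.
golden_week = {date(2017, 5, 3), date(2017, 5, 4), date(2017, 5, 5)}
for offset in range(7):
    d = date(2017, 5, 1) + timedelta(days=offset)
    print(d, d.weekday(), effective_weekday(d, golden_week))
```

The remapped weekday can replace the raw day-of-week feature, or be used to look up which historical days a special date should be compared against.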

  • Method 2: flagging

Use days-from-holiday or days-to-holiday counts, or 0/1 flags for whether the previous/next day and the second-to-last/next-but-one day are holidays.

# Source: https://github.com/MaxHalford/kaggle-recruit-restaurant/blob/master/Solution.ipynb

import pandas as pd

date_info = pd.read_csv('data/kaggle/date_info.csv')
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
# 0/1 flags for the neighboring calendar days; the first and last rows
# have no neighbor, so the missing value is filled with 0.
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)

date_info.head()
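The days-from-holiday and days-to-holiday counts mentioned above could be derived along these lines. This is a sketch under assumptions: the small in-memory frame stands in for date_info.csv, its holiday flags are illustrative, and the column names `days_from_holiday` and `days_to_holiday` are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for date_info.csv: ten consecutive days.
date_info = pd.DataFrame({
    'visit_date': pd.date_range('2017-04-28', periods=10, freq='D'),
    'is_holiday': [0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
})

# Keep the date on holiday rows only; every other row becomes NaT.
holiday_dates = date_info['visit_date'].where(date_info['is_holiday'] == 1)

# Forward fill gives the most recent holiday so far;
# backward fill gives the next upcoming holiday.
date_info['days_from_holiday'] = (date_info['visit_date'] - holiday_dates.ffill()).dt.days
date_info['days_to_holiday'] = (holiday_dates.bfill() - date_info['visit_date']).dt.days

print(date_info)
```

Rows with no earlier (or no later) holiday inside the frame come out as NaN, so they need a fill value or a model that tolerates missing values.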

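Takeaway 3: Build features by looking backward

Feature generation for time series still relies mostly on lagging, that is, on values pushed back from earlier dates.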

Source https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/discussion/49174

  • lagging visitors features 1-14 (by ‘air_store_id’ and by dayOfWeek)
  • lagging visitors features 1-14 (by ‘air_store_id’ only)
  • the lagging-difference visitors features 1-13
  • the lagging-difference-delta4 visitors features 1-10
  • WeightedMovingAverage for the lagging visitors features
  • mean/median/min/max/(percentile10,30,70,90)/sum/count visitor stats features for past 14 days, 28 days, 60 days, 90 days, 120 days, 180 days, 364 days (by ‘air_store_id’ and by dayOfWeek)
  • mean/median/min/max/(percentile10,30,70,90)/sum/count visitor stats features for past 14 days, 28 days, 60 days, 90 days, 120 days, 180 days, 364 days (by ‘air_store_id’ only)

or mean weekly visitors up to lag 20 and for mean monthly visitors up to lag 8 or so…
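A few of the feature families listed above can be sketched with pandas group-wise shifts. This is a minimal illustration, not the quoted solution's code: the toy data, the chosen lags, and the feature column names are assumptions, and the sketch assumes one row per store per calendar day (with gaps in the data, a row shift is no longer a calendar lag).

```python
import pandas as pd

# Hypothetical toy data: two stores, 21 consecutive days each.
dates = pd.date_range('2017-01-01', periods=21, freq='D')
visits = pd.DataFrame({
    'air_store_id': ['store_a'] * 21 + ['store_b'] * 21,
    'visit_date': list(dates) * 2,
    'visitors': list(range(1, 22)) + list(range(101, 122)),
}).sort_values(['air_store_id', 'visit_date']).reset_index(drop=True)
visits['day_of_week'] = visits['visit_date'].dt.dayofweek

by_store = visits.groupby('air_store_id')['visitors']
by_store_dow = visits.groupby(['air_store_id', 'day_of_week'])['visitors']

# Plain lags within each store.
for lag in (1, 2, 7):
    visits[f'lag_{lag}'] = by_store.shift(lag)

# Lag within each (store, weekday) group: the same weekday one week earlier.
visits['dow_lag_1'] = by_store_dow.shift(1)

# Lagging difference between two consecutive lags.
visits['lag_diff_1'] = visits['lag_1'] - visits['lag_2']

# Rolling mean over the previous 14 days; shift(1) keeps the current day out.
visits['mean_14'] = by_store.transform(
    lambda s: s.shift(1).rolling(14, min_periods=1).mean()
)

print(visits.tail())
```

Grouping before shifting keeps one store's history from leaking into another's, and shifting before rolling keeps the target day out of its own features.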

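Takeaway 4: Model each forecast day separately

Model training. Time-series problems usually ask for forecasts over a future window. With the lag-based feature generation described above, it is better to train a separate model for each forecast day to avoid overfitting. The reason: for the later days of the window, a single shared model would need lag features that can only be generated from its own earlier predictions.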
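One way to read the per-day modeling idea is as "direct" multi-step forecasting: for each horizon h, a separate model maps the last observed values to the value h days ahead, so no prediction is ever fed back in as a feature. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the toy series is invented, and ordinary least squares stands in for whatever regressor is actually used.

```python
import numpy as np


def make_direct_dataset(series, horizon, n_lags):
    """Pair each window of `n_lags` observed values with the value `horizon` steps later."""
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1: t + 1])
        y.append(series[t + horizon])
    return np.array(X, dtype=float), np.array(y, dtype=float)


def fit_direct_models(series, max_horizon, n_lags):
    """Fit one linear least-squares model per forecast horizon."""
    models = {}
    for h in range(1, max_horizon + 1):
        X, y = make_direct_dataset(series, h, n_lags)
        X1 = np.hstack([X, np.ones((len(X), 1))])  # append an intercept column
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        models[h] = coef
    return models


def forecast(series, models, n_lags):
    """Predict every horizon from the same last observed window."""
    last = np.append(np.asarray(series[-n_lags:], dtype=float), 1.0)
    return {h: float(last @ coef) for h, coef in models.items()}


# Invented toy series with a weekly pattern, eight weeks long.
history = [10, 12, 11, 13, 20, 30, 25] * 8
models = fit_direct_models(history, max_horizon=3, n_lags=7)
preds = forecast(history, models, n_lags=7)
print(preds)
```

Every horizon uses only truly observed values as input, at the cost of training one model per forecast day.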


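Personal thoughts

Good results mostly come not from complex models or elaborate CV schemes, but from careful observation of the data (for example, the holiday trick plus a public kernel was enough for a silver medal) and from hands-on practice in feature engineering.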

Top solutions are rather simple.