Tencent Advertising Algorithm Analysis

2017-04-27 21:05:25





spark

Outline / Content

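Notes

Note that the training data for this competition consists of ad logs up to 00:00 on day 31. For the last few days of training data, some label=0 records are therefore unreliable: the ad system may only learn after day 31 that the true label is 1. As a first step, we drop the last 9 days from the training data to keep the labels accurate.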
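The filtering step described above could be sketched as follows. This is only an illustration: the `TrainRow` type and the `DropRecentDays` helper are hypothetical names, and the sketch assumes that clickTime is encoded as DDHHMM and that day 30 is the last full day in the log; neither assumption is confirmed by these notes.

```scala
// Assumption: clickTime is encoded as DDHHMM, so the day is clickTime / 10000.
// Assumption: the caller knows the last full day covered by the training log.
case class TrainRow(label: Int, clickTime: Int)

object DropRecentDays {
  // Extract the day component from a DDHHMM timestamp.
  def dayOf(clickTime: Int): Int = clickTime / 10000

  // Keep only rows clicked on or before day (lastDay - dropDays).
  def dropLastDays(rows: Seq[TrainRow], lastDay: Int, dropDays: Int): Seq[TrainRow] =
    rows.filter(r => dayOf(r.clickTime) <= lastDay - dropDays)
}
```

With lastDay = 30 and dropDays = 9, a click on day 17 is kept while a click on day 25 is discarded.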

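Algorithm selection

1 Multivariate logistic regression

First discussion (Spark):
0. Data preprocessing: merge all tables into a single table.
1. Logistic regression on all features.
2. PCA, Pearson correlation.
3. Logistic regression with feature selection.
4. Feature extraction.
5. Feature construction.
6. GBDT
7……8.

Decide on a cross-validation strategy: to avoid overfitting, make sure a cross-validation strategy is in place from the early stages. A good CV strategy helps you get a reliable score on the leaderboard.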
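One simple way to set up such a strategy is a seeded k-fold split. The sketch below is an assumption about how it might be done, not the strategy actually used here; `KFold`, `foldAssignments`, and `split` are hypothetical names.

```scala
object KFold {
  // Assign each of n samples to one of k folds after a seeded shuffle,
  // so the split is reproducible between runs.
  def foldAssignments(n: Int, k: Int, seed: Long): Vector[Int] = {
    val shuffled = new scala.util.Random(seed).shuffle((0 until n).toVector)
    val folds = Array.fill(n)(0)
    shuffled.zipWithIndex.foreach { case (sampleIdx, pos) => folds(sampleIdx) = pos % k }
    folds.toVector
  }

  // For fold f, return (training indices, validation indices).
  def split(assign: Vector[Int], f: Int): (Vector[Int], Vector[Int]) = {
    val (valid, train) = assign.indices.toVector.partition(i => assign(i) == f)
    (train, valid)
  }
}
```

Because the logs are time-ordered, holding out the most recent days instead of random folds may be a reasonable alternative.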

2 Neural networks

Deep learning resources

https://www.qcloud.com/community/article/20060001483068787

Reference links

User features

http://www.chinaz.com/manage/2015/1119/472788.shtml

http://blog.csdn.net/ariessurfer/article/details/40380051

http://blog.csdn.net/lilyth_lilyth/article/details/48032119/

http://www.docin.com/p-1246987389.html

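Initial workflow

1 Data preprocessing

1 Aggregate the tables into a single table by userId

Detailed join design

Tencent algorithm competition
1 Join user_installedapps.csv and user_app_actions.csv on userID and appID to get the Hive table temps1 (userID, appID, installTime)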

hiveContext.sql("select user_app_actions.userID,user_app_actions.installTime,user_app_actions.appID from user_app_actions join user_installedapps where user_app_actions.userID = user_installedapps.userID AND user_app_actions.appID=user_installedapps.appID")

2 Join temps1 with user.csv on userID to get the temps2 table
(userID, appID, installTime, age, gender, education, marriageStatus, haveBaby, hometown, residence)

hiveContext.sql("select user.userID,user.age,user.gender,user.education,user.marriageStatus,user.haveBaby,user.hometown,user.residence,temps1.appID,temps1.installTime from user join temps1 where user.userID=temps1.userID")

3 Join temps2 with the table for app_categories.csv on appID to get the temps3 table

hiveContext.sql("select temps2.userID,temps2.age,temps2.gender,temps2.education,temps2.marriageStatus,temps2.haveBaby,temps2.hometown,temps2.residence,temps2.appID,temps2.installTime,app_categories.appCategory from app_categories join temps2 where temps2.appID=app_categories.appID")

[userID,bigint,null]
[age,int,null]
[gender,int,null]
[education,int,null]
[marriageStatus,int,null]
[haveBaby,int,null]
[hometown,int,null]
[residence,int,null]
[appID,int,null]
[installTime,int,null]
[appCategory,int,null]

4 Join temps3 with the table for train.csv on userID to get the temps4 table

hiveContext.sql("select temps3.userID,temps3.age,temps3.gender,temps3.education,temps3.marriageStatus,temps3.haveBaby,temps3.hometown,temps3.residence,temps3.appID,temps3.installTime,temps3.appCategory,train.label,train.clickTime,train.conversionTime,train.positionID,train.connectionType,train.telecomsOperator from train join temps3 where train.userID=temps3.userID")

5 Join temps4 with the table for position.csv on positionID to get temps5

hiveContext.sql("select temps4.userID,temps4.age,temps4.gender,temps4.education,temps4.marriageStatus,temps4.haveBaby,temps4.hometown,temps4.residence,temps4.appID,temps4.installTime,temps4.appCategory,temps4.label,temps4.clickTime,temps4.conversionTime,temps4.positionID,temps4.connectionType,temps4.telecomsOperator,position.sitesetID,position.positionType from position join temps4 where temps4.positionID=position.positionID")

6 Join temps5 with the table for ad.csv on appID

hiveContext.sql("select temps5.userID,temps5.age,temps5.gender,temps5.education,temps5.marriageStatus,temps5.haveBaby,temps5.hometown,temps5.residence,temps5.appID,temps5.installTime,temps5.appCategory,temps5.label,temps5.clickTime,temps5.conversionTime,temps5.positionID,temps5.connectionType,temps5.telecomsOperator,temps5.sitesetID,temps5.positionType,ad.creativeID,ad.adID,ad.camgaignID,ad.advertiserID,ad.appPlatform from ad join temps5 where temps5.appID=ad.appID")

Result table lasttemps
[userID,bigint,null]
[age,int,null]
[gender,int,null]
[education,int,null]
[marriageStatus,int,null]
[haveBaby,int,null]
[hometown,int,null]
[residence,int,null]
[appID,int,null]
[installTime,int,null]
[appCategory,int,null]
[label,int,null]
[clickTime,int,null]
[conversionTime,int,null]
[positionID,int,null]
[connectionType,int,null]
[telecomsOperator,int,null]
[sitesetID,int,null]
[positionType,int,null]
[creativeID,int,null]
[adID,int,null]
[camgaignID,int,null]
[advertiserID,int,null]
[appPlatform,int,null]
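The join statements above repeat long, hand-qualified column lists, which is where typos tend to creep in. As a hedged sketch, a small helper could generate the SQL text instead. `JoinSql`, `qualify`, and `innerJoin` are hypothetical names, and the sketch uses an explicit `join ... on` clause rather than the `join ... where` form used above.

```scala
object JoinSql {
  // Qualify each column with its table name, e.g. "temps4.userID".
  def qualify(table: String, cols: Seq[String]): Seq[String] =
    cols.map(c => s"$table.$c")

  // Build the SQL text for an inner join of `left` and `right` on a shared key column.
  def innerJoin(left: String, leftCols: Seq[String],
                right: String, rightCols: Seq[String], key: String): String = {
    val selected = (qualify(left, leftCols) ++ qualify(right, rightCols)).mkString(",")
    s"select $selected from $left join $right on $left.$key = $right.$key"
  }
}
```

The resulting string could then be passed to hiveContext.sql in the same way as the handwritten queries.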

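7 Preprocessing

Filter, denoise, and reduce the dimensionality of the training set, then select features

8 Model training

9 Model prediction

10 Model evaluation

11 Model optimization

12 Submit results

Create the combined table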

hiveContext.sql("create table newalltables(userID int,age int,gender int,education int,marriageStatus int,haveBaby int,hometown int,residence int,appID int,installTime int,appCategory int,label int,clickTime int,conversionTime int,positionID int,connectionType int,telecomsOperator int,sitesetID int,positionType int,creativeID int,adID int,camgaignID int,advertiserID int,appPlatform int) location '/usr/local/bigdata/tencentData/pre/newalltables'")

Insert data

scala> hiveContext.sql("insert into table newalltables select userID,age,gender,education,marriageStatus,haveBaby,hometown,residence,appID,installTime,appCategory,label,clickTime,conversionTime,positionID,connectionType,telecomsOperator,sitesetID,positionType,creativeID,adID,camgaignID,advertiserID,appPlatform from lasttemps")

Aggregating the test tables

1 Join test with user on userID

hiveContext.sql("select test.instanceID,test.userID,test.positionID,test.connectionType,test.clickTime,test.creativeID,test.telecomsOperator,user.age,user.gender,user.education,user.marriageStatus,user.haveBaby,user.hometown,user.residence from test join user where user.userID=test.userID").createOrReplaceTempView("testtemps1")
This yields the temporary view testtemps1

2 Join the testtemps1 temporary view with the ad table on creativeID to get testtemps3, as follows:

hiveContext.sql("select testtemps1.instanceID,testtemps1.userID,testtemps1.positionID,testtemps1.connectionType,testtemps1.clickTime,testtemps1.creativeID,testtemps1.telecomsOperator,testtemps1.age,testtemps1.gender,testtemps1.education,testtemps1.marriageStatus,testtemps1.haveBaby,testtemps1.hometown,testtemps1.residence,ad.adID,ad.camgaignID,ad.advertiserID,ad.appID,ad.appPlatform from testtemps1 join ad where ad.creativeID=testtemps1.creativeID")

Join testtemps3 with the position table on positionID to get testtemps4

hiveContext.sql("select testtemps3.instanceID,testtemps3.userID,testtemps3.positionID,testtemps3.connectionType,testtemps3.clickTime,testtemps3.creativeID,testtemps3.telecomsOperator,testtemps3.age,testtemps3.gender,testtemps3.education,testtemps3.marriageStatus,testtemps3.haveBaby,testtemps3.hometown,testtemps3.residence,testtemps3.adID,testtemps3.camgaignID,testtemps3.advertiserID,testtemps3.appID,testtemps3.appPlatform,position.sitesetID,position.positionType from testtemps3 join position where position.positionID=testtemps3.positionID")

Join testtemps4 with app_categories on appID

hiveContext.sql("select testtemps4.instanceID,testtemps4.userID,testtemps4.positionID,testtemps4.connectionType,testtemps4.clickTime,testtemps4.creativeID,testtemps4.telecomsOperator,testtemps4.age,testtemps4.gender,testtemps4.education,testtemps4.marriageStatus,testtemps4.haveBaby,testtemps4.hometown,testtemps4.residence,testtemps4.adID,testtemps4.camgaignID,testtemps4.advertiserID,testtemps4.appID,testtemps4.appPlatform,testtemps4.sitesetID,testtemps4.positionType,app_categories.appCategory from testtemps4 join app_categories where app_categories.appID=testtemps4.appID")

hiveContext.sql("create table lasttest2(instanceID int,userID int,positionID int,connectionType int,clickTime int,creativeID int,telecomsOperator int,age int,gender int,education int,marriageStatus int,haveBaby int,hometown int,residence int,adID int,camgaignID int,advertiserID int,appID int,appPlatform int,sitesetID int,positionType int,appCategory int) location '/usr/local'")

2 Drop the last 9 days of training data

3 Drop userId and all other ID columns

Prediction method

Test data input

instanceID

Uniquely identifies a sample

-1

clickTime

creativeID

userID

positionID

connectionType

telecomsOperator

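Test data output

instanceID uniquely identifies a test sample; rows must be sorted in ascending order

prob is the model's predicted ad conversion probability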
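The output requirement above (ascending instanceID, one prob per row) could be met with something like the following sketch. `Submission` and `toCsvLines` are hypothetical names, and the header line and comma-separated layout are assumptions rather than something these notes specify.

```scala
object Submission {
  // Render (instanceID, prob) pairs as CSV lines sorted by ascending instanceID.
  // Assumption: the file starts with a header line "instanceID,prob".
  def toCsvLines(preds: Seq[(Int, Double)]): Seq[String] =
    "instanceID,prob" +: preds.sortBy(_._1).map { case (id, p) => s"$id,$p" }
}
```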

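1. Predict the click-through rate 2. Predict the activation probability

Conversion rate formula

Analogous to the total number of purchases divided by the total number of clicks on a given Taobao ad slot.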
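Read as a formula, the analogy above is conversions divided by clicks. A minimal sketch, with the hypothetical helper `ConversionRate.rate` and an assumed convention of returning 0.0 when there are no clicks:

```scala
object ConversionRate {
  // conversions / clicks; returns 0.0 when there are no clicks,
  // which is a convention chosen here to avoid dividing by zero.
  def rate(clicks: Long, conversions: Long): Double =
    if (clicks == 0L) 0.0 else conversions.toDouble / clicks.toDouble
}
```

For example, 5 conversions out of 200 clicks gives a rate of 0.025.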

Loss function / evaluation method
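These notes do not name the metric. Assuming it is logarithmic loss, a common choice when the output is a predicted probability, it could be computed as in the sketch below. `LogLoss` is a hypothetical helper, and the clipping constant eps is an assumption.

```scala
object LogLoss {
  // Mean negative log-likelihood of binary labels under predicted probabilities.
  // Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
  def apply(labels: Seq[Int], probs: Seq[Double], eps: Double = 1e-15): Double = {
    require(labels.length == probs.length && labels.nonEmpty, "inputs must be non-empty and aligned")
    val total = labels.zip(probs).map { case (y, p) =>
      val q = math.min(math.max(p, eps), 1.0 - eps)
      if (y == 1) -math.log(q) else -math.log(1.0 - q)
    }.sum
    total / labels.length
  }
}
```

Lower is better; predicting 0.5 for every sample gives a loss of ln 2.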


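Ad features in the training data

Account ID (advertiserID)

Tencent Social Ads uses a four-level account structure: account, campaign, ad, creative. An account corresponds to one specific advertiser.

Campaign ID (campaignID)

A campaign is a collection of ads, similar to a folder on a computer. Advertisers can place ads that share conditions such as target platform, budget cap, and evenly paced delivery in the same campaign for easier management.

Ad ID (adID)

In the Tencent Social Ads management platform, an ad is the ad creative created by the advertiser together with its display settings. It includes the ad's basic information (ad name, delivery period, etc.), promotion objective, delivery platform, ad format, the creatives delivered, the audience (the targeting settings), and the bid. There is no limit on the number of ads under a single campaign.

Creative ID (creativeID)

The ad content shown directly to the user. One ad can have multiple creatives.

AppID (appID)

The target landing page of the ad, i.e. the page shown to the user after a click; here the page is always a specific App. Multiple campaigns or ads can promote the same App at the same time.

App category (appCategory)

The category label set by the App developer. Labels have two levels and are encoded with 2 digits; for example, "32" means level-1 category ID 3 and level-2 category ID 2. When the category is unknown or unavailable, it is marked as 0.

App platform (appPlatform)

The operating system platform of the App: Android, iOS, or unknown. A given appID belongs to only one platform.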

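Context features in the training data

Ad position ID (positionID)

The specific location where the ad is shown, e.g. a Qzone Feeds ad position.

Site set ID (sitesetID)

An aggregation of multiple ad positions, e.g. Qzone.

Ad position type (positionType)

A manually defined classification of ad position formats for certain sites, e.g. Banner ad positions.

Connection type (connectionType)

The network connection currently used by the mobile device: 2G, 3G, 4G, WIFI, or unknown.

Telecom operator (telecomsOperator)

The carrier currently used by the mobile device: China Mobile, China Unicom, China Telecom, or unknown.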

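User features in the training data

User ID (userID)

Uniquely identifies a user.

Age (age)

Range [0, 80], where 0 means unknown.

Gender (gender)

Male, female, or unknown.

Education (education)

The user's highest level of education to date, with no distinction between current students and graduates: primary school, junior high school, senior high school, junior college, bachelor's, master's, doctorate, or unknown.

Marital status (marriageStatus)

The user's current relationship status: single, newlywed, married, or unknown.

Parenting status (haveBaby)

The user's current pregnancy and child status: expecting, baby 0~6 months, baby 6~12 months, baby 1~2 years, baby 2~3 years, has a child of unknown age, or unknown.

Hometown (hometown)

The user's place of birth, at city level, using a two-level code: the thousands and hundreds digits give the province, and the tens and units digits give the city within the province. For example, 1806 means province number 18 and city number 6 within that province; 0 means unknown.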
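The two-level code described above can be split with integer division and remainder. A minimal sketch; `CityCode.decode` is a hypothetical helper name:

```scala
object CityCode {
  // Split a code such as 1806 into (province, city).
  // The code 0 decodes to (0, 0), which stands for "unknown".
  def decode(code: Int): (Int, Int) = (code / 100, code % 100)
}
```

Splitting hometown and residence this way would yield separate province and city features, which may be more useful to a model than the raw four-digit code.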

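Residence (residence)

Where the user has been living over the recent period, at city level, encoded the same way as hometown.

App install list (appInstallList)

The user's full list of installed Apps (appID) as of a certain point in time; high-frequency and low-frequency Apps have been filtered out.

App install log

A log of the user's App install actions over a recent period, including appID, the time of the action (installTime), and the app category (appCategory); high-frequency and low-frequency Apps have been filtered out.

-1 is a placeholder for label, meaning the value is to be predicted.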

Data integration

Create Hive tables

scala> hiveContext.sql("create table ad(creativeID int,adID int,camgaignID int,advertiserID int,appID int,appPlatform int) row format DELIMITED FIELDS TERMINATED BY ',' location '/usr/local/bigdata/pre'").collect.foreach(println)

scala> hiveContext.sql("LOAD DATA LOCAL INPATH '/usr/local/bigdata/pre/ad.csv' into table ad")
