KDD CUP 2009

KDD Cup 2009: Overview

This Year's Challenge

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point. Therefore, part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Outline of the sections below:

Tasks
Rules
Data
Results
FAQs
Organisation

You may also find:

an analysis of the results: the paper, the slides
a presentation of the challenge: the slides

Tasks

KDD Cup 2009: Tasks

Task Description

The task is to estimate the churn, appetency and up-selling probability of customers; hence there are three target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants with access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.

Churn (Wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
Appetency: In our context, the appetency is the propensity to buy a service or a product.
Up-selling (Wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but up-selling can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

Evaluation

Performance is evaluated according to the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling). This is what we call "Score" in the Results section of this page.
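As a worked check of this definition, the short sketch below recomputes the Score of the first-ranked Fast Track entry from the three AUC values listed for it in the results table. The function name `challenge_score` is purely illustrative and is not part of any official challenge tooling.

```python
def challenge_score(churn_auc, appetency_auc, upselling_auc):
    """Arithmetic mean of the three per-task AUC values."""
    return (churn_auc + appetency_auc + upselling_auc) / 3.0


# AUC values of the first-ranked Fast Track entry (IBM Research),
# copied from the Fast Track results table.
score = challenge_score(0.7611, 0.8830, 0.9038)
print(round(score, 4))  # 0.8493, matching the "Score" column
```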

Sensitivity and specificity

The main objective of the challenge is to make good predictions of the target variables. The prediction of each target variable is thought of as a separate classification problem. The results of classification, obtained by thresholding the prediction score, may be represented in a confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false positive) represent the number of examples falling into each possible outcome:

		Prediction
		Class +1	Class -1
Truth	Class +1	tp	fn
Truth	Class -1	fp	tn

Any sort of numeric prediction score is allowed, larger numerical values indicating higher confidence in positive class membership.

We define the sensitivity (also called true positive rate or hit rate) and the specificity (true negative rate) as:

Sensitivity = tp/pos
Specificity = tn/neg

where pos = tp+fn is the total number of positive examples and neg=tn+fp the total number of negative examples.
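The definitions above can be illustrated with a minimal sketch. It assumes labels coded as +1 and -1, as in the challenge data, and a hypothetical threshold of 0 on the prediction score; the function names and the toy values are illustrative only.

```python
def confusion_counts(y_true, scores, threshold=0.0):
    """Count (tp, fn, fp, tn) after thresholding the prediction scores.

    An example is predicted positive when its score is strictly greater
    than the threshold (an assumption of this sketch).
    """
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score > threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = tp/pos and specificity = tn/neg."""
    pos = tp + fn  # total number of positive examples
    neg = tn + fp  # total number of negative examples
    return tp / pos, tn / neg


# Toy example: two positive and three negative examples.
y_true = [1, 1, -1, -1, -1]
scores = [0.9, -0.2, 0.4, -0.7, -0.1]
counts = confusion_counts(y_true, scores)
sens, spec = sensitivity_specificity(*counts)
print(counts)  # (1, 1, 1, 2)
print(sens)    # 0.5
```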

AUC

The results will be evaluated with the so-called Area Under Curve (AUC). It corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying a threshold on the prediction values to determine the classification result. The AUC is related to the area under the lift curve and to the Gini index used in marketing (Gini = 2 AUC - 1). The AUC is calculated using the trapezoid method. When binary scores are supplied for the classification instead of discriminant values, the curve is given by {(0,1), (tn/(tn+fp), tp/(tp+fn)), (1,0)} and the AUC is just the balanced accuracy (BAC), i.e. the average of sensitivity and specificity.
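A minimal sketch of such a computation is given here for illustration; it is not the organizers' evaluation code. It assumes labels coded as +1 and -1, at least one example of each class, and that examples with tied scores cross the threshold together. The function name `auc_trapezoid` and the toy data are hypothetical.

```python
def auc_trapezoid(y_true, scores):
    """Area under the (specificity, sensitivity) curve, trapezoid method."""
    pos = sum(1 for label in y_true if label == 1)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    # Threshold above every score: nothing predicted positive.
    points = [(1.0, 0.0)]  # (specificity, sensitivity)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Lower the threshold past one group of tied scores.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append(((neg - fp) / neg, tp / pos))
        i = j
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x0 - x1) * (y0 + y1) / 2.0  # specificity decreases along the curve
    return area


y_true = [1, 1, -1, -1, -1]
scores = [0.9, -0.2, 0.4, -0.7, -0.1]
auc = auc_trapezoid(y_true, scores)
gini = 2 * auc - 1

# With binary predictions the curve has a single interior point,
# and the area equals the balanced accuracy.
binary = [1, -1, 1, -1, -1]
auc_binary = auc_trapezoid(y_true, binary)
bac = (1 / 2 + 2 / 3) / 2  # sensitivity 1/2, specificity 2/3
print(auc, gini, auc_binary, bac)
```

On the toy data, the AUC with real-valued scores is 2/3 (four of the six positive/negative pairs are ranked correctly), while the thresholded predictions give 7/12, which is exactly the balanced accuracy.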

Rules

KDD Cup 2009: Rules

Competition Rules

Conditions of participation: Anybody who complies with the rules of the challenge (KDDcup 2009) is welcome to participate. Only the organizers are excluded from participating. The KDDcup 2009 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2009), Paris June 28-July 1st, 2009. Participants are not required to attend the KDDcup 2009 workshop, which will be held at the conference, and the workshop is open to anyone who registers. The proceedings of the competition will be published by the Journal of Machine Learning Research Workshop and Conference Proceedings (JMLR WC&P).

Anonymity: All entrants must identify themselves by registering on the KDDcup 2009 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the result tables instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes the participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".

Data: The dataset is available for download from the Data page to registered participants. The data are available in several archives to facilitate downloading, and two versions are made available ("small" with 230 variables, and "large" with 15,000 variables). The participants may enter results on either or both versions, which correspond to the same data entries, the 230 variables of the small version being just a subset of the 15,000 variables of the large version. Both training and test data are available without the true target labels. For practice purposes, "toy" training labels are available together with the training data from the onset of the challenge in the fast track. The results on toy targets (T) will not count for the final evaluation. The real training labels of the tasks "churn" (C), "appetency" (A), and "up-selling" (U) will be made available for download separately half-way through the challenge.

Challenge duration and tracks: The challenge starts March 10, 2009 and ends May 11, 2009. There are two challenge tracks:

FAST (large) challenge: Results submitted on the LARGE dataset within five days of the release of the real training labels will count towards the fast challenge.
SLOW challenge: Results on the small dataset and results on the large dataset not qualifying for the fast challenge, submitted before the KDDcup 2009 deadline May 11, 2009, will count toward the SLOW challenge.

If more than one submission is made in either track and with either dataset, the last submission before the track deadline will be taken into account to determine the ranking of participants and attribute the prizes. You may compete in both tracks. There are prizes in both tracks.

On-line feed-back: During the challenge, the training set performances are available on the Results page, as well as partial information on test set performances: the test set performances on the toy task (T) and performances on a fixed 10% subset of the test examples for the real tasks (C, A, U). After the challenge is over, the performances on the whole test set will be calculated and substituted in the result tables.

Submission method: Submissions are made via the form on the Submission page. To be ranked, submissions must comply with the Instructions. A submission should include results on both training and test sets for at least one of the tasks (T, C, A, U), but it may include results on several tasks. A submission will be considered "complete" and eligible for prizes if it contains 6 files corresponding to training and test data predictions for the tasks C, A, and U, either for the small or for the large dataset (or for both). Results on the practice task T will not count as part of the competition. If you encounter problems with the submission process, please contact the Challenge Webmaster. Multiple submissions are allowed, but please limit yourself to a maximum of 5 submissions per day. For your final entry in the slow track, you may submit results on either or both small and large datasets in the same archive (hence you get 2 chances of winning).

Evaluation and ranking: For each entrant, only the last valid entry will count towards determining the winner in each track (fast and slow). We limit each participating person to a single final entry in each track (see the FAQs page for the conditions under which you can work in teams). Valid entries must include results on all three real tasks. The method of scoring is posted on the Tasks page. Prizes will be attributed only to entries performing better than the baseline method (Naive Bayes). The results of the baseline method are provided on the Results page. These are not the best results obtained by the organization team at Orange; they are easy to outperform, but difficult to attain by chance.

Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help reproduce their results.

Data

KDD Cup 2009: Data

Data Download

Training and test data matrices and practice target values

The large dataset archives have been available since the onset of the challenge. The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available only for practice purposes. The prediction of the toy target values will not be part of the final evaluation.
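The column layout described above can be turned into index ranges as in the following sketch. The zero-based indexing, the `LAYOUTS` dictionary, and the function name are conventions chosen for this illustration, not part of the data distribution.

```python
# (number of numerical variables, number of categorical variables)
LAYOUTS = {
    "large": (14740, 260),
    "small": (190, 40),
}


def split_column_indices(version):
    """Return zero-based index ranges (numerical, categorical) for a version."""
    n_numerical, n_categorical = LAYOUTS[version]
    numerical = range(0, n_numerical)
    categorical = range(n_numerical, n_numerical + n_categorical)
    return numerical, categorical


for version in ("small", "large"):
    numerical, categorical = split_column_indices(version)
    print(version, len(numerical) + len(categorical), categorical[0])
```

For the large version the first categorical column therefore has zero-based index 14740, and the totals come to 230 and 15,000 variables respectively.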

Small version (230 var.):

orange_small_train.data.zip (8.2 Mbytes)
orange_small_test.data.zip (8.2 Mbytes)

Large version (15,000 var.):

orange_large_train.data.chunk1.zip (52.7 Mbytes)
orange_large_train.data.chunk2.zip (52.7 Mbytes)
orange_large_train.data.chunk3.zip (52.6 Mbytes)
orange_large_train.data.chunk4.zip (52.5 Mbytes)
orange_large_train.data.chunk5.zip (52.6 Mbytes)

orange_large_test.data.chunk1.zip (52.8 Mbytes)
orange_large_test.data.chunk2.zip (52.5 Mbytes)
orange_large_test.data.chunk3.zip (52.6 Mbytes)
orange_large_test.data.chunk4.zip (52.6 Mbytes)
orange_large_test.data.chunk5.zip (52.6 Mbytes)

Toy targets (large):

orange_large_train_toy.labels

True task labels

Real binary targets (small):

Real binary targets (large):

Data Format

The datasets use a format similar to the text export format of relational databases:

One header line with the variable names
One line per instance
Tab separator between the values
There are missing values (consecutive tabs)

The large matrix results from appending the downloaded chunks in the order of their chunk numbers. The header line is present only in the first chunk.
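The format can be read with a few lines of code. The sketch below is one possible reader, not an official loader: it assumes the chunks have already been unzipped, that they are UTF-8 compatible text, and that an empty field denotes a missing value. The function name `read_chunks`, the temporary file names, and the variable names in the toy files are all hypothetical.

```python
import os
import tempfile


def read_chunks(chunk_paths):
    """Return (header, rows) from tab-separated chunks given in chunk order.

    Only the first chunk is assumed to carry the header line.
    Empty fields (consecutive tabs) are returned as None.
    """
    header = None
    rows = []
    for chunk_index, path in enumerate(chunk_paths):
        with open(path, "r", encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                text = line.rstrip("\r\n")
                if text == "":
                    continue  # skip a stray blank line, e.g. at end of file
                fields = text.split("\t")
                if chunk_index == 0 and line_number == 0:
                    header = fields
                    continue
                rows.append([f if f != "" else None for f in fields])
    return header, rows


# Tiny synthetic chunks, for demonstration only.
with tempfile.TemporaryDirectory() as folder:
    paths = [os.path.join(folder, "toy.chunk%d" % i) for i in (1, 2)]
    with open(paths[0], "w", encoding="utf-8") as f:
        f.write("Var1\tVar2\tVar3\n1.5\t\tabc\n")
    with open(paths[1], "w", encoding="utf-8") as f:
        f.write("\t2\txyz\n")
    header, rows = read_chunks(paths)

print(header)  # ['Var1', 'Var2', 'Var3']
print(rows)    # [['1.5', None, 'abc'], [None, '2', 'xyz']]
```

Values are kept as strings here; converting the numerical columns is left to the caller. For the full large dataset, holding every row in memory as Python lists may be impractical, so a streaming variant may be preferable.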

The target values (.labels files) have one example per line in the same order as the corresponding data files. Note that churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples.
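A label file in this format can be parsed and sanity-checked as sketched below. The sketch works on an in-memory list of lines so that it stays self-contained; the function names are illustrative, and accepting both "1" and "+1" as the positive label is an assumption about how the value may be written.

```python
def parse_labels(lines):
    """Parse one +1/-1 target value per line, in data-file order."""
    labels = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        value = int(text)  # int() accepts both "1" and "+1"
        if value not in (1, -1):
            raise ValueError("unexpected target value: %r" % text)
        labels.append(value)
    return labels


def positive_fraction(labels):
    """Proportion of positive (+1) examples."""
    return sum(1 for value in labels if value == 1) / len(labels)


# Synthetic example; with the real files, len(labels) should equal
# the number of data lines in the corresponding data file.
labels = parse_labels(["-1\n", "+1\n", "-1\n", "-1\n"])
print(labels)                     # [-1, 1, -1, -1]
print(positive_fraction(labels))  # 0.25
```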

The Matlab matrices are numeric. When loaded, the data matrix is called X. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables while they are mapped to 0 for categorical variables.

Results

KDD Cup 2009: Results

Winners of KDD Cup 2009: Fast Track

First Place: IBM Research
Ensemble Selection for the KDD Cup Orange Challenge

First Runner Up: ID Analytics, Inc
KDD Cup Fast Scoring on a Large Database

Second Runner Up: Old dogs with new tricks (David Slate, Peter W. Frey)

Winners of KDD Cup 2009: Slow Track

First Place: University of Melbourne
University of Melbourne entry

First Runner Up: Financial Engineering Group, Inc. Japan
Stochastic Gradient Boosting

Second Runner Up: National Taiwan University, Computer Science and Information Engineering
Fast Scoring on a Large Database using regularized maximum entropy model,
categorical/numerical balanced AdaBoost and selective Naive Bayes

Full Results: Fast Track

Rank	Team Name	Method	Churn AUC	Appetency AUC	Upselling AUC	Score
1	IBM Research	Final Submission	0.7611	0.8830	0.9038	0.8493
2	ID Analytics, Inc	DT	0.7565	0.8724	0.9056	0.8448
3	Old dogs with new tricks	Our own method	0.7541	0.8740	0.9050	0.8443
4	Crusaders	Joint Score Technique	0.7569	0.8688	0.9034	0.8430
5	Financial Engineering Group, Inc. Japan	boosting	0.7498	0.8732	0.9057	0.8429
6	LatentView Analytics	Boosting	0.7579	0.8670	0.9034	0.8428
7	Data Mining	Logistic	0.7580	0.8659	0.9034	0.8424
8	StatConsulting (K.Ciesielski, M.Sapinski, M.Tafil)	AdvancedMiner	0.7544	0.8723	0.8997	0.8421
9	Sigma	Decision Tree Algo	0.7568	0.8644	0.9034	0.8415
10	Analytics	CART	0.7564	0.8644	0.9034	0.8414
11	Ming Li & Yuwei Zhang	me	0.7507	0.8683	0.9050	0.8413
12	Hungarian Academy of Sciences	fri4	0.7496	0.8683	0.9042	0.8407
13	Oldham Athletic Reserves	tiberius10	0.7492	0.8699	0.9026	0.8406
14	Swetha	Logistic	0.7550	0.8659	0.8996	0.8401
15	VladN	vnf8c	0.7415	0.8692	0.9012	0.8373
16	VADIS	Bagging	0.7474	0.8631	0.8994	0.8366
17	brendano	random forests (res11)	0.7468	0.8627	0.9003	0.8366
18	commendo	1 before noon	0.7381	0.8693	0.8988	0.8354
19	FEG CTeam	Boosting	0.7389	0.8616	0.9011	0.8338
20	Vadis Team 2	Best final	0.7442	0.8568	0.8996	0.8335
21	National Taiwan University, Computer Science and Information Engineering	all2	0.7428	0.8679	0.8890	0.8332
22	Kranf	TIM	0.7463	0.8478	0.8980	0.8307
23	Neo Metrics	final2	0.7454	0.8449	0.8994	0.8299
24	ooo	10-3	0.7427	0.8520	0.8920	0.8289
25	TonyM	mymethod5	0.7397	0.8481	0.8988	0.8289
26	AIIALAB	ensemble	0.7413	0.8458	0.8969	0.8280
27	Uni Melb	hfinal	0.7087	0.8669	0.8996	0.8251
28	Christian Colot	My GoldMiner	0.7183	0.8577	0.8958	0.8240
29	Céline Theeuws	final	0.7346	0.8476	0.8835	0.8219
30	m&m	final test	0.7218	0.8423	0.8924	0.8189
31	Predictive Analytics	Logistic	0.7131	0.8336	0.8917	0.8128
32	DKW	NN / Logistic Regression on Laptop	0.6980	0.8449	0.8928	0.8119
33	NICAL	Dys	0.7108	0.8461	0.8707	0.8092
34	UW	eq+uneq	0.6804	0.8531	0.8815	0.8050
35	Prem Swaroop	thmdkd4	0.6972	0.8384	0.8794	0.8050
36	Dr. Bunsen Honeydew	submission #004	0.7048	0.8235	0.8760	0.8015
37	dodio	L2	0.7179	0.8474	0.8356	0.8003
38	FEG D TEAM	mix2	0.6997	0.8139	0.8824	0.7987
39	minos	rdf	0.6828	0.8233	0.8698	0.7920
40	M	Release1	0.7289	0.8341	0.8053	0.7894
41	dataminers	Ensemble Model 2	0.6850	0.8288	0.8205	0.7781
42	Weka1	final	0.6795	0.7727	0.8764	0.7762
43	idg	b_1	0.6851	0.7931	0.8458	0.7747
44	HP Labs - Analytics Research	lrs	0.6414	0.8042	0.8607	0.7687
45	hyperthinker	L1-regularization	0.6770	0.7822	0.8386	0.7659
46	vodafone	b	0.6819	0.7216	0.8917	0.7651
47	paberlo	Method10	0.6717	0.7544	0.8451	0.7571
48	Lenca	test	0.6713	0.7493	0.8456	0.7554
49	C.A.Wang	Bagging	0.5956	0.8300	0.8369	0.7541
50	FEG B	Naive Bayes & Logit	0.6499	0.8317	0.7777	0.7531
51	rw	lst	0.6368	0.7045	0.8070	0.7161
52	Tree Builders	VA - NN	0.6358	0.6583	0.7918	0.6953
53	Leo	Naive Coding	0.5928	0.5544	0.7314	0.6262
54	ZhiGao	Zhigao5	0.5425	0.5431	0.5774	0.5543
55	homehome	etc	0.5835	0.3876	0.6290	0.5334
56	decaff	zzz	0.5288	0.5009	0.5608	0.5302
57	Claminer	only churn	0.5731	0.5095	0.5055	0.5294
58	Klimma	simple	0.5034	0.5025	0.4965	0.5008
59	Reference	Random predictions	0.5030	0.4889	0.5069	0.4996

Full Results: Slow Track

Rank	Team Name	Method	Churn AUC	Appetency AUC	Upselling AUC	Score
1	IBM Research	Submission	0.7651	0.8819	0.9092	0.8521
2	Uni Melb	The generally satisfactory model	0.7570	0.8836	0.9048	0.8484
3	ID Analytics, Inc	DT with bagging (large + small)	0.7614	0.8761	0.9061	0.8479
4	Financial Engineering Group, Inc. Japan	boosting	0.7589	0.8768	0.9074	0.8477
5	National Taiwan University, Computer Science and Information Engineering	all_final	0.7558	0.8789	0.9036	0.8461
6	Hungarian Academy of Sciences	last_after_last	0.7567	0.8736	0.9065	0.8456
7	Neo Metrics	FINAL	0.7521	0.8756	0.9059	0.8445
8	Ming Li & Yuwei Zhang	me	0.7512	0.8744	0.9059	0.8439
9	Data Mining	Multiple Techniques	0.7574	0.8700	0.9036	0.8437
10	Tree Builders	Sub2 Better Variables	0.7552	0.8736	0.9021	0.8436
11	dataminers	Combined Model Submission 1	0.7553	0.8736	0.9016	0.8435
12	Oldham Athletic Reserves	Tiberius Data Mining Algorithms	0.7525	0.8720	0.9028	0.8424
13	Swetha	Logistic	0.7580	0.8652	0.9038	0.8423
14	Analytics	Decision Tree Algo	0.7559	0.8691	0.9014	0.8421
15	Old dogs with new tricks	Our own method	0.7488	0.8730	0.9040	0.8419
16	Sigma	Enemble classifier	0.7580	0.8636	0.9041	0.8419
17	Weka1	finalRules	0.7477	0.8727	0.9049	0.8418
18	Predictive Analytics	Adaptive Boosting Algorithm	0.7579	0.8676	0.8995	0.8416
19	LatentView Analytics	Segmented Joint Score	0.7578	0.8675	0.8995	0.8416
20	Crusaders	CART + Combination Logic	0.7579	0.8660	0.8995	0.8411
21	HP Labs - Analytics Research	adds	0.7500	0.8653	0.9049	0.8401
22	VladN	vn_large	0.7484	0.8597	0.9040	0.8373
23	brendano	random forests (res11)	0.7468	0.8627	0.9003	0.8366
24	VADIS	Final Slow	0.7442	0.8631	0.9013	0.8362
25	AIIALAB	merge	0.7467	0.8551	0.8963	0.8327
26	ooo	3	0.7434	0.8583	0.8936	0.8318
27	m&m	final test	0.7434	0.8486	0.8984	0.8301
28	Vadis Team 2	test1	0.7381	0.8493	0.9013	0.8296
29	commendo	a2 bag 10	0.7321	0.8637	0.8920	0.8293
30	Kranf	The Intelligent Mining Machine(TIM)	0.7369	0.8434	0.8963	0.8255
31	Christian Colot	My GoldMiner	0.7183	0.8577	0.8958	0.8240
32	UW	Final-2	0.7171	0.8455	0.8927	0.8184
33	AI	Kahu	0.7255	0.8408	0.8872	0.8179
34	NICAL	Dys	0.7108	0.8461	0.8707	0.8092
35	creon	boosting combinations	0.7359	0.8268	0.8615	0.8081
36	LosKallos	bit regurb	0.7398	0.8204	0.8621	0.8074
37	FEG_BOSS	Boosting	0.7406	0.8149	0.8621	0.8058
38	Lenca	RAE	0.7348	0.8175	0.8629	0.8051
39	Prem Swaroop	thmdkd4	0.6972	0.8384	0.8794	0.8050
40	M	final	0.7319	0.8153	0.8644	0.8038
41	FEG ATeam	logit	0.7325	0.8160	0.8610	0.8031
42	pavel	combfinal	0.7358	0.8130	0.8591	0.8027
43	Additive Groves	Additive Groves	0.7135	0.8311	0.8605	0.8017
44	nikhop	bteqwcomb	0.7359	0.8098	0.8589	0.8015
45	mi	a	0.7365	0.8090	0.8569	0.8008
46	java.lang.OutOfMemoryError	weka2	0.7360	0.8090	0.8572	0.8007
47	dodio	L2	0.7179	0.8474	0.8356	0.8003
48	Lajkonik	final submission	0.7323	0.8073	0.8600	0.7999
49	FEG CTeam	test2	0.7321	0.8062	0.8596	0.7993
50	Céline Theeuws	test 7	0.7230	0.8147	0.8584	0.7987
51	zlm	dt	0.7232	0.8175	0.8544	0.7983
52	FEG B	submit8	0.7354	0.8031	0.8544	0.7976
53	CSN	cs_nott4	0.7282	0.8051	0.8594	0.7976
54	TonyM	final	0.7249	0.7996	0.8596	0.7947
55	Sundance	BT	0.7244	0.8172	0.8400	0.7939
56	minos	rdf	0.6828	0.8233	0.8698	0.7920
57	Miner12	Model56	0.7264	0.7973	0.8484	0.7907
58	FEG D TEAM	logit+tree	0.7219	0.8077	0.8422	0.7906
59	DKW	Logistic Regression with interactions	0.7153	0.8015	0.8547	0.7905
60	idg	disc3	0.7146	0.8041	0.8507	0.7898
61	homehome	test572	0.7176	0.8062	0.8416	0.7885
62	Mai Dang	Boosting GS10+NN+KR	0.7167	0.8099	0.8372	0.7880
63	parramining.blogspot.com	Basic play	0.7134	0.8056	0.8438	0.7876
64	bob	thirdM	0.7053	0.8052	0.8520	0.7875
65	muckl	final2	0.7239	0.8195	0.8180	0.7871
66	C.A.Wang	Bagging	0.7067	0.8043	0.8502	0.7871
67	KDD@PT	c_a_u_	0.7081	0.7989	0.8528	0.7866
68	decaff	zzz	0.7120	0.7916	0.8498	0.7845
69	StatConsulting (K.Ciesielski, M.Sapinski, M.Tafil)	Original232Vars	0.7137	0.7605	0.8501	0.7748
70	Dr. Bunsen Honeydew	final submission	0.7170	0.8052	0.7954	0.7725
71	Raymond Falk	Orange_Small_Results_KDD2009_OptInference	0.6905	0.7744	0.8465	0.7705
72	vodafone	smallstm3.3	0.7258	0.6915	0.8582	0.7585
73	paberlo	Method10	0.6717	0.7544	0.8451	0.7571
74	K2	Btest	0.7078	0.7670	0.7931	0.7560
75	Claminer	class1	0.6665	0.7785	0.8199	0.7550
76	rw	t	0.7257	0.6928	0.8369	0.7518
77	Leo	Naive Coding	0.6528	0.7504	0.7693	0.7242
78	Persistent	Hello_Theta	0.6416	0.7167	0.7370	0.6984
79	Louis Duclos-Gosselin	Personal Algorithm	0.6168	0.7571	0.6792	0.6843
80	Chy	LDA	0.6027	0.7201	0.6936	0.6721
81	Abo-Ali	Weka	0.6249	0.6425	0.7218	0.6630
82	sduzx	Straft	0.6057	0.6465	0.6167	0.6230
83	MT	finnaly	0.5494	0.5378	0.6873	0.5915
84	Shiraz University - Undergradute Team	lazy	0.5077	0.5047	0.7000	0.5708
85	ZhiGao	Zhigao5	0.5425	0.5431	0.5774	0.5543
86	Klimma	tree	0.5283	0.5231	0.5909	0.5474
87	hyperthinker	knn	0.5000	0.5090	0.5000	0.5030
88	Reference	Random predictions	0.4997	0.5057	0.5025	0.5026
89	thes	bz	0.5016	0.4993	0.4982	0.4997

FAQS

KDD Cup 2009: FAQs

Participation and Registration

What is the goal of the challenge?

The challenge consists of several classification problems. The goal is to make the best possible predictions of a binary target variable from a number of predictive variables.

Can I enter under multiple names?

No, we limit each participant to one final entry, which may contain results on the large dataset only in the fast track and on either or both the small and the large dataset in the slow track. Registering under multiple names would be considered cheating and would disqualify you. Your real identity must be known to the organizers. You may hide your identity from the outside only by checking the "Make my profile anonymous" box in the registration form.

Can I participate in multiple teams?

No. Each individual is allowed to make only a single final entry into the challenge to compete towards the prizes. During the development period, each team must have a different registered team leader. To be ranked in the challenge and qualify for prizes, each registered participant (individual or team leader) will have to disclose the names of any team members before the final results of the challenge are released. Hence, at the end of the challenge, you will have to choose which team you want to belong to (only one!) before the results are publicly released. After the results are released, no change in team composition will be allowed.

I understand that one person can join only one team, however, is it ok to have many teams in the same organization?

Yes, it is OK. Each team leader must be a different person and must register, and the teams cannot intersect. Before the end of the challenge the team leaders will have to declare the composition of their team. This will have to correspond to the list of co-authors in the proceedings, if they decide to publish their results. Hence a professor cannot have his/her name on all of his/her students' papers (but can be thanked in the acknowledgements).

How do I register a team?

Only register the team leader and choose a nickname for your team. We'll let you know later how to disclose the members of your team.

Can the organizers enter the challenge?

No. The organizers may make entries under the common name "Reference" to stimulate the competition, but they do not compete towards the prizes.

Data: Download, format, etc.

I have problems with the ZIP files which appear to be corrupted. Can I get a DVD?

Try to download one archive at a time. If the problem persists, contact the organizers so they can send you a DVD.

Are the data available in other formats: Matlab, SAS, etc.?

There are several Matlab versions posted on the Forum. There is also a numerical version of the categorical variables in text format for the large dataset. Please post your own version of the data to share it with others.

Is there sample code available?

Yes. We made sample Matlab code available to help you format your results. There are also examples showing how to call CLOP models from that code. AT THIS STAGE THERE IS NOT YET MATLAB SUPPORT FOR HANDLING THE LARGE DATASET.

Are the true targets distributed similarly to the toy target?

No. The toy target is generated by an artificial stochastic process. The proportion of examples in either class is different in the real targets. The real targets have less than 10% of examples in the positive class.

I have observed that the last columns (after variable 14740) are not numerical, are the data corrupted?

The last variables are categorical variables. The strings correspond to category codes. This could be for instance a city name. But for reasons of privacy, the real names were replaced by strings that are meaningless.

I have observed that some columns are empty or constant, are the data corrupted?

No. This is correct and part of the challenge, which deals with automatic data preparation and modeling in the context of real industrial data. Filtering constant data is the easy part of the challenge.
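One simple way to filter such columns is sketched here. It treats a column as uninformative when it has fewer than two distinct non-missing values, with missing values represented as None; this criterion is an assumption of the sketch, and a column with a single value plus missing entries could arguably still carry information through its missingness pattern. The function name is hypothetical.

```python
def non_constant_columns(rows):
    """Indices of columns with at least two distinct non-missing values."""
    n_columns = len(rows[0])
    distinct = [set() for _ in range(n_columns)]
    for row in rows:
        for index, value in enumerate(row):
            if value is not None:
                distinct[index].add(value)
    return [index for index, seen in enumerate(distinct) if len(seen) > 1]


# Synthetic rows: column 0 varies, column 1 is empty,
# column 2 is constant, column 3 is constant with a missing value.
rows = [
    ["1", None, "x", "5"],
    ["2", None, "x", None],
    ["3", None, "x", "5"],
]
kept = non_constant_columns(rows)
print(kept)  # [0]
```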

I have observed that the first chunk of the large dataset contains only 9999 lines, is this correct?

Yes. Chunk 1 contains 9999 data lines plus the header. All other chunks have no header. The last chunk has 10001 lines. So the total is 50000 lines of data.

In the categorical variables, do the values need to be handled as meaningful sequences or are they just codes?

The original categorical values were symbols that did not indicate any category ordering. The category symbols have been replaced by random anonymized values (strings) with no semantic meaning, in one-to-one correspondence with the original values so as to keep the structure of the data.

Do the targets correspond to single or multiple products?

The targets correspond to single products (but not necessarily the same one). For instance, churn concerns mobile phone customers switching providers, and up-selling concerns upgrading a plan to include television.

Is there a meaning in the variable ordering?

No. The variables are randomly ordered.

Are the variables in the small dataset a subset of those in the large dataset?

Yes. However, they are disguised to make them non-trivial to identify and to discourage people from doing so. The examples are also ordered differently to make such a mapping even harder. We wish participants to work on each dataset separately, although they may work on both.

Are the training and test data drawn from the same distribution?

Yes.

Is the set of categorical variable values the same in the training and test data?

Not necessarily. Some values might show up only in training data or only in test data.
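Any encoding of the categorical variables therefore has to cope with values never seen in training. The sketch below shows one possible convention, which is an assumption of this example and not a recommendation from the organizers: codes are learned on the training values only, missing values map to 0, and every value unseen in training shares one reserved code.

```python
def fit_category_codes(train_values):
    """Assign integer codes 1..k to the distinct non-missing training values."""
    codes = {}
    for value in train_values:
        if value is not None and value not in codes:
            codes[value] = len(codes) + 1
    return codes


def encode(values, codes):
    """Encode values; missing -> 0, unseen in training -> k + 1."""
    unseen_code = len(codes) + 1
    encoded = []
    for value in values:
        if value is None:
            encoded.append(0)
        else:
            encoded.append(codes.get(value, unseen_code))
    return encoded


# Synthetic category strings, for illustration only.
train = ["a", "b", None, "a"]
test = ["b", "z", None]
codes = fit_category_codes(train)
print(encode(train, codes))  # [1, 2, 0, 1]
print(encode(test, codes))   # [2, 3, 0]
```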

Are there the same number of values in each line?

There can be missing values. The values are separated by tabulations. Two consecutive tabs indicate a missing value.

Is it allowed to unscramble the small dataset?

Scrambling was done to encourage the participants to work separately on the small dataset and the big dataset. If we wanted the participants to be able to use the features of the small dataset in addition to those they might select from the big one, we would not have scrambled the data. We realize however that, if we forbid the participants from unscrambling and consider it cheating, we would have difficulties enforcing that rule. Hence, participants who unscramble the small dataset will not be disqualified from the competition. All participants will be requested to report at the end of the challenge whether they made use of unscrambling and whether they derived some advantage from it.

Evaluation: Tracks, submission format, etc.

Why do we need to submit results on training data?

In this way we can assess the robustness of the models. If you make great predictions on training data and perform poorly on test data, your method likely is overfitting.

What is the purpose of giving performances on 10% of the test data?

We want to give feed-back to the participants to motivate them. In this way, they can see roughly how their performance compares to that of others. But, by giving feed-back on only 10% of the data, we prevent them from fine-tuning their system using the test data (i.e. de facto "learning" from the test data). There will be a slight bias in performance because of the 10% on which feed-back is provided, but it is the same bias for all contestants.

Is it correct that even if I submit the result on the large dataset in the fast track, I can submit the result on the large dataset in the slow track together with that on the small dataset?

Yes. In fact, you may submit as many times as you want. But only the last complete entry (with churn, appetency, and up-selling results on both training and test data = 6 files) will count in each track, depending on the submission date. In the fast track, you may enter only large dataset results, so you get 1 chance. In the slow track, you may enter on both small and large datasets, so you get 2 chances (the best of your 2 results will be taken into account). In total, you get 3 chances of winning.
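The selection logic described in this answer can be sketched as follows. This is only an illustration of the rule as stated here, not the organizers' ranking code: the record layout, the integer timestamps, the (task, split) pairs standing in for the 6 files, and the inclusive deadline comparison are all assumptions of the sketch.

```python
TASKS = ("churn", "appetency", "upselling")
SPLITS = ("train", "test")
REQUIRED_FILES = {(task, split) for task in TASKS for split in SPLITS}


def is_complete(files):
    """True if all 6 (task, split) prediction files are present."""
    return REQUIRED_FILES <= set(files)


def counted_entry(submissions, deadline):
    """Last complete submission made no later than the track deadline."""
    eligible = [
        s for s in submissions
        if s["time"] <= deadline and is_complete(s["files"])
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["time"])


def slow_track_score(small_score=None, large_score=None):
    """Best of the (up to) two slow-track results."""
    scores = [s for s in (small_score, large_score) if s is not None]
    return max(scores) if scores else None


submissions = [
    {"time": 1, "files": REQUIRED_FILES},
    {"time": 2, "files": {("churn", "train"), ("churn", "test")}},
    {"time": 3, "files": REQUIRED_FILES},
    {"time": 9, "files": REQUIRED_FILES},  # after the deadline
]
entry = counted_entry(submissions, deadline=5)
print(entry["time"])                                          # 3
print(slow_track_score(small_score=0.80, large_score=0.84))  # 0.84
```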

If I submit results on both the small and large datasets in the slow track, how will results be evaluated?

The best of your 2 results will be taken into account.

Both small and large entries compete for the slow prize, but they seem to correspond to two distinct problems? Shouldn't there be two slow track prizes?

The small dataset is a downsized version of the large one: same examples, a subset of the features. To distinguish the two, the examples were ordered differently and the features were coded differently, in a way that should not affect performance but makes it non-obvious to descramble. Because of the (unlikely) possibility that someone would spend time descrambling, we decided to give a single prize in the slow challenge, so as not to encourage people to cheat.

If I submit results before the fast track deadline, will those results also enter the slow track if I submit nothing afterwards?

Yes. For each deadline, your last valid complete entry will be entered in the ranking. So if you submit only to the fast track, your results will automatically be entered in the slow track.

If I win in both tracks, will I accumulate prizes?

No. You will get the largest of the two prizes. The remaining money will be used to give travel grants to other deserving participants.

On the result page, there is a "Score" column in the table, what does it mean?

As explained on the Tasks page, the score is the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling).

I see a bunch of xxxx instead of my score, is there a problem?

No. Until the data labels of the tasks of the challenge are released, if people submit something on those tasks, they cannot see results to prevent them from gaining information by guessing. Only results on the toy problem are shown. You may still practice submitting some random values to test the system, but you will not see the results.

DISCLAIMER

Can a participant give an arbitrary hard time to the organizers?

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ORANGE AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ORANGE AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.

Organisation

KDD Cup 2009: Customer relationship prediction