I have problems with the ZIP files which appear to be corrupted. Can I get a DVD?
Try to download one archive at a time. If the problem persists, contact the organizers so they can send you a DVD.
Are the data available in other formats: Matlab, SAS, etc.?
There are several Matlab versions posted on the Forum. There is also a numerical version of the categorical variables in text format for the large dataset. Please post your own version of the data to share it with others.
Is there sample code available?
Yes. We have made sample Matlab code available to help you format your results. There are also examples showing how to call CLOP models from that code. AT THIS STAGE THERE IS NOT YET MATLAB SUPPORT FOR HANDLING THE LARGE DATASET.
Are the true targets distributed similarly to the toy target?
No. The toy target is generated by an artificial stochastic process. The proportion of examples in either class is different in the real targets. The real targets have less than 10% of examples in the positive class.
I have observed that the last columns (after variable 14740) are not numerical. Are the data corrupted?
The last variables are categorical variables. The strings correspond to category codes. This could be for instance a city name. But for reasons of privacy, the real names were replaced by strings that are meaningless.
I have observed that some columns are empty or constant. Are the data corrupted?
No. This is correct and is part of the challenge, which deals with automatic data preparation and modeling in the context of real industrial data. Filtering out constant data is the easy part of the challenge.
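As an illustration of that easy part, here is a minimal Python sketch that finds and drops columns carrying a single distinct value. It assumes the rows have already been split into lists of strings, with an empty string standing for a missing value. The function names and the sample rows are hypothetical and are not part of the challenge code.

```python
def uninformative_columns(rows):
    """Return indices of columns that hold one distinct value across all rows.

    An entirely empty column counts too, since its only value is "".
    """
    n_cols = len(rows[0])
    seen = [set() for _ in range(n_cols)]
    for row in rows:
        for j, value in enumerate(row):
            seen[j].add(value)
    return [j for j, values in enumerate(seen) if len(values) <= 1]


def drop_columns(rows, to_drop):
    """Return a copy of rows without the columns listed in to_drop."""
    drop = set(to_drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]


# Made-up rows: column 1 is empty, column 3 is constant.
rows = [
    ["1", "", "a", "7"],
    ["2", "", "a", "7"],
    ["3", "", "b", "7"],
]
const = uninformative_columns(rows)
print(const)                      # [1, 3]
print(drop_columns(rows, const))  # [['1', 'a'], ['2', 'a'], ['3', 'b']]
```

Missing entries are treated here as a value of their own, so a column that mixes one value with missing entries is kept. Whether to keep such columns is a modeling choice.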
I have observed that the first chunk of the large dataset contains only 9999 lines. Is this correct?
Yes. Chunk 1 contains 9999 data lines plus the header. All other chunks have no header. The last chunk has 10001 lines. So the total is 50000 lines of data.
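The layout described above (a header in the first chunk only) can be handled with a small loader. The following Python sketch is one possible way to do it. The file names are hypothetical, and the demo writes two tiny made-up chunks to a temporary directory instead of reading the real archives.

```python
import os
import tempfile


def read_chunks(paths):
    """Return (header, rows) from chunk files where only the first has a header."""
    header, rows = None, []
    for i, path in enumerate(paths):
        with open(path, encoding="utf-8", newline="") as f:
            for line_no, line in enumerate(f):
                # Strip only the line ending so that trailing tabs survive.
                fields = line.rstrip("\r\n").split("\t")
                if i == 0 and line_no == 0:
                    header = fields
                else:
                    rows.append(fields)
    return header, rows


# Demo on two tiny made-up chunks (not the challenge data).
with tempfile.TemporaryDirectory() as tmp:
    contents = ["Var1\tVar2\n1\tx\n2\ty\n", "3\tz\n4\t\n"]
    paths = []
    for k, text in enumerate(contents, start=1):
        path = os.path.join(tmp, f"chunk{k}.txt")  # hypothetical file name
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        paths.append(path)
    header, rows = read_chunks(paths)

print(header)     # ['Var1', 'Var2']
print(len(rows))  # 4
```

On the real large dataset, a quick sanity check after loading all chunks is that the number of rows equals 50000.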
In the categorical variables, do the values need to be handled as meaningful sequences or are they just codes?
The original categorical values were symbols that did not indicate any category ordering. The category symbols have been replaced by random anonymized values (strings) with no semantic meaning, in one-to-one correspondence with the original values so as to keep the structure of the data.
Do the targets correspond to single or multiple products?
The targets correspond to single products (but not necessarily the same one). For instance, churn concerns mobile phone customers switching providers, and up-selling concerns upgrading a plan to include television.
Is there a meaning in the variable ordering?
No. The variables are randomly ordered.
Are the variables in the small dataset a subset of those in the large dataset?
Yes. However, they are disguised to make the mapping non-trivial to identify and to discourage people from attempting it. The examples are also ordered differently to make such a mapping even harder. We would like participants to work on each dataset separately, although they may work on both.
Are the training and test data drawn from the same distribution?
Yes.
Is the set of categorical variable values the same in the training and test data?
Not necessarily. Some values might show up only in training data or only in test data.
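One practical consequence is that any encoding learned on the training data needs a fallback for values it has never seen. The Python sketch below shows one simple option, assuming a reserved code for unseen categories. The function names, the reserved code 0, and the sample strings are all hypothetical.

```python
UNSEEN = 0  # assumed reserved code for categories absent from the training data


def fit_codes(train_values):
    """Map each distinct training category string to an integer code, from 1 up."""
    codes = {}
    for value in train_values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes


def encode(values, codes):
    """Encode values with the training mapping; unknown values get UNSEEN."""
    return [codes.get(value, UNSEEN) for value in values]


# Made-up anonymized category strings.
train = ["aX9", "q2L", "aX9", "zz0"]
test = ["q2L", "m7P", "aX9"]

codes = fit_codes(train)
print(encode(train, codes))  # [1, 2, 1, 3]
print(encode(test, codes))   # [2, 0, 1]
```

The integer codes are arbitrary labels. Since the categories carry no ordering, a model should not treat the codes as ordered quantities.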
Is there the same number of values in each line?
There can be missing values. The values are separated by tabs. Two consecutive tabs indicate a missing value.
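A minimal Python sketch of parsing one such line follows. The function name and the sample line are hypothetical. The sketch assumes that a trailing tab at the end of a line also marks a missing last value.

```python
def parse_line(line):
    """Split one tab-separated line; empty fields become None (missing)."""
    # Remove only the line ending: strip() would also eat trailing tabs,
    # and split() without an argument would merge consecutive tabs.
    fields = line.rstrip("\r\n").split("\t")
    return [field if field != "" else None for field in fields]


row = parse_line("12\t\t3.5\tabc\t\n")
print(row)  # ['12', None, '3.5', 'abc', None]
```

Splitting explicitly on the tab character keeps the field count the same on every line, so missing values stay aligned with their columns.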
Is it allowed to unscramble the small dataset?
Scrambling was done to encourage the participants to work separately on the small dataset and the big dataset. If we wanted the participants to be able to use the features of the small dataset in addition to those they might select from the big one, we would not have scrambled the data. We realize however that, if we forbid the participants from unscrambling and consider it cheating, we would have difficulties enforcing that rule. Hence, participants who unscramble the small dataset will not be disqualified from the competition. All participants will be requested to report at the end of the challenge whether they made use of unscrambling and whether they derived some advantage from it.