Applied Data Analysis (ADA)

The project is an important graded component of the course. It allows you to choose a dataset and a question of interest, and to run an analysis all the way to communicating its results. The project is done in the same 4-person groups you formed for the homework.

Note: When in doubt, please refer to the Project Intro slides.

Schedule

The schedule for the projects is as follows:

Oct. 28, 2019 (23:59 CET): milestone 1 (15%): the project repo contains a README describing your project idea (title, abstract, questions, dataset, milestones, according to a provided skeleton).
Nov. 25, 2019 (23:59 CET): milestone 2 (15%): the project repo contains a notebook with data collection and descriptive analysis, properly commented, and the notebook ends with a more structured and informed plan for what comes next (all the way to a plan for the presentation). These sections of the notebook should be filled in by milestone 3.
Dec. 20, 2019 (23:59 CET): report (milestone 3) (50%): a 4-page PDF document or a data story in a platform of your choice (e.g., a blog post, or directly in GitHub), plus the final notebook (continuation of milestone 2).
Jan. 20, 2020 (13:00-17:00, BC Atrium): presentation (20%) of posters and (optionally) whatever else tickles your fancy (e.g., on-screen demos). See below for details.

Therefore, the bulk of your work should be done before Christmas, so that you can then focus on the exam and the presentation (and on the exams of other classes).

Selecting a project (milestone 1)

Your first task is to select a project. This year’s theme is “data science for social good”, so think about how you could improve society through data analysis! We provide a variety of datasets that you can choose from. Although we provide these datasets and have checked them reasonably in advance, we are not familiar with all of them in detail; it is your responsibility to check that what you propose is feasible given the datasets at hand. You may also work with datasets not provided by us, but unless you can strongly convince us that you know what you’re doing and are motivated to achieve it, you’re asked to use one of the provided datasets.

For the first deadline, you’re asked to clone/fork the project skeleton repository and keep your project folder in your group’s repository. Then, you need to fill in the README following its structure (details are given there). After the deadline, we will read all project ideas and come back to you if we think there is a problem with your proposal. You’ll also have time during labs to ask about projects, and we will showcase projects from last year’s run of the class. We will then devote lab time to feedback on the project ideas.

We will evaluate this milestone according to how clear, reasonable and well thought-through the project idea and the dataset choices are. Please use the first milestone to really check with us that everything is in order with your project (idea, data, feasibility, etc.) before you advance too much with the next milestone!

Data collection and description (milestone 2)

The second task is to intimately acquaint yourself with the data, preprocess it and complete all the necessary descriptive statistics tasks. We expect you to have a pipeline in place, fully documented in a notebook, and show us that you’ve advanced with your understanding of the project goals by updating its README description.

When describing the data, in particular, you should show (non-exhaustive list):

That you can handle the data at its full size.
That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
That you considered ways to enrich, filter, and transform the data according to your needs.
That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.
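
As an illustration of the kind of descriptive pass listed above (missing values, summary statistics, correlations), here is a minimal sketch using only the Python standard library. The toy CSV, the column names `age` and `income`, and the helper names `load_columns`, `describe` and `pearson` are all hypothetical and chosen for this example only; they are not part of the project skeleton. In a real notebook you would most likely rely on a data analysis library such as pandas instead, and on whatever tooling your dataset’s size requires.

```python
import csv
import io
import math
import statistics

# Hypothetical toy data standing in for a real dataset.
# Empty cells represent missing values.
RAW_CSV = """age,income
23,31000
35,52000
,48000
51,
44,61000
"""


def load_columns(text):
    """Parse CSV text into {column name: [float or None, ...]}."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            columns[name].append(float(cell) if cell.strip() else None)
    return columns


def describe(values):
    """Summary statistics for one column, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


columns = load_columns(RAW_CSV)
for name, values in columns.items():
    print(name, describe(values))
print("corr(age, income) =", round(pearson(columns["age"], columns["income"]), 3))
```

Note that the sketch drops missing values separately per column for the summary statistics, but pairwise for the correlation; whichever convention you pick, state it explicitly in your notebook, since it affects the numbers you report.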

We will evaluate this milestone according to how well these previous steps (or other reasonable ones) have been done and documented, the quality of the code and its documentation, the feasibility and critical awareness of the project.

Report (milestone 3)

The report can take two very different forms:

A 4-page, double-column PDF report (4 pages excluding references; you can find the LaTeX file to use in the project skeleton repository), following a standard structure (where applicable): abstract, introduction, related work, (brief) data collection, dataset description with summary statistics, methods with math and description of main algorithms, results and findings, conclusions. This report will be evaluated according to how clearly and succinctly it is written, whether the style is appropriate (incl. proper figures etc.), whether it contains all relevant content, and how convincing the results of the analysis are.
A data story: Data stories take the form of a blog post or short article, with an important visual component, using data to tell a story and illustrate it effectively. You can be less formal here (although methods and math should then appear in the notebook), but more visual. You can pick your preferred platform option, but we encourage you to use Jekyll. You can either submit the story in an appropriate standalone format, or a link to it online. Examples of data stories will be given during laboratories (also see examples here and here).

In both cases, a (single!) supporting notebook (ideally extending the one delivered for milestone 2) is also expected and will be graded. For milestone 3, the README should be updated to detail the contributions of all group members (including who will work on the final presentation).

Example
John: Plotting graphs during data analysis, crawling the data, preliminary data analysis;
Mary: Problem formulation, coming up with the algorithm;
Chris: Coding up the algorithm, running tests, tabulating final results;
Eve: Writing up the report or the data story, preparing the final presentation.

Presentation

You can use a poster (of any shape strictly larger than A4, knowing that readability is a factor) plus optionally anything you want (e.g. a website, an installation, an on-screen demo, etc.). Presentation: max 3 minutes plus some questions; ideally one person talks (this is more efficient). Try to rehearse in advance: there are around 90 projects, so we will need to interrupt you if you exceed the allotted time. We will evaluate the quality and clarity of the poster and its presentation, respect of timing, plus any extras you might have included. Please try to be visual: the ideal poster conveys its main ideas in a minute. Here is a gallery of good posters and not so good posters from previous years.

Poster logistics

Location and timing: Jan. 20, BC Atrium, 13:00-17:00.
You are advised to arrive with your poster at least 10 minutes in advance, as we will start at 13:00. We will be there to help you.
Printing and paying: self-managed (please try to plan in advance and not invade the reprographie the day before!); cf. https://repro.epfl.ch.
You’ll be given a detailed schedule in due course, so you can anticipate more or less when you’ll need to be there.
There will be 5 best-project awards.
How the awards will work: We will select 10 finalist groups based on Milestones 1-3 (they will be notified in advance so that they can prepare). The 10 finalists will give brief, 3-minute lightning talks at the poster session, which everyone can attend. The 5 winning groups will be chosen by the ADA team and announced immediately after the poster session.