The project is the main grading item of the course. It will allow you to choose a dataset and a question of interest, and to run an analysis all the way to communicating its results. The project is done in the same 3-person groups you have formed for the homework.

Schedule

The schedule for the projects is as follows:

  • Oct. 19, 2018: datasets for projects are announced.
  • Nov. 04, 2018 (23:59 CET): milestone 1 (10%): the project repo contains a README describing your project idea (title, abstract, questions, dataset, milestones, according to a provided skeleton).
  • Nov. 25, 2018 (23:59 CET): milestone 2 (20%): the project repo contains a notebook with data collection and descriptive analysis, properly commented, and the notebook ends with a more structured and informed plan for what comes next (all the way to a plan for the presentation). These sections of the notebook should be filled in by milestone 3.
  • Dec. 16, 2018 (23:59 CET): report (50%): a 4-page PDF document or a data story in a platform of your choice (e.g., a blog post, or directly in GitHub), plus the final notebook (continuation of milestone 2).
  • Jan. 21, 2019 (13:00-17:00, BC Atrium): presentation (20%) of posters and (optionally) whatever else tickles your fancy (e.g., on-screen demos). See below for details.

Therefore, the bulk of your work should be over before Christmas, in order for you to focus on the exam and presentation (and exams of other classes).

Selecting a project (milestone 1)

Your first task is to select a project. This year’s theme is “data science for social good”, so think about how you could improve society through data analysis! We provide a variety of datasets that you can choose from. We also provide some example project ideas (below) for each dataset. You can pick a project idea already there, or even better, come up with a new one. Remember that these project ideas are just ideas: you’re expected to be inspired by them and develop them into a project. Although we provide these datasets and have checked them reasonably well in advance, we are not familiar with all of them; it is your responsibility to check that what you propose is feasible given the datasets at hand. You may also work with other datasets not provided by us, but unless you can strongly convince us that you know what you’re doing and are motivated to achieve it, you’re asked to use one of the provided datasets.

For the first deadline, you’re asked to clone/fork the project skeleton repository and keep your project folder in your group’s repository. Then, you need to fill in the README following its structure (details are given there). At the deadline, we will read all project ideas and we will come back to you if we think there is a problem with your proposal. You’ll also have time during labs to ask about projects, and we will showcase projects from last year’s run of the class. We will then devote lab time to feedback on the project ideas.

We will evaluate this milestone according to how clear, reasonable and well thought-through the project idea and the dataset choices are. Please use the first milestone to really check with us that everything is in order with your project (idea, data, feasibility, etc.) before you advance too much with the next milestone!

Data collection and description (milestone 2)

The second task is to intimately acquaint yourself with the data, preprocess it and complete all the necessary descriptive statistics tasks. We expect you to have a pipeline in place, fully documented in a notebook, and show us that you’ve advanced with your understanding of the project goals by updating its README description.

When describing the data, in particular, you should show (non-exhaustive list):

  • That you can handle the data at its full size.
  • That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
  • That you considered ways to enrich, filter, and transform the data according to your needs.
  • That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
  • That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

We will evaluate this milestone according to how well these previous steps (or other reasonable ones) have been done and documented, the quality of the code and its documentation, and the feasibility and critical awareness of the project.
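
As an illustration of the descriptive checks listed for this milestone, here is a minimal sketch using only the Python standard library. The toy CSV, its column names, and the helper functions (load_rows, describe, pearson) are all hypothetical and stand in for whatever dataset and tooling your group actually uses; in practice you will probably rely on a data-frame library in your notebook.

```python
import csv
import io
import math
from statistics import mean, median

# Hypothetical toy data standing in for a real project dataset.
SAMPLE_CSV = """country,year,articles,readers
CH,2016,120,3400
CH,2017,,3900
FR,2016,310,8100
FR,2017,295,
DE,2017,400,9900
"""


def load_rows(text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(value):
    """Return the value as a float, or None when it is missing."""
    return float(value) if value not in ("", None) else None


def describe(rows, column):
    """Count missing values and summarize the observed ones."""
    values = [to_float(r[column]) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
    }


def pearson(rows, col_x, col_y):
    """Pearson correlation over the rows where both columns are present."""
    pairs = [(to_float(r[col_x]), to_float(r[col_y])) for r in rows]
    pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = load_rows(SAMPLE_CSV)
for column in ("articles", "readers"):
    print(column, describe(rows, column))
print("correlation:", pearson(rows, "articles", "readers"))
```

Dropping rows with a missing value before computing the correlation is only one possible choice; whichever strategy you pick (dropping, imputing, flagging), document it in the notebook.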

Report (milestone 3)

The report can take two very different forms:

  1. A 4-page, double-column PDF report (4 pages excluding references; you can find the LaTeX file to use in the project skeleton repository), following a standard structure (where applicable): abstract, introduction, related work, (brief) data collection, dataset description with summary statistics, methods with math and description of main algorithms, results and findings, conclusions. This report will be evaluated according to how clearly and succinctly it is written, whether the style is appropriate (incl. proper figures etc.), whether it contains all relevant contents, and how convincing the results of the analysis are.
  2. A data story: Data stories take the form of a blog post or short article, with an important visual component, using data to tell a story and illustrate it effectively. You can be less formal here (although methods and math should then appear in the notebook), but more visual. You can pick your preferred platform option, but we encourage you to use Jekyll. You can either submit the story in an appropriate standalone format, or a link to it online. Examples of data stories will be given during laboratories (also see examples here and here).

In both cases, a (single!) supporting notebook (ideally extending the one delivered for milestone 2) is also expected and will be graded. The README in milestone 3 shall be updated detailing the contributions of all group members (including who will work on the final presentation).

Example

John: Plotting graphs during data analysis, crawling the data, preliminary data analysis; Mary: Problem formulation, writing up the report, coming up with the algorithm; Chris: Coding up the algorithm, running tests, tabulating final results.

Presentation

You can use a poster (of any shape strictly larger than A4, knowing that readability is a factor) plus optionally anything you want (e.g. a website, an installation, an on-screen demo, etc.). Presentation: max 3 minutes plus some questions, ideally with one person talking (more efficient). Try to rehearse in advance: there are 100 projects, so we will need to interrupt if you go over time. We will evaluate the quality and clarity of the poster and its presentation, respect of timing, plus any extras you might have included. Please try to be visual: the ideal poster conveys its main ideas in a minute. Here is a gallery of good posters from previous years.

Poster logistics

  • Location and timing: Jan. 21, BC Atrium, 13:00-17:00.
  • You are advised to come with your poster at least 10 minutes in advance, as we will start at 13:00. We will be there to help you.
  • Printing and paying: self-managed (please try to plan in advance and not invade the reprographie the day before!); cf. https://repro.epfl.ch.
  • You’ll be given a detailed schedule in due course, so you can anticipate more or less when you’ll need to be there.
  • There will be 5 best-project awards, which will entail a full ticket (incl. gala dinner; worth CHF 199 per ticket) for every member of the group for the Applied Machine Learning Days at the SwissTech Convention Center (Jan. 26-29).
  • How the awards will work: We will select 10 finalist groups from Milestones 1-3 (they will be told in advance, in order to be able to prepare). The 10 finalists will give brief, 3-minute lightning talks at the poster session, which all can attend. The 5 winning groups will be chosen by the ADA team and will be announced immediately after the poster session.

Project ideas

[NEW] Effect of daylight saving time on Web usage

Recently, there has been a lot of debate on whether daylight saving time should be abolished or maintained. Let’s look into the question using data! Are the effects of that hour that we gain in fall, and lose in spring, noticeable in digital traces on the Web (e.g., tweets, Wikipedia edits and views, etc.)? Do users immediately shift their schedules, or is there some lag time until they adapt?

[NEW] Effect of names on success

It has been argued that people whose name starts with a letter that comes early in the alphabet are on average more successful than people whose name starts with a letter coming late, possibly because the former are more salient on class lists etc. Can you support or refute this hypothesis with data? For instance, are there more people in Wikipedia whose name starts with an A, than people whose name starts with a Z? (You’ll need to open your observational-study toolbox to make meaningful claims here!)

[NEW] How are Bible (or Quran etc.) quotes used on Twitter?

A simple but important question, given the role of religion in extremism. Inspiration: America’s Public Bible.

Don’t judge a book by its cover!

Do cover aesthetics matter? Analyze book covers with respect to their reviews. Does the color palette or other visual elements impact review scores and sentiments? Are there “good cover practices” and “bad cover practices” you can find? Use the Amazon dataset, and be sure you know some machine learning.

Media polarity

Media outlets often support an agenda of their own. Can this agenda be discovered by analysing how differently they report the same news? Build a polarity profile for different media over time and topics, choosing from the variety of datasets available (200y, News On The Web, etc.).

Why inequality?

Are there geographical, historical, cultural or other predictors of inequality among countries? Enrich the Atlas dataset with country-specific historical data from Wikidata or elsewhere, and explore some motivations for current trading inequalities: are there recurring patterns? For example, consider culture, religion, government, geographic location, etc.

Globalization of the news

The world is becoming increasingly connected. Does the trend toward globalization also exist in the news? If so, since when? Detect and describe the ratio between international and local events through the years in a dataset of news (ideally, 200y where named entities are already provided).
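
To make the quantity in this idea concrete, here is a minimal sketch of one way to compute a yearly share of international news. Everything in it is an assumption for illustration: the record layout, the toy data, and especially the definition of “international” (an article that names at least one country other than the newspaper’s own). A real project would need to justify its own definition and derive the countries from the named entities in the dataset.

```python
from collections import defaultdict

# Hypothetical records: (year, country of the newspaper, countries named in the article).
ARTICLES = [
    (1900, "CH", {"CH"}),
    (1900, "CH", {"CH", "FR"}),
    (1950, "CH", {"US"}),
    (1950, "CH", {"CH"}),
    (1950, "CH", {"DE", "FR"}),
    (2000, "CH", {"US", "CN"}),
]


def is_international(home, mentioned):
    """Assumed definition: the article names at least one foreign country."""
    return any(country != home for country in mentioned)


def international_share_by_year(articles):
    """Return {year: fraction of that year's articles classified as international}."""
    totals = defaultdict(int)
    international = defaultdict(int)
    for year, home, mentioned in articles:
        totals[year] += 1
        if is_international(home, mentioned):
            international[year] += 1
    return {year: international[year] / totals[year] for year in sorted(totals)}


shares = international_share_by_year(ARTICLES)
for year, share in shares.items():
    print(year, round(share, 2))
```

Reporting a share per year rather than raw counts keeps the trend comparable across years with very different numbers of articles.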