Problem G
Spudcast

You have decided to enter the lucrative world of podcasts by creating one called Spudcast together with your AI friends, but it is not without problems. You are, in fact, incredibly bad at writing podcast scripts with timestamps, which your AI friends need in order to participate. To make this process easier, you came up with the clever idea of stealing podcasts from others and converting them into podcast scripts with timestamps. The content itself – that is, the spoken text – was not so difficult for you to obtain, but the timestamps are nowhere to be found. Therefore, you have decided to create a model that does this for you for all podcasts.

Training data is provided for this task. You are not allowed to use training data of your own found on the internet.

Input

Download the zip file with training data and test data. This can be found at the bottom under "attachments". You will receive a zip file that contains:

  • train - Folder containing audio files of one person speaking.

  • test - Folder containing audio files where one or more people speak, which is what you must provide answers for.

  • test.txt - Text file stating how many speakers there are in total in each audio file in the test folder.

  • baseline.ipynb - How to load the mp4 file and convert it to a simpler format.

  • baseline.py - An example of what a submission to Kattis should look like.

Output

For each audio file in test, output when each speaker is speaking (separated by newlines); the order in which the speakers are output does not matter. Use the following format, shown here for 3 people: $[s1-e1,s2-e2,s3-e3] [s1-e1,s2-e2] [s1-e1,s2-e2,s3-e3,s4-e4]$, where $s1, e1$ are the start time and end time of the first interval in which that person is speaking, each in the format "mm:ss". Do not put any spaces between the times; this will lead to wrong answer.
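The format above could be produced with a small helper like the following sketch. The function names (`to_mmss`, `format_speaker`, `format_file`) are hypothetical, and the sketch assumes that intervals are available as pairs of whole seconds and that the bracketed groups for one file are joined by a single space, as in the example above. The provided baseline.py should be treated as the authority on the exact layout.

```python
def to_mmss(seconds: int) -> str:
    """Convert a whole number of seconds to the "mm:ss" form."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def format_speaker(intervals) -> str:
    """Format one speaker's (start, end) intervals, with no spaces inside."""
    parts = [f"{to_mmss(start)}-{to_mmss(end)}" for start, end in intervals]
    return "[" + ",".join(parts) + "]"


def format_file(speakers, sep: str = " ") -> str:
    """Join the bracketed groups of all speakers in one audio file.

    The separator is an assumption taken from the example in the statement.
    """
    return sep.join(format_speaker(intervals) for intervals in speakers)
```

For instance, `format_speaker([(5, 12), (60, 90)])` gives `[00:05-00:12,01:00-01:30]`.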

Scoring

For every second in which you indicate that someone is speaking, you get 1 point if they are speaking and -1 point if they are not. You can output the speakers in any order you want; we will always assign them to the actual speakers in the way that gives you the best possible score. If $S$ is the sum of all these points and point deductions across all test files, your final score is:

\[ \text{Score} = \max (0, \min (100, \sqrt{\frac{S}{4000}}\times 100 )) \]
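As a rough local check, the scoring can be sketched as below. This is not the judge's implementation. It assumes each speaker is represented as a set of whole seconds in which they speak, that the number of predicted speakers equals the number of actual speakers, and that a non-positive $S$ yields a score of 0 (the square root is undefined for negative $S$). The names `file_points` and `final_score` are hypothetical.

```python
import math
from itertools import permutations


def file_points(predicted, truth) -> int:
    """Best total points for one file over all speaker assignments.

    predicted, truth: lists of sets of seconds, one set per speaker.
    Each predicted second scores +1 if the assigned speaker is speaking
    at that second and -1 otherwise.
    """
    def points(pred, true):
        return len(pred & true) - len(pred - true)

    return max(
        sum(points(pred, truth[j]) for pred, j in zip(predicted, perm))
        for perm in permutations(range(len(truth)))
    )


def final_score(S: float) -> float:
    """Apply the score formula to the total S across all test files."""
    if S <= 0:
        return 0.0  # assumption: negative totals are clamped to 0
    return max(0.0, min(100.0, math.sqrt(S / 4000) * 100))
```

Trying every permutation is only practical for a handful of speakers per file; with many speakers an assignment algorithm would be a better fit.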
