Commercial Detection

Google Summer of Code 2015, Red Hen Labs

View project on GitHub

Building an automated commercial detection system

Think about this situation: you are watching a really enthralling movie that you have recorded on your TV. You are engrossed in the movie and, as it proceeds, the stage is set for the plot to unfold. You are on your toes, eagerly waiting to see how it culminates, and snap! It's a commercial break! I am sure all of you have been in this situation, and if you haven't, let me tell you, it's one of the most annoying things about television. But wait, what if we could eliminate this problem completely? What if there was a system that could remove all the ads you hate for you?

This is precisely the system that Red Hen Labs plans on having: an automated and robust commercial detection system, one that works on any TV broadcast, regardless of the language or genre of the show. I am really grateful to have been given the opportunity to build such a system under the guidance of Carlos Fernández Sanz and Weixin Li as part of Google Summer of Code 2015.

This blog is meant as a general progress report for my project, but I plan on making it a tutorial both for easily using this system and for developing your own from scratch. If you ever wanted to do some fun signal processing, image processing and machine learning all at the same time, this would be a great project to get your hands on.

Using the system

Currently, the instructions are for Debian-based Linux systems. Install the dependencies by running the script dependencies.sh and download the MySQL database. Then run:

python main.py <path-to-tv-recording>

This will create a file called output.txt that lists the locations of all the commercials. The output will look like this:

00:00:00 - 00:09:38 = TV
00:09:38 - 00:09:52 = Commercial by Pepsi
00:09:52 - 00:10:05 = Commercial by Nike
...

Do you enjoy watching Nike ads? Great, then just delete those lines from output.txt. You don't want any ads at all? Then keep the file as is. Finally, run the following:

This part is still under development

python remove_ads.py <output.txt> <path-to-tv-recording> <new-name-of-video>

That is it: you have your recorded TV with only your favorite commercials in it, and you can go watch your show.
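Since remove_ads.py is still under development, here is a minimal sketch of how one might parse output.txt and build ffmpeg commands for the segments to keep. The label format follows the example output shown above; the function and file names are mine, not the project's actual implementation.

```python
import re

# Matches lines like "00:09:38 - 00:09:52 = Commercial by Pepsi"
LINE_RE = re.compile(r"(\d\d:\d\d:\d\d) - (\d\d:\d\d:\d\d) = (.+)")

def to_seconds(ts):
    """Convert an HH:MM:SS timestamp into seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

def segments_to_keep(output_text):
    """Return (start, end) second pairs for every non-commercial segment."""
    keep = []
    for line in output_text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        start, end, label = match.groups()
        if not label.lower().startswith("commercial"):
            keep.append((to_seconds(start), to_seconds(end)))
    return keep

def ffmpeg_commands(keep, video, out_prefix="part"):
    """Build one ffmpeg stream-copy command per segment to keep."""
    cmds = []
    for i, (start, end) in enumerate(keep):
        cmds.append(["ffmpeg", "-i", video, "-ss", str(start),
                     "-t", str(end - start), "-c", "copy",
                     "%s_%d.mp4" % (out_prefix, i)])
    return cmds
```

The commands only cut the kept segments; the resulting parts would still need to be concatenated (for example with ffmpeg's concat demuxer) to produce the final video.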

Building one from scratch

This part of the blog will keep being edited until the end of the project, so if you are reading it any time before September 2015, know that it is not the final version.

How humans detect commercials

The most natural way of building any automated system is to first think of how a human does the task, then try to mimic that. So, how do you differentiate between a commercial and a TV program? It is amazing how, once you start to think this way, even seemingly trivial tasks such as this one turn out to be convoluted. Detecting commercials is easy for us as humans under normal circumstances, but it is a really hard thing for a computer to do. So, to even out the scales a little, let us put a human in a different situation, one where this task won't appear quite so trivial, and then mimic the path the human takes to learn in that situation.

Imagine a hypothetical situation in which all your memory of TV shows, movies and advertisements was magically wiped out. You switch on the TV and tune in to a channel in a language you don't know, so you have no way of recognizing commercials from previous memory or by understanding the language. The TV is turned on right in the middle of an ongoing show or ad, so you have no clue which part you are in. You continue watching, and in a really short time you realize whether you are in the middle of a show or an ad. You form this conclusion based on how the scenes change, how people talk, the tone of the background music and other such cues; you just know it is a show or an ad, even though you would have a hard time explaining how you figured it out (this is the convoluted part I was talking about).

Now you can differentiate between that particular TV show and ads, and you watched TV for less than 10 minutes: staggering progress! Our task is not yet done though; you also have to identify the ads, that is, see what they are trying to advertise. Remember, you can't read anything on the screen because you don't know the language, so how do you know what product is being advertised? You can't, so you need the help of someone who knows the language, or who can identify the company logo for you; then you write a report recording when the ad occurred and what it was advertising. This is how you learn to detect commercials. With time, you will learn these logos yourself, or maybe even the pattern of text in the language, and be able to detect new commercials from the same company, provided they don't change their logo. One of the good things is that once you have seen a commercial and remember it, you can spot its next occurrence in a matter of seconds. Hence, even in this hypothetical situation, your brain has become a commercial detector. Now that you have a glimpse of how the human brain does it, let us proceed to the more exciting part of mimicking it.

Mimicking the human brain

The only resource at my disposal was access to HD videos of TV recordings; my task was to build a commercial detection system from these. My first thought was that a human divides the video into blocks based on how different the blocks are, and then decides whether each block is a commercial or not. Even though the exact process by which the mind divides the video into blocks and later classifies them is rather obscure, I went ahead and implemented it. After spending many hours and coming up with a not-so-robust system, I discussed it with Carlos, who made me realize that I should be teaching the system (building a supervised learning algorithm) rather than making it learn completely by itself.

Now I had to teach the computer to differentiate between commercials and TV, and also to identify the commercial if possible. But how could I teach something I didn't know myself? So I went through one of the recorded videos and made a report; this report looked like the output file the system should generate. This is what is called training data. Using ffmpeg, the regions containing commercials were copied out of the video and stored separately in a folder. A CSV file was made containing the path to each of these hand-picked commercials along with its name and duration. It also had a field called verified (more on this later); the CSV file simply acted as a user-friendly database.
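As an illustration, such a CSV database could be written and read back like this. This is a minimal sketch using the fields described above (path, name, duration, verified); the actual column names and layout in the repository may differ.

```python
import csv
import io

# Assumed column layout; the project's real CSV may order or name these differently
FIELDS = ["name", "path", "duration", "verified"]

def write_labels(rows, fileobj):
    """Write the hand-picked commercial entries as a small CSV database."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def read_labels(fileobj):
    """Load the CSV database back into a list of dicts."""
    return list(csv.DictReader(fileobj))
```

A quick round trip through an in-memory buffer shows the idea; in the real system the file would live on disk next to the clipped commercials.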

Let us come back to the brain analogy. If the brain in the hypothetical situation described earlier had seen these commercials, we know it would identify any occurrence of them in any video within seconds. This was the first thing I had to teach the computer: 100% recognition of already-seen commercials. The words 100% and recognition, when put together, are really intimidating for anyone with a background in machine learning. But look more closely: every commercial has some audio, so all we need is a method that finds an exact match of the same signal in other videos. Mattia Cerrato, a student developer at Red Hen Labs at the time, suggested that audio fingerprinting would be just the thing to achieve this.

If you want to read more about audio fingerprinting, you should really visit Will Drevo's blog; it is one of the best explanations of the topic I could find on the web. In it, he explains the workings of his open source audio fingerprinting project, Dejavu. I integrated Dejavu into the system, and with it I could achieve 100% detection of already-seen commercials.

Implementing audio fingerprinting using Dejavu

We obtain the audio of the entire video using ffmpeg. The pseudo code for the matching process is as follows:

// Initially, build the fingerprint database
for item in database:
    // build a fingerprint for item and store it in a MySQL db (done by Dejavu)

// Recognize commercials in the given video
start = 0
end = end time of video
audio = audio of the video
fingerprint_length = 5 seconds  // 5 seconds was apt for really good fingerprints; more stats are in the Dejavu docs
skip_length = (duration of the shortest commercial in db) - fingerprint_length
while start < end:
    // fingerprint the audio from start to start + fingerprint_length
    if there is a match for the fingerprint:
        // obtain the offset of the fingerprint within the commercial (done by Dejavu)
        com_start = start - offset
        com_end = com_start + duration  // duration obtained from the hand-picked database
        // write com_start and com_end into the output file, along with the name of the commercial
        start = com_end  // skip directly to the end of the detected commercial and continue scanning
    else:
        start = start + skip_length

After these steps we have a system that, like a human, can recognize commercials it has already seen. But this means we would have to teach it every commercial on the planet to make it robust, and keep teaching it new ones, which is not feasible by any means. So let us proceed to make it automated.
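The scanning loop above can be sketched in plain Python. Here `recognize` is a hypothetical stand-in for Dejavu's matcher (returning the commercial's name and the offset of the fingerprinted snippet within it), and `db_durations` stands in for the hand-picked database; all times are in seconds.

```python
def scan(video_length, recognize, db_durations, fingerprint_length=5):
    """Scan the audio, jumping past matches and skipping ahead otherwise.

    recognize(start) should return (name, offset) on a match, else None.
    db_durations maps commercial names to their known durations in seconds.
    """
    detections = []
    # Skip in steps small enough that no commercial can be jumped over entirely
    skip_length = min(db_durations.values()) - fingerprint_length
    start = 0
    while start < video_length:
        match = recognize(start)
        if match:
            name, offset = match
            com_start = start - offset
            com_end = com_start + db_durations[name]
            detections.append((com_start, com_end, name))
            start = com_end  # jump directly past the detected commercial
        else:
            start += skip_length
    return detections
```

With a 14-second shortest commercial and 5-second fingerprints, the loop advances 9 seconds at a time, which is why it is so much faster than fingerprinting every second of the video.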

Finding scene changes

Things start to become a little technical from here, so please do bear with me. While the above approach is fast, it is a heuristic, so it may not always work: the start point of a commercial may not be detected correctly. To fix this problem, we should delve deeper into how a video to be broadcast on TV is created.

A TV broadcast is basically a series of videos merged together into one unit: many advertisements and many blocks of the same TV show are combined in the desired order. Let's call a point where two videos are joined a scene change. If we detect these scene changes and run our algorithm only near those points, we gain both efficiency and accuracy, and if our scene change detection is good, the entire system becomes robust.

On careful inspection, to make a good transition the videos have to be stitched together so that there is no blank frame between them; that way viewers cannot notice the join and the transition looks smooth. The audio is a different matter: if there were no silence between the audio tracks of the two videos being joined, you could be seeing the start of the second video while still hearing the tail end of the first, which would make the transition atrocious. For this reason, a gap of silence is always left at the scene change point.

Let us use this silence to detect scene changes. We can extract the audio from the video using ffmpeg, as stated before. If we naively looked for points where the sound level is zero, we would fail, since there are many such points and most of them are not true scene changes. Some papers (1, 2, 3, 4, 5, 7, 8, 11) resort to measuring the duration of silence to find scene changes; although this works, it gives rise to many false positives, which increases the running time of the algorithm and still does not leave the system completely robust. This suggested that time-domain information alone does not suffice to find these scene changes.

So the frequency domain of the audio signal was examined, which gave more data to work with. One of the best papers using frequency domain data for scene change detection was paper 14 (in the references): it treats the frequency components of every sample window as a vector and computes the cosine similarity between vectors. Cosine similarity is used to keep the distance measure oblivious to the magnitudes of the vectors. Although this approach performed much better than the previous time-domain approaches, it too gave too many false positives and was discarded in the end. (The implementation of this algorithm can be found at https://github.com/vasanthkalingeri/CommercialDetection/tree/aa972690e5e5955e80252ddeb790723b46709883)
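The core of that discarded approach can be sketched in a few lines: treat each window's spectrum as a vector and flag windows whose cosine similarity to the next window is low. This is a minimal illustration of the distance measure, not the paper's full novelty computation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two spectral vectors (blind to magnitude)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dissimilarity_scores(frames):
    """1 - cosine similarity between each pair of adjacent spectral frames.

    Peaks in this sequence are candidate scene changes; in practice it
    produced too many false positives, as noted above.
    """
    return [1 - cosine_similarity(frames[i], frames[i + 1])
            for i in range(len(frames) - 1)]
```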

Using Audacity, we can plot the power spectrum of the audio signal and manually check whether anything peculiar happens at scene changes. One such shot is below.

Power spectrogram

The top axis gives the time and the left axis the frequency; the color of the plot encodes the amplitude at each (frequency, time) pair. The color scheme follows the rainbow's VIBGYOR, red indicating high amplitudes and blue low amplitudes. From this we observe that the amplitude is very low at a scene change (here at 1 minute, 8 seconds and 15 milliseconds), and sometimes it is zero. We can hence assign a score to every time period and, based on the score, classify whether it is a scene change point or not.

Let us delve deeper into the data we have for every time period. During a scene change, the minimum and maximum amplitudes are both very low. So one candidate score is the difference between the maximum and minimum amplitudes in each time period. This seems apt, but what if the minimum and maximum amplitudes were high while their difference was low? Then this score would fail. A score that does work is

 score = (maximum + minimum) / variance

This score is high at scene changes and extremely low for normal segments of audio. The intuition is that the sum of the maximum and minimum is low at a scene change and never that low for normal segments; the variance takes this a step further and adds confidence, since it is very low at a scene change and high otherwise, making the score shoot up exactly at scene changes. In practice this works with a true positive rate of 100% and a false positive rate of 5%, which is good enough for our purpose.

I am sure you noticed that, in theory, this can fail: when all the amplitudes are nearly equal (so the variance is low), the score shoots up even without a scene change. If we converted such frequency domain data back to the time domain, the audio would sound like terrible, unbearable noise that cannot be produced naturally by any instrument or human voice. On TV, however, such audio usually means a transmission error, so we can safely use this score to detect scene changes. This is the only assumption made while detecting scene changes and, as stated, it is a very safe one.
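The score can be written as a tiny function over one window's amplitude values. This is a sketch; the small epsilon guarding a perfectly constant window is my addition, not part of the original formula.

```python
from statistics import pvariance

def scene_change_score(amplitudes, eps=1e-9):
    """score = (maximum + minimum) / variance over one window's spectrum.

    A near-silent window has a tiny numerator but an even tinier variance,
    so the ratio shoots up; eps avoids division by zero on a constant
    window (an assumption added here for numerical safety).
    """
    return (max(amplitudes) + min(amplitudes)) / (pvariance(amplitudes) + eps)
```

Comparing a near-silent window against a normal one shows the separation the thresholds in the pseudo code below rely on.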

Our pseudo code for scene change detection and efficient audio fingerprinting is now as follows:

// Find the audio of the video using ffmpeg
// Compute the Fast Fourier transform with a 3 ms window
// (this window size was optimal for detecting all scene changes fast)
// Do not overlap the windows when computing the Fourier transform
for each window:
    // find the score from the vector of amplitudes
    if score > 0.01:
        // this threshold was tuned empirically: scene changes usually score
        // above 0.01 while normal segments score below 0.005
        mark a scene change and store this time value
    else:
        proceed to the next window

This algorithm gives us the time periods at which scene changes occur. In the vicinity of these scene changes we apply audio fingerprinting as in the previous algorithm. The implementation of the process up to this stage can be found at https://github.com/vasanthkalingeri/CommercialDetection/tree/094a883c3293bd9c406e9e7ec44a2db292ac4148 .

Now we have a system that can detect scene changes, dividing the entire video into blocks; we have essentially reversed the process of creating a TV broadcast. With these individual blocks in place, our next step is to classify each block as either a commercial or TV program. This is a really good stage to be in: using the famous divide and conquer strategy, we have gone from detecting commercials somewhere in a video to identifying which blocks of the video are commercials, a comparatively easier problem to solve.

We now obtain an output that looks like this:

 00:00:00 - 00:00:37 = unclassified
 00:00:37 - 00:00:38 = silence
 00:00:38 - 00:00:52 = ad by jeopardy
 00:00:52 - 00:00:53 = silence
 00:00:53 - 00:01:08 = unclassified

Here 'ad by jeopardy' has been hand-picked and fingerprinted. 'Unclassified' marks the blocks the system has divided the video into; once we classify each of these blocks, our system will be complete.

Notice that the output file looks very similar to the input labels file; it is kept this way for a reason you will find out in later parts of the blog.

Identifying commercials

Work under progress

Teaching the computer

Our hypothetical human could not identify commercials either without an expert teaching him what each video segment advertised. It would require a lot of background knowledge for a computer to identify new commercials just by seeing them; since this system is not a strong AI, that cannot be expected of it. So we design a system that lets us teach it the commercials it tags as unknown.

(I am skipping the instructions on how to create a web interface, since there are many better resources for that.)

This is the simplest of all the steps. We create an interface to easily edit the output produced by the first run of the system; it mainly lets a human view each unknown block and update its name. The updated output file is then used as the labels file for audio fingerprinting; this time, only the names not already present in the database are fingerprinted, and in this way new commercials are learnt.
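The selection of new names could be sketched like this. The helper and its inputs are hypothetical (the real logic lives in the repository); it simply filters the corrected labels down to names not already fingerprinted, skipping blocks still marked unclassified.

```python
def new_labels(edited_labels, known_names):
    """Keep only segments whose names are not yet in the fingerprint database.

    edited_labels: (start, end, name) tuples from the human-corrected
    output file; known_names: names already fingerprinted. Comparison is
    case-insensitive (an assumption made for this sketch).
    """
    known = {name.lower() for name in known_names}
    return [(start, end, name) for (start, end, name) in edited_labels
            if name.lower() not in known and name.lower() != "unclassified"]
```

Only the segments this returns need to be clipped with ffmpeg and fingerprinted, so each teaching round stays cheap.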

This is the version of the system currently on GitHub; it requires a little more work on the interface for everything to run smoothly.

References

  1. Real time commercial detection using MPEG features - http://www.researchgate.net/profile/Nevenka_Dimitrova/publication/229024227_Real_time_commercial_detection_using_MPEG_features/links/00b7d52b9b0879c1bc000000.pdf

  2. Automatic detection of TV commercials - http://www.researchgate.net/profile/Oge_Marques/publication/3227702_Automatic_detection_of_TV_commercials/links/0c96051e533d12d515000000.pdf

  3. A high-performance shot boundary detection algorithm using multiple cues

  4. Story Segmentation and Detection of Commercials In Broadcast News Video - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.6845&rep=rep1&type=pdf

  5. CONSTRUCTION AND EVALUATION OF A ROBUST MULTIFEATURE SPEECH/MUSIC DISCRIMINATOR - http://www.ee.columbia.edu/~dpwe/papers/ScheiS97-mussp.pdf

  6. On the detection and recognition of television commercials - https://ub-madoc.bib.uni-mannheim.de/800/1/TR-96-016.pdf

  7. Time constraint boost for TV commercial detection - http://research.microsoft.com/en-us/um/people/taoqin/papers/qin-icip04.pdf

  8. A confidence based recognition system for TV commercials - http://crpit.com/confpapers/CRPITV75Li.pdf

  9. Comparison and combination of two novel commercial detection methods - http://www.cs.cmu.edu/~mychen/publication/duygulu_ICME04.pdf

  10. A pseudo statistical approach to commercial detection - http://lyle.smu.edu/~prangara/Report_Commercial_Boundary_Detection.pdf

  11. Real time commercial detection in video - http://www.cse.msu.edu/~fengzhey/downloads/projects/before2015/Comcast-2013.pdf

  12. Detection of commercials using sift - http://www.ijetae.com/files/Volume4Issue6/IJETAE_0614_82.pdf

  13. Robust learning based TV commercial detection - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.9758&rep=rep1&type=pdf

  14. Automatic audio segmentation using a measure of novelty - http://www.fxpal.com/publications/FXPAL-PR-00-094.pdf