Activities in a Small Python Project (coursera-dl)

Lately, I have been dedicating a lot of my time (well, at least compared to what I used to) to Free Software projects. In particular, I have spent a moderate amount of time with two projects written in Python.

In this post, I want to talk about the first and more popular of the two, a project called coursera-dl. To be honest, I think that I may have devoted more time to it than to any other project in particular.

With it, I started to learn (besides the practices that I already used in Debian) how to program in Python, how to write unit tests (I started with Python's built-in unittest framework, then progressed to nose, and I am now using pytest), and how to hook up the results of the tests with a continuous integration system (in this case, Travis CI).

I must say that I am sold on this idea of testing software (after being a skeptic for way too long) and I can say that I find hacking on other projects without proper testing a bit uncomfortable, since I don't know if I am breaking unrelated parts of the project.
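
For readers unfamiliar with pytest, a test is just a plain function whose name starts with test_ and that uses bare assert statements. Here is a tiny, hypothetical example (not taken from coursera-dl) of the kind of test I mean:

    # Hypothetical example of a pytest-style test (not from coursera-dl).
    # Running `pytest` in the project directory collects and runs every
    # function named test_* automatically.

    def slugify(title):
        """Toy function under test: turn a lecture title into a file name."""
        return title.lower().replace(" ", "-")

    def test_slugify_replaces_spaces_and_lowercases():
        assert slugify("Lecture 1 Overview") == "lecture-1-overview"
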

My migration to pytest was the result of a campaign from pytest.org called Adopt Pytest Month, which a kind user of the project told me about. A very skilled volunteer from pytest was assigned to our project. Besides learning from their pull requests, one side effect of this whole story was that I spent a moderate number of hours trying to understand how to properly package and distribute things on PyPI.

One tip learned along the way: contrary to the official documentation, use twine, not python setup.py upload. It is more flexible for uploading your package to PyPI.
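
In practice, and assuming a traditional setup.py-based project with the wheel package installed, the workflow is roughly the following:

    python setup.py sdist bdist_wheel   # build the source and wheel distributions
    twine upload dist/*                 # upload everything under dist/ to PyPI
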

You can see the package on PyPI. Anyway, I made the first upload of the package to PyPI on the 1st of May and it already has almost 1500 downloads, which is far more than I expected.

A word of warning: there are other similarly named projects, but they don't seem to have as much of a following as we have. My speculation is that this may be, perhaps, due to me spending a lot of time interacting with users in the bug tracker that GitHub provides.

Anyway, installation of the program is now as simple as:

pip install coursera

And all the dependencies will be neatly pulled in, without having to mess with multi-step procedures. This is a big win for the users.

Also, I even had an offer to package the program to have it available in Debian!

Well, despite all the time that this project demanded, I think that I have only good things to say, especially to the original author, John Lehmann. :)

If you like the project, please let me know and consider yourselves invited to participate by lending a hand, testing/using the program, or [triaging some bugs][issues].

User-Agent strings and privacy

I just had my hands on some mobile devices (a Samsung Galaxy Tab S 8.4", an Apple iPad mini 3, and my no-name tablet that runs Android).

I got curious to see how the different browsers identify themselves to the world via their User-Agent strings, and I must say that each browser's string reveals a lot about both the browser makers and their philosophies regarding user privacy.

Here is a simple table that I compiled with the information that I collected (sorry if it gets too wide):

| Device | Browser | User-Agent string |
| --- | --- | --- |
| Samsung Galaxy Tab S | Firefox 35.0 | Mozilla/5.0 (Android; Tablet; rv:35.0) Gecko/35.0 Firefox/35.0 |
| Samsung Galaxy Tab S | Firefox 35.0.1 | Mozilla/5.0 (Android; Tablet; rv:35.0.1) Gecko/35.0.1 Firefox/35.0.1 |
| Samsung Galaxy Tab S | Android 4.4.2 stock browser | Mozilla/5.0 (Linux; Android 4.4.2; en-gb; SAMSUNG SM-T700 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.5 Chrome/28.0.1500.94 Safari/537.36 |
| Samsung Galaxy Tab S | Updated Chrome | Mozilla/5.0 (Linux; Android 4.4.2; SM-T700 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.109 Safari/537.36 |
| Vanilla tablet | Android 4.1.1 stock browser | Mozilla/5.0 (Linux; U; Android 4.1.1; en-us; TB1010 Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30 |
| Vanilla tablet | Firefox 35.0.1 | Mozilla/5.0 (Android; Tablet; rv:35.0.1) Gecko/35.0.1 Firefox/35.0.1 |
| iPad | Safari from iOS 8.1.3 | Mozilla/5.0 (iPad; CPU OS 8_1_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B466 Safari/600.1.4 |
| Notebook | Debian's Iceweasel 35.0.1 | Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0 Iceweasel/35.0.1 |

So, briefly looking at the table above, you can tell that the stock Android browser reveals quite a bit of information: the model of the device (e.g., SAMSUNG SM-T700 or TB1010) and even the build number (e.g., Build/KOT49H or Build/JRO03H)! This is super handy for malicious websites and I would say that it leaks a lot of possibly undesired information.

The iPad is similar, with Safari revealing the version of iOS that it is running. It doesn't reveal the language that the user is using via the UA string, though (it probably does via other HTTP header fields).

Chrome is similar to the stock Android browser here, but, at least, it doesn't reveal the language of the user. It does reveal the version of Android, including the patch-level (that's a bit too much, IMVHO).

I would say that the winner regarding user privacy among the browsers that I tested is Firefox: it conveys just the bare minimum, not differentiating between a high-end tablet (Samsung's Galaxy Tab S, with 8 cores) and a vanilla tablet (with 2 cores). Like Chrome, Firefox still reveals a bit too much in the form of the patch-level: it should be sufficient to say that it is version 35.0, even if the user has 35.0.1 installed.

A bonus point for Firefox is that it is also available on F-Droid, in two versions: as Firefox itself and as Fennec.

Uploading SICP to Youtube

Intro

I am not alone in considering Harold Abelson and Gerald Jay Sussman's recorded lectures based on their book "Structure and Interpretation of Computer Programs" a masterpiece.

There are many things to like about the content of the lectures, beginning with some pearls of wisdom about the craft of writing software (even though this is not really a "software engineering" book), the clarity with which the concepts are described, the Freedom-friendly attitude of the authors regarding the material that they produced, the breadth of the subjects covered, and much more.

The videos, their length, and splitting them

The course consists of 20 video files and they are all uploaded on Youtube already.

There is one thing, though: while the lectures are naturally divided into segments (the instructors took a break after every 30 minutes or so of lecturing), the videos corresponding to each lecture have all the segments concatenated.

To watch them more comfortably (to make it easier to put a few of the lectures on a mobile device, to avoid fast-forwarding long videos from my NAS when I am watching them on my TV, and for some other reasons), I decided to sit down, take notes on where the breaks were in each video, and write a simple Python script to help split the videos into segments and then re-encode them.
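
The script itself is not reproduced here, but the idea is simple enough. The following is a minimal sketch of the approach (assuming that ffmpeg is available and that the break points have been noted down as timestamps; file names and timestamps are only illustrative):

    #!/usr/bin/env python
    # Minimal sketch (not the original script): split a lecture into segments
    # with ffmpeg, given a list of (start, end) timestamps noted by hand.
    import subprocess

    def split_video(source, breaks, prefix="segment"):
        """Cut `source` into one file per (start, end) pair in `breaks`.

        Timestamps are strings in HH:MM:SS form; re-encoding is left to
        ffmpeg's defaults here, while the real script would pass the
        desired video/audio codec options explicitly.
        """
        for i, (start, end) in enumerate(breaks, start=1):
            output = "%s-%02d.mp4" % (prefix, i)
            subprocess.check_call([
                "ffmpeg",
                "-i", source,
                "-ss", start,   # start of the segment
                "-to", end,     # end of the segment
                output,
            ])

    if __name__ == "__main__":
        # Hypothetical break points for one lecture file.
        split_video("lecture-1a.avi",
                    [("00:00:00", "00:29:45"),
                     ("00:29:45", "01:01:30")])
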

I decided not to take the videos from Youtube to perform my splitting activities but, instead, to operate on one of the "sources" that the authors once had on their homepage (videos encoded in DivX and audio in MP3). The videos are still available as a torrent file (with a magnet link for the hash 650704e4439d7857a33fe4e32bcfdc2cb1db34db), with some very good souls still seeding it (I can seed it too, if desired). Alas, I have not found a source for the higher-quality MPEG-1 videos, but I think that the videos are legible enough to avoid bothering with a larger download.

I soon found out that splitting the videos has some beneficial side effects, like not having to edit/equalize the entire audio of a video when only one segment is bad (which is understandable, as these lectures were recorded almost 30 years ago, when technology was not as advanced as it is today).

So, since I already have the split videos lying around here, I figured that, perhaps, other people may want to download them, as they may be more convenient to watch (say, during commutes or whenever/wherever it best suits them).

Of course, uploading all the videos is going to take a while and I would only do it if people would really benefit from them. If you think so, let me know here (or if you know someone who would like the split version of the videos, spread the word).

Problems with Emacs 24.4

This is, essentially, a call for help, as I don't really know which program is at fault here.

Given that Emacs's upstream converted their repository from bzr to git, all the commits in mirror repositories became "invalid" in relation to the official repository.

What does this mean in practical terms, in my case? Well, bear with me while I try to report my steps.

Noticing a regression and reporting a bug

There is a regression with Emacs 24.4 relative to 24.3, which I discovered after Emacs 24.4 became available in Debian's sid.

The regression in particular is that Emacs 24.4 doesn't seem to respect my Xresources, while 24.3 does (and this is 100% reproducible: I kept the binary packages of version 24.3 of emacs24, so I can install and reinstall either version to compare).
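
For context, "respecting my Xresources" means honoring the X resource settings for Emacs read from ~/.Xresources (or the X resource database), such as the following illustrative entries (not my actual configuration):

    ! Illustrative ~/.Xresources entries for Emacs
    Emacs.font: DejaVu Sans Mono-11
    Emacs.geometry: 80x40
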

When I reported this to upstream, I received a reply saying that it worked fine for another person who was using XFCE on unstable.

Testing various Desktop environments

As I am using the MATE desktop environment, I proceeded to test this assertion by installing XFCE. There, Emacs 24.4 read my Xresources. I went ahead and installed LXDE: it worked again. I tried once more with GNOME 3, but "regular" GNOME 3 just crashed. I tried with GNOME 3 Classic and Emacs 24.4 worked again.

Going down the rabbit hole

Then, I got more curious and tried to see why things worked the way they did. Given that there was a mirror of the Emacs repo on GitHub, I cloned it and started to git bisect to find the first problematic commit (I have no idea whether bzr even offers something like git's bisect, and I wouldn't know how to do it as quickly as I do with git).
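
For those who have not used it, a bisection session looks roughly like this (the refs below are only illustrative; the point is that git repeatedly checks out a revision halfway between a known-good and a known-bad one, and you build and test it):

    git bisect start
    git bisect bad master           # the current tip shows the regression
    git bisect good emacs-24.3      # a revision known to behave correctly
    # build and test the checked-out tree, then tell git the verdict:
    git bisect good                 # or `git bisect bad`, as appropriate
    # ...repeat until git announces the first bad commit
    git bisect reset                # return to the original branch when done
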

To cut a long story short, after many recompiles, many wasted hours, and a lot of wasted electrical energy, I found a bad commit and reported it.

I received no response after that.

The new repo enters the scene

Of course, all my hard work bisecting things was completely invalidated after the transition to the new repository went live.

To make my findings relevant again, I used the awesome powers of git, restricting the history of the newly cloned repository to commits by the committer in question (Chong Yidong), and, from there, I proceeded to another painful round of git bisection.

And, sure enough, the first bad commit was the same one that I found with the previous tree.

Semi-blindly reverting this commit, and also semi-blindly resolving the resulting conflicts, makes Emacs from master work again on my system, but I highly suspect that (given the way that I did it) the revert would not really be appropriate for upstream.

But given also that I failed to receive feedback after my original report, I am not too confident that this bug will be solved soon (even if it doesn't qualify for being fixed in Debian 8).

After all this, I don't really know what else to do. I even filed a bug report (more like a request for help) to the Debian MATE maintainers.

As a side note, I would have filed a bug with upstream MATE, but it is not really clear what the proper procedure for reporting bugs to them is: they seem to use GitHub's issue system, but given that they have separate repositories for each component of the project, and that I don't know precisely which repository to report to (or even whether the problem applies to MATE at all), I am more or less paralyzed.

A side note

I must say that the conversion was well done by Eric Raymond: the whole .git directory of the new repo is only about 200MB, with history going back to 1985, while the old repository took up about 800MB.

The importance of flexible deadlines in MOOCs

I have always thought that having flexible deadlines in MOOCs is important, despite not having used them too much. Until last night, that is.

The course in question is Stanford's Statistical Learning, and they adopted a policy of letting the students complete all the assignments of the 10-week course until the last day of the course, on March 21. Then, they graciously extended the deadline to April 4.

Were it not for such an extension, I would not have completed the course. I sent them a message on the course forums this morning to thank them for this, as this, in my experience, is not so common with MOOCs.

I reproduce below the message that I posted on the forums:

Subject: Thank you also for the EXTENSION of the deadlines

Dear professors and staff,

I would like to thank you (of course) for the course, as many others have already. Despite not liking the edX platform too much (preferring the UI of, say, Coursera), your course was an exception, interesting enough (and funny enough) that I stuck with it until the end.

But I would like to bring attention here to a point that many may not have appreciated (or, perhaps, not expressed as clearly as I thought that it deserved), namely, the extension of the deadlines (and a uniform deadline for all homeworks).

In particular, due to some unfortunate facts of my personal life, I could not work on the course at all in the past few weeks. In fact, I completed units 4, 5, 6, 7, 8, 9, and 10 in the last 3 or 4 days and submitted my last quiz a few hours before today's midnight (at my local time, UTC-0300), and I still got a passing grade(!).

Even with my desire to finish the course, this would not have been possible if you had not graciously allowed for the deadline extension. I am sure that I may not be alone here in appreciating this extension (even though I think that I may be many standard deviations from the mean, doing the homework and R programs of 7 weeks in only 4 days).

Thank you so very much for everything (including this extension!),

Rogério Brito.

More completed MOOCs

This weekend I received my 17th certification (or Statement of Accomplishment) for MOOCs. In particular, this last MOOC that I took was Creative, Serious and Playful Science of Android Apps, offered by Lawrence Angrave of the University of Illinois at Urbana-Champaign.

While I certainly knew that the course was an Introduction to Programming, the reasons I took it were twofold:

  1. For the "novelty" (for me, at least) of writing some Android Apps
  2. To reacquaint myself with Java, which I had not touched since 1999, when I wrote a compiler, back when JDK 1.0 was being transitioned to JDK 1.1.

I think that the time the course took (and that I invested in it) was really well spent. In fact, I learned some nice things which I would not have touched otherwise. For instance, I made the conscious effort to use an IDE (Eclipse), despite being a person who does most things inside Emacs and compiles programs with command-line tools.

Despite being huge, Eclipse has some really nice features and the auto-completion is amazing. Since Java is so strongly typed and Eclipse knows Java pretty well, it almost completely writes your programs for you. :)

Of course, my interest also made me want to streamline the build process and use the command line tools (with ant, another tool that was "alien" to me, having used makefiles for everything that I needed).

The use of Eclipse (and the Android development environment in general) on my Debian unstable computers was not without problems, but after I spent more time trying to fix Eclipse than actually doing the homework (and pulling out the little hair that I still have left), one really helpful post nailed it. (Yes, I had to remove other packages like gimp, but such is life.)

While playing with the Android side of things, I put one badly written (and not really "android-ish") app in a GitHub repository (which I think is aptly named "Toy Android Apps"), which served the purpose of helping me learn some Android-specific concepts.

But the course was helpful not only for getting in touch with the tools (Eclipse, ant, running Android virtual machines with the help of kvm), but also on the Java side. I lost a bit of my prejudice against it (even though I still think it is a verbose language, especially when compared to, say, Python), and I also intend to contribute to one Apache project or another that has low-hanging fruit. Actually, I have signed their Contributor License Agreement and I even had my first patch accepted!

Aside: Well, sadly, SVN doesn't support distinct committer and author fields like git does, and it seems that making many small commits may not be the Apache way of doing things, but the important part of all this is that the code is there and there is so much more to be done.

Aside #2: After a lot of time spent converting the repository to git (which is, after all, what I use), I put up a mirror of Apache's Commons Graph project.

Paco de Lucia

I was super sad to learn that Paco de Lucia passed away a few days ago. It was shocking to realize that he was so young (only 66 years old, if I am not mistaken).

To share some of his fine work with his frequent collaborators John McLaughlin and Al Di Meola with people who may not otherwise know him or his work, I offered in a Facebook post to upload a bootleg of a show by the Guitar Trio (John McLaughlin, Paco de Lucia, and Al Di Meola), with the entire performance (possibly out of order, as it is a bootleg, after all) ripped from a broadcast by a Dutch TV station.

This performance features the exceptional Paco de Lucia prominently, and the show has many songs from the "Friday Night in San Francisco" record, including the magnificent (IMVHO) "Mediterranean Sundance".

The credits of the video are, of course, of the musicians.

I only ripped, deinterlaced (remember, analog TV was interlaced), denoised, encoded, and uploaded the video to Youtube.

Trivial fact #3: Continued fractions via matrices

Note

Apparently, it's not Planet Debian that has problems dealing with MathJax (which makes posts using it appear illegible on Planet Debian), but the ikiwiki plugin that I am using, which generates garbage in the feeds that get consumed by planet.debian.org.

If you know of a better plugin (one which doesn't generate such output), please let me know, especially if it is not super computationally expensive for a Core 2 Duo T7250 (which is what my notebook has).

The real thing

As William Stein is now offering a course on Number Theory and has been posting recorded videos of his lectures, I started watching some of them (mainly the ones regarding continued fractions). In particular, he shows the usual recursive formula for the convergents of a continued fraction, and that's super nice.

For the curious reader, the recurrence relations for the convergents of \([a_0, a_1, \ldots, a_n, \ldots]\) are: \[ p_n = a_n p_{n-1} + p_{n-2}\\ q_n = a_n q_{n-1} + q_{n-2}, \] with initial conditions (\(p_{-2} = 0\), \(p_{-1} = 1\), \(p_0 = a_0\), \(q_{-2} = 1\), \(q_{-1} = 0\), \(q_0 = 1\)).
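
As a quick sanity check of these recurrences (my own illustration, not something from the course), a few lines of Python reproduce the convergents; with all coefficients equal to 1 (the golden ratio's expansion), they come out as ratios of consecutive Fibonacci numbers:

    # A small check of the recurrences above (illustrative only).
    from fractions import Fraction

    def convergents(coeffs):
        """Yield the convergents p_n/q_n of [a0; a1, a2, ...]."""
        p_prev2, p_prev1 = 0, 1   # p_{-2}, p_{-1}
        q_prev2, q_prev1 = 1, 0   # q_{-2}, q_{-1}
        for a in coeffs:
            p = a * p_prev1 + p_prev2
            q = a * q_prev1 + q_prev2
            yield Fraction(p, q)
            p_prev2, p_prev1 = p_prev1, p
            q_prev2, q_prev1 = q_prev1, q

    # The golden ratio is [1; 1, 1, 1, ...]; its convergents are ratios of
    # consecutive Fibonacci numbers: 1, 2, 3/2, 5/3, 8/5, ...
    print(list(convergents([1] * 8)))
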

He even motivated the use of continued fractions with the golden ratio, which is super nice, given that I like the subject and have been writing a document collecting facts that I know about the Fibonacci numbers (well, this document is horribly incomplete and not even close to something that I would consider proper for public consumption; I plan on publishing it soon).

OK, after he discussed the basics of convergents, he noted that the recurrence relations are like those defining the Fibonacci numbers, except that one of the terms is "weighted" by the coefficients of the continued fraction.

I missed one thing, though: neither his book nor Wikipedia's article on continued fractions mentions a very neat, alternative way to express the convergents of continued fractions (truth be told, I took a quick peek at Wikipedia's article and found that it doesn't mention it, at least not explicitly, in a one-minute skim; it may well be buried somewhere else).

I, then, proposed the following exercise for his students (which he apparently liked, as he +1'd the suggestion):

Prove that the recurrence relation of the \(p_i\)'s and \(q_i\)'s that we mentioned before can be obtained via matrix multiplication. More precisely, prove that: \[ \begin{bmatrix} a_0 & 1\\ 1 & 0 \end{bmatrix} \begin{bmatrix} a_1 & 1\\ 1 & 0 \end{bmatrix} \cdots \begin{bmatrix} a_n & 1\\ 1 & 0 \end{bmatrix} = \begin{bmatrix} p_n & p_{n-1}\\ q_n & q_{n-1} \end{bmatrix}. \] As a corollary, derive Cassini's identity for the Fibonacci Numbers.
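
Here is a small, self-contained Python check of the identity (again, my own illustration): it multiplies the coefficient matrices for a short expansion and compares the result with the values given by the recurrences.

    # Quick check of the matrix identity above (illustrative only).
    def mat_mul(A, B):
        """Multiply two 2x2 matrices represented as nested tuples."""
        return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
                (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

    def convergent_matrix(coeffs):
        """Return the product of the [[a_i, 1], [1, 0]] matrices."""
        M = ((1, 0), (0, 1))        # 2x2 identity matrix
        for a in coeffs:
            M = mat_mul(M, ((a, 1), (1, 0)))
        return M

    # For [1; 2, 2, 2] (a prefix of sqrt(2)'s expansion), the recurrences give
    # p: 1, 3, 7, 17 and q: 1, 2, 5, 12, so the product should be ((17, 7), (12, 5)).
    print(convergent_matrix([1, 2, 2, 2]))

Taking determinants on both sides (each factor has determinant \(-1\)) gives \(p_n q_{n-1} - p_{n-1} q_n = (-1)^{n+1}\), and specializing to all \(a_i = 1\) yields Cassini's identity.
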


Please, if you see any errors in this, let me know so that I can fix them.

Edit: Thanks "noone" for spotting a typographical error.

Some new Youtube-dl functionality

I wrote in a previous post that Youtube changed their way of delivering videos, with the use of Dynamic Adaptive Streaming over HTTP. On top of that, they started serving both the audio and the video in separate streams, which meant trouble for downloader tools like youtube-dl.

As I mentioned in that previous post:

What does this mean in practical terms for users of youtube-dl? Well, if you wanted to download videos in resolutions like the 480p (format 35) that I mentioned, then you will probably have to change your way of doing things, until a more automated solution is in place.

You will have to download both the audio and the video and, then, "combine" them (that is, multiplex them) to create one "normal" video file with both the audio and the video.

And later, I wrote:

Otherwise, to download 480p videos (which I do for lectures and so on with other projects of mine, like edx-dl) I have to call youtube-dl twice: once for format 135 and another for format 140, since the old (?) format 35 files are much smaller than the lower resolution 360p files (due to the former being encoded in High profile vs. the latter being encoded in Constrained Baseline profile).

(...)

Well, now, we don't have this problem anymore:

The new release of youtube-dl brings us many goodies, including the ability to automatically combine/merge/multiplex the audio and video formats that Youtube now offers separately (see the previous comments about separate audio and video).

Now, if you want a 480p video in H.264 format, High profile, with 128kbps AAC audio (this used to be Youtube's format 35), you can specify -f 135+140 on the command line and it will download both the audio and the video and multiplex them with ffmpeg (or avconv, depending on what you have installed).
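
For instance, an invocation along these lines (the URL is just a placeholder) downloads the two streams and produces a single merged file:

    youtube-dl -f 135+140 'https://www.youtube.com/watch?v=VIDEO_ID'
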

Besides being convenient, this automatic downloading and merging makes it unnecessary to write scripts to, say, retrieve all the videos in a playlist that contains a lot of lectures (see one example here, taught by Benedict Gross).

Version 2014.02.17 of youtube-dl will be soon in your favorite mirror of the Debian archives.

Miscellanea

Init system

I am happy that Debian's Technical Committee was able to reach a very good decision with the choice of systemd as the default init system for the Linux ports.

In particular, I was very pleasantly surprised by the lucidity of Russ Allbery's analyses and, above all, by his patience even with people who, in my understanding, were just being trolls or trying to cause confusion and disrupt a process that was already chaotic.

Russ, thank you for your exemplary role.

(Also, do you type fast or what? You write whole e-mails seemingly faster than I can read them!)

Unexpected union of forces

It seems to me that Mark Shuttleworth's decision to (perhaps?) embrace Debian's decision on the init system is a good thing.

I think that one area where Ubuntu did a great thing with respect to Debian was their initial attitude of cherry-picking the best-of-breed Free Software from Debian's archive and polishing it to create a really "for human beings" operating system. I hope that this means that, at least in part, money (Mark's?) can be channeled in the best interest of all of us, so that we all gain from it.

(And, to be honest, I see no problem in making money with Free Software and I welcome a well done integration job).

More polarizing things

I feel that, now, I may even have some hope of sending another potentially polarizing bug to the Tech Ctte (that is, ffmpeg vs. libav) and having faith that a rational decision will be made.

I was not even aware of it, but the ffmpeg folks even featured a news item on their page about me having filed the RFP bug mentioned above.

DebConf 14

If everything goes as expected, I hope to attend this year's DebConf. In fact, I am looking forward to it, since I would love to, e.g., talk with Joey Hess and discuss a better way of making GNU parallel and the parallel program in moreutils amicably co-installable, among many, many other things with respect to my packages.

And, quite probably, work on packaging ffmpeg with other interested people (who seem to be numerous).


Edit: Fix markdown link pointing twice to the ffmpeg site. Thanks Marius Gedminas for pointing this out.