Waiting for the jobs to finish

Thoughts on science and tips for researchers who use computers

Reading list

March 21, 2012 — Carles Fenollosa

Lately I have been stumbling upon very interesting articles, the kind that take half an hour to read but are worth every second spent. If you're like me, they are the perfect reads for those times when the computer is compiling, running unit tests or waiting for the jobs to finish.

Initially, I wanted to write a short review of each and post it here, but that might not be very appropriate, as I don't want to turn this blog into a tumblr-like site with random content. However, I use a very cool service named Readability, which lets you save articles to read later, create lists of favorites, and automatically provides an RSS feed for each list, which makes it very convenient to use with Google Reader.

That led me to create a list of favorite articles with links to long, insightful walls of text covering topics like computing, science, startups and the internet, which I want to share with you. I curate content from sources like Reddit or Hacker News, so if you like what you read, I recommend checking them out too.

If you're into RSS, please subscribe to its feed. I promise not to overload you with references.

Should you want to share a link with me, feel free to drop me an email or send me a tweet. I am always on the lookout for insightful articles.

SQLite: a standalone database for your application

November 10, 2011 — Carles Fenollosa

We researchers are used to storing data in plain text formats, because they are very easy to parse and work with. While this is appropriate for some data types (and, I'd add, very useful for sending to R later), in some cases disk access is slow or just inefficient.

This topic is actually very important for some projects, as records stored in a plain text file are very slow to query afterwards. And this is the key question to ask ourselves before considering a database: how will the data be accessed? Databases are great for complex, unordered queries, but not so great for sequential access to raw data. Let's see an example.

Take a data file which stores atom coordinates, for example, from a Molecular Dynamics simulation. This data is very likely to be read once, sequentially, then processed in memory. The information represents a matrix which will be processed by mathematical functions. This is the classic example where data files (either binary or plain text) are used correctly.

But now let's think of a list of proteins and some properties, for example, molecular weight and number of available 3D structures. All these objects are independent; each one is an entity of its own. While you can store a text file with one line per <protein, weight, structures>, it makes more sense to store it in a database.

Databases allow complex queries to be resolved very quickly. For example: "give me all proteins with molecular weight > 50,000", "list all proteins which have no crystal structures", or "print all the proteins which have duplicate structures". Were we working with a text file, we would need to process it completely every time we perform a query. That's very, very slow. Databases internally store the information in such a way that queries don't need to go through all elements to get the answer. Namely, they keep indices over the data, organized as trees.
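To make the first two queries concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite itself is introduced further down). The table layout, the protein names and the weights are all made up for illustration; only the shape of the queries matters.

```python
import sqlite3

# Hypothetical records: (name, molecular weight in Da, number of 3D structures).
PROTEINS = [
    ("protA", 14300.0, 3),
    ("protB", 66500.0, 0),
    ("protC", 52100.0, 12),
    ("protD", 23800.0, 0),
]


def build_db():
    """Create an in-memory database holding the example proteins."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE proteins (name TEXT PRIMARY KEY, weight REAL, structures INTEGER)"
    )
    conn.executemany("INSERT INTO proteins VALUES (?, ?, ?)", PROTEINS)
    # An index on weight lets the engine avoid scanning every row.
    conn.execute("CREATE INDEX idx_weight ON proteins (weight)")
    conn.commit()
    return conn


def heavy_proteins(conn, threshold=50000):
    """Names of all proteins with molecular weight above the threshold."""
    rows = conn.execute(
        "SELECT name FROM proteins WHERE weight > ? ORDER BY name", (threshold,)
    )
    return [row[0] for row in rows]


def without_structures(conn):
    """Names of all proteins which have no 3D structures."""
    rows = conn.execute(
        "SELECT name FROM proteins WHERE structures = 0 ORDER BY name"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_db()
    print(heavy_proteins(conn))
    print(without_structures(conn))
```

With a text file, each of those two functions would be a loop over every line. Here the query only states what is wanted, and the engine decides how to find it.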

How do indices work? It's a complex issue but let's think of a very basic example. Let's say you have three protein structures (1BCD, 2KI5, 1AGI) which you want to index by name and molecular weight. The system will then automatically build a protein binary tree where 1BCD is the parent, the left child is 1AGI and 2KI5 is the right child. Then, it will create another tree where the left child is the lightest protein, the parent is the middle one, and the right child is the heaviest one.

If the index tree is always kept sorted, so that the left child comes alphabetically before the parent and the right child always comes after it, then we can access any element or group of elements not only without checking every item but also in logarithmic time. Databases do this once for every index you configure, so complex queries can be solved super fast because for each of them the system only needs to process a few items of the many millions you might have stored in the DB. That's because every time you jump to a child element, the system skips half of the remaining data: first a half, then a half of that half (1/4), then 1/8, etc.
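The halving argument can be sketched in a few lines of Python. This is only an illustration: a sorted list stands in for the tree, and the function counts how many keys it inspects. Real engines use more elaborate structures (SQLite, for instance, stores its indices as B-trees), but the logarithmic behavior is the same idea.

```python
def binary_search(sorted_keys, target):
    """Look up target in a sorted list.

    Returns (position, steps), where position is None if the key is
    missing and steps is the number of keys that were inspected.
    """
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1  # discard the lower half
        else:
            hi = mid - 1  # discard the upper half
    return None, steps


if __name__ == "__main__":
    # The three structures from the example, sorted by name.
    print(binary_search(["1AGI", "1BCD", "2KI5"], "2KI5"))  # (2, 2)

    # A million keys never need more than 20 inspections.
    print(binary_search(list(range(1_000_000)), 765_432))
```

Twenty inspections instead of up to a million is the whole point of an index.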

To summarize, if you have some data where each record is an entity of its own (i.e. can be thought of as an "object") and you think you'll make queries which retrieve an arbitrary number of the elements, then you need to use a database. Databases have even more advantages, like relationships between objects (e.g. each crystal structure is its own entity, and can be related to a protein), but database design is a complex topic and this article will cover only basic data storage.

However, databases are usually configured by the system administrator and handled by a daemon (Oracle, MySQL, PostgreSQL). Here I will talk about yet another way of creating databases, one that doesn't require starting any daemons or holding any special user privileges and, more importantly, is easily portable. This is done via SQLite.

SQLite is a library that implements a SQL engine inside your own application. This means that while the database is persistent inside a file, all the querying infrastructure is deployed along with your code and stopped when the code finishes running. Databases can be created very easily, which makes it simple to have multiple DBs for testing, without the need to bother the system administrator.

SQLite has bindings for almost all popular languages and also a command-line interface which is handy for testing and debugging. The data is stored in a single file which can be deployed with your application without needing to install any standalone servers. Obviously, it is not a replacement for Oracle's solutions, but it can greatly speed up applications which need to query data and don't have access to a database server.
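As a rough illustration of the "single file, no daemon" point, the sketch below writes a record from Python, closes the connection and reopens the file as a fresh program would. The file name, table and values are hypothetical.

```python
import os
import sqlite3
import tempfile


def save_and_reload(db_path):
    """Write one record to db_path, then reopen the file and read it back."""
    # First connection: creates the file if it doesn't exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proteins (name TEXT PRIMARY KEY, weight REAL)"
    )
    conn.execute("INSERT OR REPLACE INTO proteins VALUES (?, ?)", ("protA", 14300.0))
    conn.commit()
    conn.close()

    # Second connection: nothing was left running, the data lives in the file.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, weight FROM proteins").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "proteins.db")
        print(save_and_reload(path))
        print(os.path.isfile(path))
```

Copying that one file to another machine is all it takes to move the database.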

Most popular software uses some kind of database to store data, as this is a super fast way to access preferences and other items. For scientific programs, it is always necessary to think twice before using one, as database design is an art of its own and, as said before, it does not suit all needs.

When used properly, a small ad-hoc database like SQLite can speed up software, make data access very easy and allow the manipulation of large, object-like, interrelated data collections with simple queries instead of writing long and slow algorithms which process all the data when you only need one item.

Cloud tools: Google Docs is now good enough

October 26, 2011 — Carles Fenollosa

We've recently been bombarded with the concept of "the cloud". Every company is using it now as an advertising term, so it's losing a bit of its semantic value. But what does it really mean?

Cloud computing means a lot of things, from distributed computing to distributed storage, ubiquitous documents and collaborative work. Basically, if your data is stored somewhere on the net, and you can access it from anywhere and, optionally, work on it simultaneously with collaborators, then it's "on the cloud".

There are many cloud applications, but today I want to bring some attention to one which is mature enough for everyday use: Google Docs. It's so good that I've been using it for many years now, and there have been only a few occasions where I had to go back to local software to create a document. Those have been very specific cases where I needed to produce high-quality leaflets or presentations which require professional software.

Speaking of professional, the quality of GDocs is, at least, good enough. At most, it is excellent. The spreadsheet editor is probably the most basic product, but the documents and presentations are great and they can perfectly well be used for any serious work.

However, the biggest advantage of using a cloud product is collaboration. There are some "revision" tools for MS Word which are good, but nothing beats several people editing a document at the same time and rolling back to a previous version like on a wiki. Collaborative document editing is definitely Google Docs' strongest point.

Obviously, there are drawbacks. The biggest ones are the inability to access your data when there is no internet or a server error (Google is known for its stability, but it sometimes crashes, believe me) and the lack of features compared to a desktop solution like LibreOffice/OpenOffice or MS Office. Quick tip: remember to activate Google Docs' "offline access" to keep a local copy of your documents even if the internet goes out.

A controversial point is that of privacy. Some institutions force workers to use local tools for intellectual property or trade secret reasons. If that's the case, end of story, but keep in mind that once a document leaves the local mail servers all privacy is lost anyway, for example when one of the collaborators uses a different email server, like Gmail or that of another university.

In a scientific environment it is still common to send documents to different people by email, then merge the changes manually and usually lose some of the revisions in the resulting mess. That needs to stop. While the final version will probably be edited with desktop software, there is no reason for a manuscript not to be produced with collaborative, "cloud" tools.

My advice here would be to give Google Docs a try, because it has changed a lot since its inception and it is nowadays an excellent editor, a backup solution, a ubiquitous server and a collaborative platform which will save you a lot of time and hassle.

Which is the best programming language?

September 23, 2011 — Carles Fenollosa

This classic question from beginners who start coding their own tools for research has only one correct answer: it depends.

If there is no language which is clearly better than the others, why do I have this very simple table in my Unix section?

  1. Bash, awk are your first choice
  2. PHP if you require more power
  3. C only if you know why
  4. Use Java & Eclipse

Don't get me wrong, this table has been compiled from quite a few years of experience, and there are huge assumptions behind it. The first one is that you still don't know what language you should use. If there is no doubt, either because the project has some requirements or because there is a language designed specifically for that task (e.g. CLIPS for declarative programming), go with that one. Otherwise, you need to ask yourself some questions.

I'll start by enumerating the most popular language choices and some of their features.

Scripting languages

Scripting languages are good for small or medium projects, because they fit very well with the programmer's line of thought. This means you can program while you think, which isn't the best for cleanliness, but it gets the work done quickly.

  • bash is always your first choice. You already know how to run stuff in the command line, right? So this is basically the same. bash can handle functions and arrays, but that's pretty much all it can do. However, that's usually good enough for small routines, and you can always call other Unix tools. It also avoids the overhead of running another binary (perl, php) as it is already in memory.
  • perl is great to parse text, but slow for anything else. If you need to parse text and do math, use php, which has a faster math engine. perl's object support is also clunky compared to php's. However, there are very good scientific libraries for perl, so you might be forced to use this language anyway.
  • php has nice libraries to connect to databases and in general do web stuff. It is also object oriented, so it is a suitable candidate for small-medium projects which can benefit from object orientation but don't need all the infrastructure from java or C++. In general, unless you are tied to perl, php is a better choice.
  • python is, well, another scripting language. It's way better than perl, and functionally similar to php, so you might want to use it if you like its clean syntax or need to call other python libraries.
  • ruby is so painfully slow that you should avoid it at all costs. I am including it here only to warn other people against using it.

Compiled languages

Once you start compiling code, things get complicated. However, the results are usually great, fast, and very maintainable. Let's discuss the alternatives.

  • java: why java first? Because it's the most appropriate. It has great development tools (Eclipse), it checks a lot of stuff at compile time, it does not require the programmer to use pointers (it uses pointers internally, but transparently) and in general is a modern, object-oriented language, which doesn't require legacy stuff like headers. Yes, it is a bit slower than pure C, but the latest versions of the java virtual machine compile to machine code at runtime and achieve great performance. Most computer scientists have mastered it and in general it is widely used. It is versatile and can be used for anything from simple routines to web pages with JSP, to CRMs. Yes, I like java.
  • C is the mother of all programming languages, but this does not mean that it's the best one. It's old, doesn't have objects, and for every byte optimization which earns 1 second of execution time, the programmer needs to waste ten minutes. Optimizations should be done at the compiler level, not the code level. However, C has great compilers, from the good-enough gcc to the awesome icc.
    My recommendation is that you use it only if you know what you're doing. It's awful to parse strings in C, it lacks many scientific libraries compared to perl or java (except math functions, but that's what R is for), and segmentation faults can make you waste several days digging through the code because you declared a variable incorrectly and ended up with an invalid memory access.
    Some might argue that C is as good as the programmer is, but honestly, it makes good programmers waste a lot of time because of small issues.
  • C++ is the alternative if you need to use C in an object-oriented environment. The compiler is also able to run more checks at compile time, so you'll waste less time, but I'd go for java anyway. There is no reason to choose C++ a priori other than execution speed.
  • objective-C: Apple users are sometimes tempted to write obj-C code, because of the excellent development tools on a Mac, but keep in mind that there is probably nobody else who can look at that code afterwards and understand it, because almost nobody uses obj-C. So I'd suggest not using it unless you're in a hardcore Mac environment or are planning to develop a GUI for a Mac afterwards.
  • fortran: there are only two kinds of people who use fortran: physicists and the poor fellows who have to maintain their code afterwards. It was designed for the computers of the 50s, which means that using it nowadays is like fitting steering wheels from Henry Ford's era on today's cars. It is easier to understand f2c generated code than the original one. There is not a single reason to use fortran. If you need raw speed use C. If you want to write unmaintainable code, well, use obfuscated perl.

Choosing a language

Now for the difficult task of choosing a language. If you look again at the four items on top of this post, having read the language descriptions above, you might start to see what's going on. This is basically a matter of choosing the right language for your specific task, guided by a few questions.

What are my time constraints? Beginner programmers often forget that, for homemade software, the total time is the time you spend programming plus the time you spend running it. If the routine is expected to run for 10 minutes, don't waste two hours writing a C program with pointers; write a simple script.

Can I solve it with a simple script? If the answer is yes, use bash. It's a great scripting language, and you can build on top of other Unix tools, like awk, sed, etc.

However, keep in mind that every time you call an external program, the system needs to fork() and, for large loops, this can be a huge overhead. Be rational, and think again of the execution time. Instead of launching sed 10,000 times to parse lines, it might be better to write a php script, which is more powerful than bash, won't need to fork() and will probably be simpler.
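The same idea, sketched in Python rather than php (any of the scripting languages above would do): the regular expression is compiled once and applied to every line inside a single process, so no external program is launched per line. The whitespace-collapsing rule is just a stand-in for whatever sed would have been doing.

```python
import re

# Compiled once, reused for every line.
WHITESPACE = re.compile(r"\s+")


def normalize_lines(lines):
    """Trim each line and collapse runs of whitespace into single spaces.

    Everything happens in this one process; the shell-loop equivalent
    would start a new sed for each of the lines.
    """
    return [WHITESPACE.sub(" ", line.strip()) for line in lines]


if __name__ == "__main__":
    sample = ["ATOM   1   N\tMET\n", "  ATOM   2  CA  MET  "]
    for line in normalize_lines(sample):
        print(line)
```

For 10,000 lines this is one process start instead of 10,000, and the loop itself is a single readable line.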

Will I need to maintain it or reuse the code? Will the code grow? If you think this code can be reused as a library, or integrated into other modules, think of making it into an actual C library or java class. Running scripts within scripts within a big project is generally a bad idea. And please keep in mind that, in a research environment, at some point another person will need to look at your code, so besides writing clean and understandable code, try not to use obscure languages or tools which only you know of.

Do I need to achieve the maximum speed and/or optimizations? Keeping in mind that the latest versions of the java virtual machine are pretty fast, yes, the winner here is C. But we're talking about software which can take you two weeks to code, and which would take months to run if written in perl, but only takes three hours when coded in C. When this happens, choose C.

Will it need to run on different platforms? java and the scripting languages are the only ones which run essentially unchanged in every environment: Windows, Mac, Linux, Solaris, BSD and others. C can be compiled for different architectures, but it's sometimes hard to replace mmap calls on Windows or to compile against different versions of libc on different flavors of Linux.

Summary

Let's review the four initial points again.

  1. bash is a great initial choice for small projects which will take about 20 minutes to run and which you don't want to waste three hours programming.
  2. php is appropriate for medium projects, which use objects, parse text and do math. perl is another good choice at this point.
  3. C is better left for experts or people who need hardcore optimizations. The rest of us will leave optimizations for the compiler/interpreter and just try to write good code which runs in O(n) if possible.
  4. java is the king of tools and libraries, is multi-platform, scales great for big projects, is surprisingly fast and is very forgiving to novices. Its only drawback is the need for a java virtual machine, but hey, if you use perl you will need its libraries installed, too.

In the end, everyone has their preferred languages, which is fine. It is far more important to write good code than it is to choose the language which fits a task best. However, failing to foresee the importance of a math routine and writing it in perl can lead to the whole research group wasting time until somebody else writes it in C and makes it 1000x faster. Yep, true story. So choose wisely.

The advantages of using 'screen'

September 22, 2011 — Carles Fenollosa

There's a reason I posted a screen cheat sheet on my home page, and it's because I use screen so much that I actually needed one for myself.

screen is a tool which isn't suited for all audiences. Its home page states that it is a "window manager" which multiplexes terminals. That basically means it's a terminal manager, and all graphical environments have terminal managers, right? So why do we need another one which has complicated keybindings and doesn't let us use the mouse?

Well, everyone might have their own reasons, but for scientists there is one very clear advantage: working on the same shells from different terminals. What? Let's look at an example:

You usually have a computer from which you work every day. From that computer you ssh into different servers, clusters, etc. So, in the end, most of us have a monitor full of xterms—or any other terminal emulator. That's fine. But then we arrive home and want to check if that job we launched has crashed, because we need the results tomorrow morning. So what do we do? We ssh into our machine, or the server, open another session (that's the key here) and run some commands.

The problem is, you already had your sessions configured, maybe even with some environment variables set, the history full of commands and the terminal buffer holding some output, and now you can't recover any of that. What if you could restore the same sessions and windows you had at your work computer without opening VNC or any other graphical screen-sharing tool? That's what screen does.

To put it plainly, screen is a daemon that handles terminal connections, which means that you can always request the same session that you had open before. Obviously it can't manage sessions unless you open them via screen, so it's a good idea to always run your terminals through it, just in case you might need to access them afterwards.

Besides ubiquitous access to your sessions, screen also provides other interesting features that make it useful even if you don't need terminal multiplexing. The first is notifications for terminals which are in the background: for example, a visual message appears on your screen if there is new output in a hidden window. This is very handy to avoid constantly checking minimized windows.

It also provides window management, i.e. many terminals in a single window, and quick switching between them. Finally, even in today's GUI world, sometimes you need to copy and paste without a mouse, scroll back through the buffer, take text "screenshots", etc.

I'll finish with a quick tutorial, using an example that actually happened yesterday.

  • carlesfe@work:~$ screen
  • [screen opens, and gives me a shell]
  • carlesfe@work:~$ ssh server.com
  • carlesfe@server:~$ work on some commands... run scripts...
  • carlesfe@server:~$ launch_a_job.sh
  • [output is being generated]

Then I left work and went home. Notice that the output is still being generated, but only in the window that I had opened at work. Usually there are some other ways to check if the job is still running, but what if it crashes? At home I wouldn't normally be able to recover the error message that was printed on the screen. But with screen session management, I can do this:

  • carlesfe@home:~$ ssh work
  • carlesfe@work:~$ screen -x

And voilà! I can see the exact same window with the same session I had at work. Now I can monitor what's happening, relaunch the command easily if it crashed, and see any error messages that were output on the screen.

I started by saying that screen isn't a very user-friendly tool, but if you feel comfortable with many of the Unix tools out there, you'll get used to it very quickly. Check out this tutorial for a beginner's introduction and give it a try for a few days. Once you get used to the basic C-a keystrokes and terminal management, you'll wonder why you didn't discover it years ago.

Creating a simple blog system with a 500-line bash script

September 20, 2011 — Carles Fenollosa

One of the drawbacks of data analysis is that sometimes there is a task that blocks you and you need to wait for a couple of days to continue working on that project. Yes, there are more projects, and there are always papers to read, but it's fun to start small projects that can be finished in an afternoon (let me clarify: a long afternoon).

So here it is, BashBlog v1.0, a Bash script that handles blog posting. It features:

  • Simple editing of the posts with your favorite text editor
  • Transformation of every post to its own html page, using the title as the URL
  • Generation of an index.html file with the latest 10 posts
  • Generation of an RSS file! A blog's magic is the RSS file, isn't it...?
  • Generation of a page with all posts, to solve the index.html pagination problem
  • Rebuilding the index files without the need to create a new entry
  • Google Analytics support
  • Auto-generated CSS support
  • Headers, footers, and in general everything that a well-structured html file has
  • xhtml validation, CSS validation, RSS validation by the W3C
  • Everything contained in a single 500-line bash script!
  • A simple but nice and readable design, with nothing but the blog posts
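To give a feel for how little logic two of these features need, here is a rough sketch in Python. It is not BashBlog's code (that is a Bash script): the function names, the slug rule and the (date, title) tuples are assumptions made purely for illustration.

```python
import re


def title_to_filename(title):
    """Turn a post title into a URL-friendly html file name.

    Assumed rule: lowercase, with every run of non-alphanumeric
    characters replaced by a single dash.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug + ".html"


def latest_posts(posts, limit=10):
    """Return the newest posts first, keeping at most `limit` of them.

    Each post is a (date, title) tuple with an ISO date (YYYY-MM-DD),
    so plain tuple sorting orders them chronologically.
    """
    return sorted(posts, reverse=True)[:limit]


if __name__ == "__main__":
    print(title_to_filename("The advantages of using 'screen'"))
    posts = [
        ("2011-09-20", "Creating a simple blog system with a 500-line bash script"),
        ("2011-09-22", "The advantages of using 'screen'"),
        ("2011-09-23", "Which is the best programming language?"),
    ]
    for date, title in latest_posts(posts):
        print(date, title_to_filename(title))
```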

What doesn't it support? Of course, comments. Comments would greatly increase the code complexity, would require a strong antispam system to avoid flooding and might pose a security issue on the server. So sorry, but no comments, at least until I have another free afternoon waiting for some computations to finish.

I wanted to post something to celebrate the grand opening, and what's better than a meta post explaining the script that generates this blog?

In the future I'll post some scientific tips or tutorials which don't really fit on the home page and maybe some rants about my work. Finally, if you are interested in bioinformatics or scientific computing in general, check the unix section of my home page for more tips, tutorials, FAQs and cheat sheets.

I won't post on any regular schedule, so you can always subscribe to the RSS feed to get the updates delivered to you when they're generated.

Hope to have you around sometime!

Edit: If you use the script for your own blog, please send me an email with your suggestions or drop me a line on Twitter! It will be much appreciated and, until comments are implemented, it is the only way to provide feedback.
