serious

EMBL queries (NB : Monstrous ego post)

This is another one that's mainly for my own reference, later CV-building and/or self-aggrandisement.

Prompted by Liam's post, I've been having a quick shufti at the sequence databases - the main EMBL[1] database in particular. All my work from the Sanger's in there, which I think is around 200 database submissions. Time was I had internal access to the tracking databases and could just wave a little homebrewed piece of SQL across them to find out what my running totals were, but those days are long gone now. I'd like to have a list of my finished projects, though, together with figures on what the total coverage was, which organisms and chromosomes they were on, and the like.

After weeding out the contributions of a GC Clark from Tigger (who used to have the most risible logo for any scientific institution anywhere ever - a tiger climbing a double-helix[2]) it comes down to 257 - slightly more than I was expecting, but almost plausible. They all, so far, seem to be from the right organisms - mouse, zebrafish and human - and in the right timeframe - '97 to '03 - so maybe I really am a 25% better finisher than I'd been assuming. It's slightly unnerving that I can still recognise a fair number of the clone-numbers, and some features of the odder ones. Someone at the EBI seems to have been using the EMBL header for one of them (a big bugger called bK250D10) as an example for some purpose.

Right. Some of them aren't mine - at the end there are some plant and entamoeba ones from other G Clarks. That leaves 152 human, 38 zebrafish (Danio rerio) and 42 house-mouse (Mus musculus). 232 in all. A good score, I'd say.

At some point, I should get them locally in machine-readable form, tot up totals and find out what percentage of the human genome I had editorial responsibility for. Less than one percent, certainly, but is it a half of that or a third?
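For the totting-up, something like the following might do as a first pass. It's only a sketch, and it assumes the entries are sitting in a local EMBL flat file where each record has an "ID" line ending in "<length> BP." and an "OS" line naming the organism. The function name tally_bases is made up for illustration, not anything the EBI provides.

```python
import re
from collections import defaultdict

# Assumption: every entry's ID line ends with "<length> BP."
ID_LENGTH = re.compile(r"(\d+) BP\.\s*$")


def tally_bases(lines):
    """Sum sequence lengths per organism from EMBL flat-file lines.

    Hypothetical helper: reads the length from each ID line and credits it
    to the organism named on the first OS line that follows.
    """
    totals = defaultdict(int)
    length = None
    for line in lines:
        if line.startswith("ID "):
            match = ID_LENGTH.search(line)
            length = int(match.group(1)) if match else None
        elif line.startswith("OS ") and length is not None:
            organism = line[2:].strip()
            totals[organism] += length
            length = None  # count each entry once, even if OS repeats
    return dict(totals)
```

Feeding it an open file handle should work, since it only iterates over lines. Entries whose ID line doesn't match the assumed pattern are silently skipped, so it would be worth checking the number of entries counted against the expected 232.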

After some preliminary adding of the human projects, about 14 885 000 bases. Pushing half a percent. The infamous Alan Tracey managed twice that, but we're not all superhuman.

Supplementary : I am reminded that the project aimed to sequence euchromatic areas only - according to this paper from last year[3] that comes to (build 35) 2.85 gigabases, which makes my contribution to the finishing 0.522%, roughly. I think "half a percent" could end up being my epitaph. There's a page here about the extent to which the Project achieved its goals.
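For the record, the sums behind that figure, as a quick sketch. The numbers are the ones quoted in this post; the variable names are just mine.

```python
# Project counts from the post
human, zebrafish, mouse = 152, 38, 42
total_projects = human + zebrafish + mouse   # 232

# Rough human total from my preliminary adding-up, in bases
human_bases = 14_885_000
# Euchromatic genome size quoted for build 35, in bases
euchromatic_bases = 2.85e9

share = human_bases / euchromatic_bases * 100
print(f"{total_projects} projects, {share:.3f}% of the euchromatic genome")
# prints "232 projects, 0.522% of the euchromatic genome"
```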

Bloc Party : seal of approval. Thanks, venta. Good call.

[1] EMBL : the European Molecular Biology Laboratory. Based in Heidelberg, it has a small number of outstations including one at Hinxton, next to the Sanger. They maintain one of the largest and most complete sequence databases available. As it was (and presumably still is) the primary deposit database for the Sanger, it's the obvious place for me to be looking for Sanger data.
[2] Not that there's any lingering animus, of course. Dearie me no.
[3] Yeah, yeah. One of mine. See very long list of authors for details. Someone might want to point out to Mr Greenhalgh that he's on it too, if anyone sees him.


I'm in danger of turning into Walter Sobchak, so I'll shut up now. Hopefully for a very long time.

Tank Girl was on the screen at Neon, and I got to wondering who Tank Girl's sidekick was played by, as she looked familiar. Naomi Watts, apparently. Who was in Mulholland Drive. Except in TG she's not even vaguely blonde. Or glam.
  • Current Mood: pleased
  • Current Music: Bloc Party - Luno
I should be able to help you if you need more details. Email me.
That's surely a pretty good claim to fame though. Not that I understand biological things well. But, the mapping of the human genome is one of the great scientific achievements of the last hundred years.
Thank you. We're very proud of it. I have the T-shirt somewhere, but I don't wear it very often.
The infamous Alan Tracey

What, not THE Alan Tracey... erstwhile pilot of Thunderbird 3?