Since Mike the Mad Biologist appears to have called me out on my post about data release just before he entered the witness relocation program, I thought I might articulate my position a little more clearly. Maybe he'll join in the conversation as well, we'll see.
IMO, the sequencing centers and those making decisions on how data should be released are A) considering only one model of scientist (the NIH R01 investigator) and B) making these decisions based on the landscape of five years ago.
As for A, I doubt that is going to change, given the money behind the R01 game. That said, I think it is useful to be aware that not everyone is looking to resequence the same shit over and over, looking for SNPs that might mean something. There is quite a bit of work being put into sequencing novel things that are not at all related to other sequenced organisms. It is these projects that face the biggest challenges when it comes to instant data release, as I talked about on Mike's blog. They require more work to assemble, more work to annotate and are arguably more interesting to potential data poachers. If you have, say, a critical cell adhesion protein you have identified in sloths and sea squirts that you think has to do with the development of multicellularity, how fast are you going to go look in the kelp genome, knowing it is an independent evolutionary line of multicellularity? If someone sequences yet another species of sloth, you wouldn't be as interested, because you already have the gene spanning sea squirts to sloths, right?
With regards to B, I don't have to tell you that genomics is rapidly changing to the point where sequencing a genome in one's own lab is not only feasible, but in some cases easier than farming it out. If you talk to most people involved in genome sequencing at a major center, you might come away with a different definition of 'free data' than you had previously.
Getting sequencing done for 'free' is a big deal, but not as big as it used to be. At the same time, there are more capable bioinformatics people to work on this stuff than a few years ago, the software is better and the data are better. On the flip side, there are more people combing through genomic datasets to advance their own work than ever before. There are labs that literally sit at the end of genome center pipelines waiting for data, in the same way that fish swarm sewage outfalls. What, then, is the motivation for people to have sequencing projects done by publicly funded centers where the data are dumped to the public right away? What will it be in two years?
This brings us back to the comment by Bob Carpenter on my previous post:
I’m confused. If the data’s publicly available, what’s unscrupulous about using it? Is it just people violating licensing terms like embargoes?
The first question gets to the root of my point. People feel like anything that is public is free to use, and maybe they should. But how would you feel as the researcher who assembled a group of researchers from the community, put a proposal together, drummed up support from the community outside of your research team, produced and purified the sample to be sequenced (which is not exactly just using a Sigma kit in a LOT of cases), dealt with the administration issues that crop up along the way, pushed the project through the center (another aspect that is woefully underappreciated), got your research community together once the data were in hand to make sense of it all and herded the cats to get the paper together? Would you feel some ownership, even if it was public dollars that funded the project?
Now what if you submitted the manuscript and then opened your copy of Science and saw that the major finding you centered the genome paper around has been plucked out by another group and published in isolation? Would you say, "Well, the data’s publicly available, what’s unscrupulous about using it?"
It is really easy for people to say, "well fuck those slow bastards, they should have published faster!" but let's couch this in the reality of the changing technology. If your choice is to have the sequencing done for free but risk losing it right off the machine, OR to do it with your own funds (>$40,000) and have exclusive rights to it until the paper is published, what are you going to choose? You can draw the line between big and small centers or projects all you want, but it is becoming increasingly fuzzy.
This all gets back to my point: if major sequencing centers want to stay ahead of the curve, they have to have policies that encourage, not discourage, investigators to use them. If you think the technology won't be available in 5 years to sequence a genome for a couple grand in individual labs, then you have your head in the sand. Without constant re-evaluation of their mission and how they need to serve the community, that kind of competition could be the meteor that forces the shift from big centers that need tons of resources to survive to small furry independents scattered all over the world.