Data release (blogwar or pillow fight?)

Since Mike the Mad Biologist appears to have called me out on my post about data release just before he entered the witness relocation program, I thought I might articulate my position a little more clearly. Maybe he'll join in the conversation as well; we'll see.

IMO, the sequencing centers and those making decisions on how data should be released are A) considering only one model of scientist (the NIH R01 investigator) and B) making these decisions based on how the field looked 5 years ago.

As for A, I doubt that is going to change, based on the money behind the R01 game. That said, I think it is useful to be aware that not everyone is looking to resequence the same shit over and over looking for SNPs that might mean something. There is quite a bit of work being put into sequencing novel organisms that are not closely related to anything already sequenced. It is these projects that face the biggest challenges when it comes to instant data release, as I talked about on Mike's blog. They require more work to assemble, more work to annotate, and are arguably more interesting to potential data poachers. If you have, say, a critical cell adhesion protein you have identified in sloths and sea squirts that you think has to do with the development of multicellularity, how fast are you going to go look in the kelp genome, knowing it is an independent evolutionary line of multicellularity? If someone sequences yet another species of sloth, you wouldn't be as interested, because you already have the gene spanning sea squirts to sloths, right?

With regard to B, I don't have to tell you that genomics is rapidly changing, to the point where sequencing a genome in one's own lab is not only feasible but in some cases easier than farming it out. If you talk to most people involved in genome sequencing at a major center, you might come away with a different definition of 'free data' than you had previously.

Getting sequencing done for 'free' is a big deal, but not as big as it used to be. At the same time, there are more capable bioinformatics people to work on this stuff than a few years ago, the software is better and the data are better. On the flip side, there are more people combing through genomic datasets to advance their own work than ever before. There are labs that literally wait at the end of genome center pipelines waiting for data in the same way that fish swarm sewage outfalls. What, then, is the motivation for people to have sequencing projects done by publicly funded centers where the data are dumped to the public right away? What will it be in two years?

This brings us back to the comment by Bob Carpenter on my previous post:

I’m confused. If the data’s publicly available, what’s unscrupulous about using it? Is it just people violating licensing terms like embargoes?

The first question gets to the root of my point. People feel like anything that is public is free to use, and maybe they should. But how would you feel as the researcher who assembled a group of researchers from the community, put a proposal together, drummed up support from the community outside of your research team, produced and purified the sample to be sequenced (which is not exactly just using a Sigma kit in a LOT of cases), dealt with the administrative issues that crop up along the way, pushed the project through the center (another aspect woefully underappreciated), got your research community together once the data were in hand to make sense of it all, and herded the cats to get the paper together? Would you feel some ownership, even if it was public dollars that funded the project?

Now what if you submitted the manuscript and then opened your copy of Science and saw that the major finding you centered the genome paper around had been plucked out by another group and published in isolation? Would you say, "well, the data's publicly available, what's unscrupulous about using it?"

It is really easy for people to say, "well fuck those slow bastards, they should have published faster!" but let's couch this in the reality of the changing technology. If your choice is to have the sequencing done for free, but risk losing it right off the machine, OR to do it with your own funds (>$40,000) and have exclusive rights to it until the paper is published, what are you going to choose? You can draw the line between big and small centers or projects all you want, but that line is becoming increasingly fuzzy.

This is all to get back to my point that if major sequencing centers want to stay ahead of the curve, they have to have policies that encourage investigators to use them, not scare them off. If you think the technology won't be available in 5 years to sequence a genome for a couple grand in individual labs, then you have your head in the sand. Without constant re-evaluation of their mission and how they need to serve the community, that kind of competition could be the meteor that forces the shift from big centers that need tons of resources to survive, to small furry independents scattered all over the world.

25 responses so far

  • Joseph says:

    [It is really easy for people to say, “well fuck those slow bastards, they should have published faster!”]

    Slow can also be a structural issue. The large, interdisciplinary team that was assembled to do a major genetics project can take time to reach consensus on a paper. Just circulating for comments can take a few weeks. A single analyst who grabs the data and puts out a paper is guaranteed to be more agile than a broad-based team.

    But that model penalizes the people who put in the work to develop a project. Sure, we have all seen papers sit far longer than is right or just; if this discussion helps us rethink how long a paper should be moribund before another group should have a chance at the data, then that is a good outcome. But the idea of turning publishing into a race seems to undercut the sort of collaborative structures that a lot of us work very hard to develop.

  • Markk says:

    This sounds like a straw man. If somebody takes the data from a group and pounds out a fast paper, is Science (or some other really high quality journal) really going to publish it without the author crediting the people who generated the data? Wouldn't the academic community punish an author who publishes the paper without consulting the original group, and who is evidently trying to scoop them?

    This should be up to community policing. If enough people don't care, then yes, the data generation will not happen in an open manner; or, I would guess, in reality new standards will be applied for crediting and funding data generation. I don't see the answer being "well, we'll keep the data secret for a year." If you want to do that, form a corporation, use private funds, and don't ask me, the taxpayer, to foot any of the bill.

    Now, I do not think everything in the lab notebook (or raw device output) needs to be public, and I think a better topic of discussion might be what standard formatting should be used to release various kinds of data. Publicly paid-for data is public data, though.

  • One way to resolve this problem will revolve around how scientists are rewarded for all of the 'pre-genome' steps you describe, especially when they're sequencing a small number of genomes (or a single one). A 'birth announcement' paper--one cited when other people use the data--would help PIs get the credit they duly deserve: it wouldn't just be "we sequenced these genomes" but would also include some preliminary, 'canned' analyses (I view this as analogous to a paper that describes a fossil for the first time).

    As to 'data rustling', I'm not sure how to deal with that, other than the 'birth announcement.' But I think what's happening here is a larger change in how genomics does science, which is that data analysis and 'upstream' data production will become more disjunct. It will be vital for bench/clinical scientists to find informatics collaborators, or develop those capacities internally.

    Regarding the centers, you're right: they do need to add value (and I think mine tries to do that) by assisting with analysis. But I also think what the centers will be doing in five years will be different. In the last three years, we've gone from a point where ten bacterial genomes were a big deal to where, if someone walked in today with ten, we would question whether it was worth our and the collaborator's time (if someone wanted two or three bacterial genomes done quickly, I would argue, if they have the money, that they should go to a private company. While the assembly and annotation probably won't be as good, the time to completion will be as fast or faster).

    Sequencing centers, in my opinion, are like battleships. They aren't quick and nimble, but when they target something, they can unload on it. Currently, a 'small' bacterial project involves dozens of strains, with hundreds as the new normal. The hard part isn't the sequencing per se, or even the library construction, both of which are automated; it's the informatics, processing, data storage, links to NCBI, and so on, done hundreds or thousands of times reliably.

    I think this presents an opportunity for small research groups and small sequencing centers, because they can be quick and nimble and very targeted. I don't see them as opposing each other, but complementing each other. So the big stuff (in terms of numbers, not significance) might be another future role for centers.

    p.s. Sorry about the slow response; science and life got in the way.

  • I would also add that the funding agencies need to hear these concerns. Believe it or not, we share these concerns, but they need to hear other people too.

  • proflikesubstance says:

    This sounds like a straw man. If somebody takes the data from a group and pounds out a fast paper, is Science (or some other really high quality journal) really going to publish it without the author crediting the people who generated the data? Wouldn't the academic community punish an author who publishes the paper without consulting the original group, and who is evidently trying to scoop them?

    There are multiple examples of this, and you can see the original thread for people who have had this happen to them weighing in. Journals don't have the time or resources to figure out who 'owns' what or what the ethics behind the use of each piece of data are. As for the community, how exactly does this get policed? Are there not labs in your field that people avoid collaborating with? Do they still exist? Take a lesson from ecology: every system has a small number of 'cheaters' who get by because their numbers are small and they can take advantage of the status quo.

    RE: the large centers, my personal opinion is that they need to invest more into the informatics side and become more service-oriented. By having bioinformatics people dedicated to one project, they will simultaneously move data out faster, better meet the needs of the majority of their clients (who are often biologists asking bioinformatic questions), and stay competitive with smaller centers with fewer resources, because of the added value. This could be done without losing much in the way of data generation (because the cost is dropping so rapidly), and the added speed would mean that a short data-exclusivity period could be implemented with little effect on the DNA-to-community-sequence timeline.

    if someone wanted two or three bacterial genomes done quickly, I would argue, if they have the money, that they should go to a private company. While the assembly and annotation probably won’t be as good, the time to completion will be as fast or faster

    I agree with the first part, not the second. I appreciate that many centers have a lot invested in their bioinformatic chops, but to think that they are the BESTEST EVER at assembly and annotation might have something to do with the water coolers being filled with Kool-Aid. There are people who know what they are doing who are not employed by a major sequencing center, and for the most part, they have access to the same software and may actually innovate a little too, in their spare time.

  • chemicalbilology says:

    These issues are a problem for the proteomics community, too. Or at least they increasingly will be as the data generation/data analysis disconnect is resolved. Right now, very, very few people are willing to share their proteomics data freely, and those who do are not necessarily able or willing to curate it usefully. But the data analysis burdens in proteomics are HUGE, and I think that the only way to adequately tackle all the information they could provide is by distributing it all out into the cloud for anyone to dig into.

    Think "PhD in Proteomic Bioinformatics from the University of Phoenix Online," never having to step into a lab but being able to mine into the existing dataset to explore novel hypotheses. I'd love to see that, man.

  • chemicalbilology says:

    Oh yeah, and this:

    I appreciate that many centers have a lot invested in their bioinformatic chops, but to think that they are the BESTEST EVER at assembly and annotation might have something to do with the water coolers being filled with Kool-Aid.

    for proteomics, too...

  • I was actually referring to what you get from the private companies. Regarding the software, the newest versions usually aren't publicly available, at least not right away (e.g., we'll be releasing a public version of a new annotation pipeline in a few months), although we typically release what we have as soon as it's been debugged and can stand alone.

    We don't fill the water coolers with Kool-Aid; we fill them with champagne instead....

    On a serious note, the reason a disproportionate amount of the development happens at the centers is that the developers don't just "actually innovate a little too, in their spare time"; it's all they do, since they're specifically funded to do this. I'm not claiming 'smarter' or BESTEST EVER, just better resourced and fortunate enough to be able to focus on specific tasks (and not be distracted by other obligations).

  • proflikesubstance says:

    Dude, I was being sarcastic. There are whole labs that are specifically funded to concentrate on sequence analysis methods.

  • hapsci says:

    Interesting. I am very much in support of all data being available freely; I believe it would speed up research dramatically. I am aware, however, of why people are so secretive about their data. If the way in which we published papers changed, would that alter the way in which we think about data? If all research papers were 'interactive' - so that clicking on the data in a paper linked you to the original data, also published online (in an e-lab book) under the name of the original researcher/group that did the work - would this kind of system protect the owner of the research but also allow others to use the data, combine it with their own data, and have their own ideas about it? These are some thoughts for the future, maybe; I know we are a long way away from this kind of system, but I think it is a good place to head for. I am all for data release and e-lab books, and I have blogged about some of my ideas about e-lab books here: http://sciencehastheanswer.blogspot.com

  • hapsci says:

    Interesting post, I believe that all data s

  • hapsci says:

    Oops, I pressed submit a bit early there. I believe that all data should be freely available. I understand why people want to hold onto the data they have, though. Maybe if we changed the way in which we published data, that might change the way people feel about holding onto it. If research papers were published as interactive papers, for example, clicking on a piece of data would link back to the original results (which could be stored online in an e-lab book). This would help protect the researcher who obtained the original data, but would also allow other groups to use other people's data along with their own data and their own ideas. I know we are a long way away from this, but it might be something to aim for in the future. I have blogged about my ideas for e-lab books here - http://sciencehastheanswer.blogspot.net
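
    To make the idea concrete, here is one minimal sketch of the kind of provenance record that could sit behind each clickable datum in an interactive paper. This is purely illustrative Python written for this comment; every field name, identifier, and URL below is invented, not part of any existing standard or tool:

        from dataclasses import dataclass, asdict
        import json

        @dataclass
        class DatumProvenance:
            """Hypothetical link from a published datum back to its source."""
            figure_id: str       # where the datum appears in the paper
            dataset_doi: str     # persistent identifier for the archived dataset
            elab_entry_url: str  # e-lab book entry describing how it was produced
            producers: list      # the researchers/group who generated the data

        # A reader clicks a point in Figure 2 and retrieves this record, which
        # credits the producers and links back to the raw data and methods.
        record = DatumProvenance(
            figure_id="fig2.panelB.point17",
            dataset_doi="doi:10.5555/example.dataset",             # placeholder DOI
            elab_entry_url="https://elab.example.org/entry/1234",  # hypothetical URL
            producers=["Original Lab, Example University"],
        )

        print(json.dumps(asdict(record), indent=2))

    However the details shake out, the point is that attribution travels with the datum itself, so using someone's data without crediting them becomes the harder path rather than the default.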

  • Rich says:

    The embargo policy only works if journal editors follow it. I recall a paper squeaking past an embargo recently, and then being retracted (can't recall the details now). Anyway, I think the embargo policy is the best bet. Allow people free access to the data immediately, but don't let anything be published until the moving wall has passed. This appears to be working for modENCODE (one of the larger biomed collaborative projects in the works).

  • Hermitage says:

    I think it has been mentioned on this blog and many others how embargo policies can fail. I've heard too many stories of blatant data falsification/data mining by other groups that end up in shady journal articles that promptly get published because of the last author. Journal editors have enough muckety-muck to check for without having to navigate umpteen different data center websites to ensure no embargo has been violated.

  • Janet D. Stemwedel says:

    Ping!

  • gnuma says:

    Here's an interesting change that is about to occur in many evolutionary-type journals:
    http://www.journals.uchicago.edu/doi/abs/10.1086/650340

    Archiving data is going to provide checks and balances, new meta-analyses will be possible, and some really cool skienze is going to happen. The idea behind this data archiving stems from GenBank -- the absolute go-to for any molecular work these days and an invaluable resource of archived data. The release of genomic sequence hot off the finisher's fingertips, however, seems a different beast than data archiving. Researchers archive their data when they have produced some type of finished product -- not before. If there is relatively little incentive to be had for producing the data (i.e., no first dibs on a decent paper), then I agree with PLS: the strategy would be to produce in-house rather than open yourself up to being scooped.

  • Thomas Joseph says:

    I'm late to this issue, but it's nothing new, really. Almost immediate data release (within 24 hours of assembly) on major sequencing projects was settled upon by sequencing centers back in the 1990s with the Bermuda Accord.

  • Mitzi Morris says:

    "There are labs that literally wait at the end of genome center pipelines waiting for data in the same way that fish swarm sewage outfalls. "

    Please tell us the names of these labs. I would like to get in touch with them, because it would be awesome to find out how to crunch any old dataset without having to waste time talking to biologists about what they did and why.

  • Mitzi Morris says:

    First off, apologies for the snarky comment above.

    Many other people have made good points about whether sequence data in and of itself constitutes publishable results, and that data sharing and embargo policies are working pretty well for projects like modENCODE.

    I don't follow the logic of why genome sequencing centers need to change their policies. What sequencing centers are we talking about here? Big gov't centers? They've got a whole boatload of their own labs to support. If the small labs around the country started doing their own in-house sequencing, that wouldn't make much difference to them.

    As for whether or not in-house whole-genome sequencing will be cheap and ubiquitous in 5 years, who knows? I would compare the results being produced with the current crop of nextgen (sic) sequencers - Illumina et al. - to what was predicted/hoped of them 5 years ago, and use that to project where we'll be in 5 years.

  • antipodean says:

    All blogwars are pillowfights. This is way more important.

  • Daniel Brown says:

    "how would you feel as the researcher who assembled a group of researchers from the community, put a proposal together, drummed up support from the community..."

    It's hard for me to feel that bad for these big name champions of the more esoteric sequencing projects simply because a large reason they are able to drum up the support and funding is that public use, the avalanche of data that will follow, and the speed at which this will happen are often central elements of the project's stated merits.

    Disclaimer: until recently I was one of those fish waiting at the end of an echinoderm sequencing pipeline.

  • James Sweet says:

    There are labs that literally wait at the end of genome center pipelines waiting for data in the same way that fish swarm sewage outfalls.

    So a "genome center pipeline" is a literal physical place, with a bunch of scientists hanging out?

    Sorry, I get your meaning, but your use of the word "literally" was, eh, pretty much exactly wrong in that sentence.

    Carry on!

  • proflikesubstance says:

    I don’t follow the logic of why genome sequencing centers need to change their policies. What sequencing centers are we talking about here? Big gov’t centers? They’ve got a whole boatload of their own labs to support. If the small labs around the country started doing their own in-house sequencing, that wouldn’t make much difference to them.

    Um, I'm pretty sure they need clients, just like everyone else. If the choice is between maintaining control of your data and paying for it (which is getting cheaper daily), or working your ass off to get it done for free and losing control of it, I think more and more people are choosing the former.

    It’s hard for me to feel that bad for these big name champions of the more esoteric sequencing projects simply because a large reason they are able to drum up the support and funding is that public use, the avalanche of data that will follow, and the speed at which this will happen are often central elements of the project’s stated merits.

    Really? Speed is the only factor on which genome projects get supported? That's news to me.

    Sorry, I get your meaning, but your use of the word “literally” was, eh, pretty much exactly wrong in that sentence.

    And thanks for literally adding nothing to the discussion.

  • gnuma says:

    @Thomas Joseph, the Bermuda Accord is why we have embargoes in place. At least in my experience, the data were publicly available, but publishing on the data was frowned upon until the people responsible for the data generation had published. I believe PLS is talking about sequencing centers moving to a non-embargo policy.
