The ethics of data release

I was recently talking to a colleague who works in genomics, and we got on the topic of data release policies. I learned something interesting that I didn't know about the sharing of genomic data: almost all major genomics centers are going to a zero-embargo data release policy. Essentially, once the sequencing is done and the annotation has been run, the data is on the web in a searchable and downloadable format.


How many other fields put their data directly on the web before those who produced it have the opportunity to analyze it? Now, obviously no one is going to yank a genome paper right out from under the group working on it, but what about comparative studies? What about searching out specific genes for multi-gene phylogenetics? Where is the line for what is permissible to use before the genome is published? How much of a grace period do people get with data that has gone public, but that they* paid for?

It seems to me this is a very slippery slope because every genome paper has a different focus and it is no longer Glamour Mag worthy to just describe the genome of an organism. There has to be a hook and that hook is almost always related to the interesting biology of an organism or to resolution of a broader long-standing question based on the new data from the genome. However, these are the very things that people who are not part of the genome project would be interested in once the data are released.

The colleague I was talking to was of the opinion that the (in her mind) small risk of someone scooping a major theme of the resulting paper was outweighed by the benefit to the community of having the data fresh off the machine. However, she is a tenured prof with an impressive CV and a name that might scare off the vultures, and I wondered whether she would have the same opinion if she were untenured.

Having my data pitched onto the internet the second I had it in my own hands would make me exceedingly nervous, even if my data were on the scale of a full genome. Stories of unscrupulous researchers more than willing to snap up any data they can find abound, and I have seen blatant cases of it myself. Are the genomics community and anyone who can benefit from their data just that much more principled? Somehow I find that a hard sell. And how does one make a complaint about someone else publishing your* data if it is sitting in a public database?

I will be interested to see whether there are any high-profile dust-ups over this in the near future or whether a genome really is big enough for the whole community.

*Obviously we are talking about grant-funded projects, so the money is taxpayer money, not any one person's. Nevertheless, someone came up with the idea and got it funded, so there is some ownership there.

29 responses so far

  • This happened to someone I know - the professor (a big swinging dick tenured HHMI professor) ended up on the phone with the journal editor, angrily explaining why the data in the competing group's paper that he had just received to review came from his very own lab, without any knowledge or request for collaboration. I believe this led to a great deal of bad blood between the professor & the journal, not surprisingly.

  • Dorothea Salo says:

    Genomics is an odd case, because early on in the human sequencing project, the Big Kahunas all got together and decided that they would all benefit if they collectively didn't play dog-in-manger games with their data. Other disciplines are only now starting to confront this question.

    It's worth pointing out that there is a continuum of data openness even now across disciplines; not all scientists are chary with their data. It's also worth pointing out that even at this early date, releasing data appears to garner more citations for any resulting papers. It's also also worth pointing out that librarians and publishers are working on ways to cite data the same way one cites papers, which allows data producers to establish primacy and get some career credit for releasing data.

    Keep an eye on the Open Notebook Science folks, too. Are they on to something?

    All that said, quite a few scientists who lean toward the "open" end of the continuum don't release data until they're pretty sure they're done with it. As rabid about open as I am, I don't see a huge problem with that. It's still an improvement over never releasing at all.

  • Dr. O says:

    At least in the case of microbial genomes, the sequencing is mostly being spun out by huge institutes, and I haven't seen many papers on the genomes in the past couple of years. I think most researchers have started to view genome sequencing as more of a service and less of a research project. (In fact, you can now send your favorite bug's DNA off to other sites to get sequenced for you in a pretty short period of time, much like sequencing a gene used to be.) Since the analysis of the genome sequence takes quite a bit more time than its sequencing, and there are already so many genomes out there that aren't published on, I don't think this has been a big problem for bacteriologists.

    How this might all apply to higher organisms (and larger genomes 🙂) might vary considerably, though.

  • proflikesubstance says:

    DS - But I wouldn't be too happy if someone stole the main thunder of my genome paper, even if they credited me with producing the data.

    DrO - I think the case in eukaryotic genomics is quite different, because the number of genomes is so much smaller. An individual can analyze a prokaryotic genome, but one person just can't (at the moment) tackle a nuclear genome in a reasonable amount of time. We're not quite at the $10,000 genome for eukaryotes just yet and the assembly is still extremely tricky, so it is not the free love land of prokaryotic genomes.

  • HennaHonu says:

    This is already a problem. A group of collaborators sequenced the genome of a new organism, publishing it on the JGI site. The primary grant writers and workers were "scooped" by a collaborator (who had very little input on the project) who published a paper covering two chapters of the primary graduate student's dissertation. S/he has taken 1.5 years longer to graduate and has lost the big impact of their dissertation. The "scooper" can be reported to JGI, but the damage is done and the penalties non-existent. There is always more to learn based on the genome, but the collaborator did not do most of the work and gets most of the credit.

  • anon says:

    Great post, this is a very funny area we're in right now. PLS, I hope this isn't the "free" genome that you waited forever to get.

    I knew this going into a project with a huge genome center and intentionally didn't give any metadata, basically just taxon1, taxon2. I knew other folks could do analyses more quickly, although without any true biological knowledge they would likely make some big mistakes. The data got released and I've had many requests for more information, several from Chinese informaticists, but no requests for collaborations. I had no idea anyone would be this interested in this somewhat obscure group.

  • Christina Pikas says:

    I was talking to the non-LIS member of my committee and this came up. He brought up both what Dr. O mentions (large institutes churning out sequences) and also a frustration with people who sit on data when it could be used to save lives. His opinion was that if you weren't fast enough to generate a paper before someone who located the data and then started working on it, tough luck, and try to be quicker next time. Sounds harsh, but makes some sense.

  • Rich says:

    The general etiquette in the genomics community is that genome sequences are free for people to play with once they are made available, and they are usually made available before the genome paper is published. However, if you are analyzing the whole genome, you cannot publish your results until an embargo period has passed (often until the genome paper is published). You can use small bits of the genome sequence data and publish results on those bits (what size constitutes a small bit isn't always clear) before the genome paper comes out.

    This is not only true of genome sequences; it's also true for more functional genomics datasets. For example, the modENCODE data are freely available under an embargo until the relevant paper(s) come out.

  • I totally agree with PLS on this one. I have had this happen to me (a program officer published a review of his program using data from a review meeting, most of which was not published yet--oops) and I was LIVID. I still got my publication, and the program officer's review was not in a major journal, but coming from a small lab with fewer resources, I could have easily been scooped by someone from a big lab with an army of students and postdocs.

    Christina, your colleague must be well resourced and not familiar with what it is like to be in a small and poor lab. Sometimes, there just isn't the personnel to race with better funded labs. Attitudes like that are what make people vehemently opposed to open data schemes, which is too bad. I don't think there is a ton of data (outside of clinical trials) where delaying access to data directly impacts anyone currently alive. I would think that most people with that kind of data feel the urgency.

    I like the idea of eventually opening up my data, but since I generated it, I should get first crack at publishing it.

  • proflikesubstance says:

    Christina - I have to completely disagree here, mainly because we are talking apples and oranges. I agree that data need to be published in a timely manner, but the difference between cat herding a group of 20-70 authors, picking the stories to highlight and getting the paper out, as opposed to one person or lab cherry picking a piece of hot data for a quick story, is enormous. Like I said, there is no way for people not on the project to know what the highlights of the genome paper are, so opening the data up to that type of vulturing (even if it seems like a "small" piece of data to an outside observer) is inherently dangerous.

    Another problem I see is that this would severely impact the high-profile (first two or three authors) use of grad students on some projects. If there is enormous pressure to get the paper out before you get scooped, will PIs be willing to spend the time to train the students to do this kind of work or will they depend on postdocs or themselves to get it done in time?

    Finally, the sheer volume of data these projects generate means that it takes time to sort through it all and make a good story come together. What are the chances that the data will be as thoroughly analyzed for the genome paper in the competitive model of an open data system? I would argue that the result will be more shallow analysis and more authors in the interest of everyone doing their small(er) part quickly.

    Papers should not be allowed to be published without the data becoming available to the public (all the data and analyses, not just the raw sequence reads), but I think it is a huge disadvantage to early-career faculty, lab trainees and smaller labs to force genome data to be public before the paper.

  • proflikesubstance says:

    "You can use small bits of the genome sequence data and publish results on those bits (what size constitutes a small bit isn’t always clear) before the genome paper comes out."

    I think this is exactly my point. Without knowing what the group who produced the data are going to highlight, what might seem like a small bit to one person on the outside could be the linchpin to someone's analysis on the inside. Since "general etiquette" can be highly subjective, so can what people pilfer from an open database.

  • Joseph says:

    One compromise that has made a lot of sense to me is an embargo period (which I see in epidemiology genetics projects) of typically one year. This avoids the race to scoop the original group (which is what Christina's approach would result in) but means that the paper(s) need to be out in a timely manner.

    It's true that it is unethical to not publish data with a public health impact, but it's not going to improve matters if the path to success becomes pouncing on data instead of generating data, either.

  • proflikesubstance says:

    But who enforces that embargo? Unless you mean that the data are held in embargo until either the paper is out or the year is up. I could see that working, but having the data public with the agreement that no one will publish anything with it for a year is just asking for problems.

  • Joseph says:

    Well, there are two ways to enforce the embargo. One is via research ethics (deny IRB approval for papers on embargoed data as most journals won't let you publish without IRB approval). But that's admittedly clunky.

    The better approach is the one you suggest of holding the data in embargo. That was the approach that the genome-wide association studies I know of took, and it worked out okay. Giving the grant holder a one-year grace period before releasing the data seems to solve the worst abuses and isn't a huge delay of the results if they should decide not to proceed with them.

  • proflikesubstance says:

    Problematically, however, most genomics centers are going the opposite way for genomes funded directly to the centers. For instance, JGI went from a 6 month embargo to no embargo recently. The Broad appears to have done the same.

    From where I'm standing, it looks like a problem that is going to get worse, not better.

  • gnuma says:

    I was involved with one of these big genome papers -- there was an embargo, but at least one lab did not honor it and analyzed data and submitted papers. The journals held them, but published them in the same issue. Embargos clearly do not work, yet I also am not comfortable with early release of what I would view as 'my' data. You can argue that releasing data speeds up the field and discoveries and what not, but it could also breed a mentality of working round the clock to be the first scooper. Forget having a family, even if you take time off for a soccer game you will be behind!

    My solution would be to find a genome center that doesn't publish -- sorry to be a cynical assmunch, but there are too many bottom-feeders out there!

  • There was a high profile case last year (I blogged about it here - good discussion in the comments), within a collaborative agreement that had embargoes and other guidelines/restrictions in place.

    Description from my linked post:

    "Professor Bierut had contributed data to a shared database. The federally funded project had a publication embargo in place, so that while other researchers could access and analyse the data, the contributors would get the first shot at publication. However, a researcher at a different institute breached the terms of this agreement, and submitted a paper based on Professor Bierut's data a full six months before the embargo was due to expire"

    If someone wants to scoop you badly enough, and they have access to the data, I don't think there's much that can stop them.

  • proflikesubstance says:

    Cath, I think getting scooped from the inside is just a case of bringing a douchebag on board for a 'collaboration'. It sucks, but there is a clear violation of an agreement there and everyone can see the douchosity. What I wonder about for the genomics folks is how they deal with the poaching if the data are public. One could theoretically claim innocence or ignorance in that case. It's a very different situation.

  • Joseph says:

    Yes, but Cath's case does make it seem like the best way to enforce an embargo is to simply not share the data until the end of the time period. I completely agree with you (PLS) that posting the data in the public use database is just asking for legitimate misunderstandings.

  • Joe Hourclé says:

    Most data from NASA missions (at least the space science data I'm familiar with) is distributed as soon as it's made available, but 'available' means different things to different people. For one mission, I don't think we've seen an update from one of the instrument PI teams in over two years. (they generate & calibrate the data, then pass it along to the mission archive).

    Sometimes the data that is 'available' isn't necessarily useful. For imagers, you can get away with doing difference movies to do some science with even the level 0 (raw) data, but in the case of STEREO, I think it took over a year before we (the mission archive) were getting level 2 (processed to physical units) data for one of the in situ instruments.

    Currently, for NASA heliophysics missions, there's the "Rules of the Road"[1], reproduced here in case of link rot:

    1. The Principal Investigators (PI) shall make available to the science data user community (Users) the same access methods to reach the data and tools as the PI uses.

    2. The PI shall notify Users of updates to processing software and calibrations via metadata and other appropriate documentation.

    3. Users shall consult with the PI to ensure that the Users are accessing the most recent available versions of the data and analysis routines.

    4. Browse products are not intended for science analysis or publication and should not be used for those purposes without consent of the PI.

    5. Users shall acknowledge the sources of data used in all publications and reports.

    6. Users shall include in publications the information necessary to allow others to access the particular data used.

    7. Users shall transmit to the PI a copy of each manuscript that uses the PI's data upon submission of that manuscript for consideration of publication.

    8. Users are encouraged to make tools of general utility widely available to the community.

    9. Users are also encouraged to make available value-added data products. Users producing such products must notify the PI and must clearly label the product as being different from the original PI-produced data product. Producers of value-added products should contact the PI to ensure that such products are based on the most recent versions of the data and analysis routines. With mutual agreement, Users may work with the PI to enhance the instrument data processing system, by integrating their products and tools.

    10. The editors and referees of scientific journals should avail themselves of the expertise of the PI while a data set is still unfamiliar to the community, and when it is uncertain whether authors have employed the most up-to-date data and calibrations.


    Of course, even with those rules in place for SDO, at the spring AAS/SPD meeting in May, it was amazing how we'd get the PI teams telling us that the data wasn't yet in a state for people to 'do science' with, not 20 min after a session on all of the new things they had learned from SDO. Even now, the data being released is only officially 'test data'.


  • I agree with a "publish the paper or 1 year, whichever comes first" embargo. Don't put it out there and ask people not to publish the data or at least credit you. That's like sticking a full cookie jar in front of a kid and asking them not to eat a cookie until tomorrow. In an ideal world it would work, but this one is inhabited by scumbags, thieves, and fools.

  • Mizumi says:

    I wonder if the reverse issue can also occur, that as a PI you would like to make some data publicly available, but grant or university policy prevents you from doing so.

  • proflikesubstance says:

    Mizumi, I can't see a case where that might happen.

  • oldcola says:

    Well, I'll give an example for Mizumi:

    Do screening, get peptides, prepare manuscript, wait 'till university's agency decide if there is interest for IP, if so wait 'till pharmas are contacted, if so wait 'till discussions for contract are over and contract signed, wait for patent request to be prepared… (for EPO priority).

  • I'm confused. If the data's publicly available, what's unscrupulous about using it? Is it just people violating licensing terms like embargoes?

    Mizumi: This happens all the time outside the NIH. DARPA's given us all kinds of data that we're not free to release, some of it produced by other project members.

  • proflikesubstance says:

    Sorry folks, I thought I heard something like a muffled cry, but then it went away as quickly as it came.

  • noddin0ff says:

    Quoting PLS, "Problematically, however, most genomics centers are going the opposite way for genomes funded directly to the centers. For instance, JGI went from a 6 month embargo to no embargo recently. The Broad appears to have done the same."

    These changes, at least within the NIAID-funded GSCID centers (Broad, JCVI, and U Maryland), are driven by the funding institute, i.e. NIAID, and are not driven by the centers themselves.

    The bulk of the sequence generation from the centers is paid for through contracts with the explicit purpose of generating community resources, and is specifically not meant to compete with R01-like activities. Any investigators working with centers should have been fully informed by the centers as to the data release policy and the potential for poaching. There are certainly risks, but there are also benefits.

    I think the ethics of data embargo are problematic and completely worthy of discussion, but I think that the impression created through this thread that the centers are part of a problem is disingenuous.

    Centers have been operating for several years now on no-embargo standards. For many years, at my institute, ABI3730 traces went straight from sequencer to NCBI before any additional processing or analysis occurred. Similarly for assemblies and annotations. In the transition to 454, Illumina, SOLiD, etc. there was a lull where there wasn't enough bandwidth in the existing infrastructure to make this happen in real time, but that is getting fixed.

    Metadata, however, is still a hot topic in the embargo arena. The funding institutes would like to see all metadata (and strains) made public immediately.

    These policies, the swiftness with which they are implemented, and the scale certainly have the potential to shift how PIs approach the value of data. The policies, and the ethics, should be deeply discussed.

    You can read a copy of the NIAID GSCID data release policy as of September 2009 by following this link. Sorry it is so long...
