Server space – cloud or on-campus?
Cloud storage is not the cure-all it is sometimes made out to be. On-campus servers have some advantages in transmission speed and cost, especially as storage requirements grow. Talk to your campus IT services before shelling out 90% of your R01 to Amazon to store your petabyte (PB) dataset.
Cloud storage is good for personal use. It makes switching to a new computer, or working on documents from many different devices (including mobile), seamless. OneDrive has been working well in our hands (across OS X, iOS, Windows, Windows Phone, and Android), but we haven’t done an exhaustive comparison. It also makes setting up a new computer easy: no need to deal with documents at all. Just install the applications and the cloud client of choice, then let it sync.
Cloud storage is also good for data sharing. It usually has good access control features and can be accessed from anywhere in the world, on multiple platforms.
However, cloud storage is not always ideal for a lab’s data storage. Any cloud system is going to have a bottleneck between you and the cloud, and that bottleneck can be a significant limit: it can be much faster to download a TB of data from a server across campus than from a server in another state. Moreover, while cloud file storage can be cheap for small applications (<1 TB), it can be expensive for larger data sets (e.g., ~$24k/year for 100 TB).
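For budgeting, the arithmetic is simple enough to sanity-check yourself. Here is a minimal sketch, assuming a flat ~$0.02/GB-month object-storage rate (an assumption; check current pricing, and remember egress charges are extra):

```python
# Back-of-the-envelope cloud storage cost; the $/GB-month rate is an assumption,
# so check current pricing before budgeting anything.
GB_PER_TB = 1024

def annual_storage_cost(size_tb, usd_per_gb_month=0.02):
    """Rough yearly bill for keeping size_tb of data in object storage."""
    return size_tb * GB_PER_TB * usd_per_gb_month * 12

print(f"1 TB:   ${annual_storage_cost(1):,.0f}/year")    # ~$250/year: cheap
print(f"100 TB: ${annual_storage_cost(100):,.0f}/year")  # ~$24,600/year: not cheap
```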
Instead, we use an on-campus server. The default campus options are sometimes competitive on their own, but it can be even cheaper if you buy the hardware, locate it in existing campus IT server space, and then pay campus IT to manage it. It took some time and discussion, but our campus IT eventually realized that our server is pretty low-maintenance to administer: we buy the hardware, they put it online and swap out drives as they go bad. One reason it’s cheaper is that our demands are fairly low. We don’t need 99.9% uptime (that’s less than 9 hours of downtime per year), and although we do need someone on call, we rarely use that help. Relaxing those constraints lets them set a flat fee and review it once per year. Of course, the university is also covering some of the overhead, and that helps too.
So take advantage of the IT infrastructure you already have on campus.
Also, if you can afford it, all-flash servers are probably the way to go. Pictured above is Intel’s “ruler” class of flash memory modules (SSDs) for servers. A goal is to squeeze 1 PB into 1 U of rack space.
A helpful note: I find most image data is pretty compressible (i.e., for a 16-bit image, most pixel values are far below the 16-bit maximum). That compression makes data access faster and storage cheaper.
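As a quick sanity check on that claim, here is a minimal sketch using numpy and zlib; the Poisson-noise frame is just a synthetic stand-in for real imaging data, so try it on your own files:

```python
import zlib
import numpy as np

# Synthetic stand-in for one 16-bit microscopy frame: photon-limited data sits
# far below the 16-bit maximum, so many of the 16 bits per pixel carry no information.
rng = np.random.default_rng(0)
frame = rng.poisson(lam=200, size=(512, 512)).astype(np.uint16)

raw = frame.tobytes()
packed = zlib.compress(raw, level=6)  # lossless

print(f"raw: {len(raw) / 1e6:.1f} MB, "
      f"compressed: {len(packed) / 1e6:.1f} MB, "
      f"ratio: {len(raw) / len(packed):.1f}x")
```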
Beyond that, do you have any advice on curating data? I find the "let's keep it all in the highest data tier" tragedy, I mean strategy, is costly and unwieldy. Getting people to put old data into cold storage (or even to delete junk data) is harder than getting volunteers for a lab cleanup day.
I think part of the problem is that often nobody knows what the value of the data is anymore (especially if it didn’t end up in a paper), and it takes a lot of effort to reevaluate random old data. On the other hand, since postdocs and grad students don’t stick around for more than a handful of years at most, I can see why they are not incentivized to try to reduce their individual data footprint.
Agreed. Most data we get in lab is at least moderately compressible, even restricting ourselves to lossless algorithms, and some is massively compressible. Speaking of compression, many people like to keep 100% of their raw data in perpetuity. We won’t argue against that, but it’s not always necessary. For example, that 10-year-old population calcium imaging paper you published? It’s probably fine to just keep the extracted fluorescence traces; I doubt anyone would raise hell if you didn’t have the raw pixel data handy. Or the 1000 hours of HD video from that behavior experiment you published 5 years ago: it’s probably fine to use some lossy compression there.
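If you do go the lossy route for behavior video, a standard codec gets you most of the way. A minimal sketch, assuming ffmpeg is installed and using hypothetical filenames (validate the quality setting against your own scoring needs before committing):

```python
import subprocess

# Hypothetical filenames; CRF ~23 with libx264 is a common quality/size trade-off,
# but check that re-scoring the compressed video gives the same results as the raw video.
src = "behavior_session_raw.avi"
dst = "behavior_session_compressed.mp4"

subprocess.run(
    ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", "23", "-preset", "slow", dst],
    check=True,
)
```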
We currently evaluate this on a case-by-case basis. Every day, all raw data is pushed from the data acquisition computers to the server. For faster data transfer to analysis computers within the lab, we use SSDs plus sneakernet. All raw data is kept through peer review. We haven’t had to delete anything significant yet, but we’re also a young-ish lab. The exceptions I can think of are: intrinsic imaging data, where we keep the Fourier analysis files rather than the raw pixel data (at least not in perpetuity); and video data of mouse behavior, for which we use lossy compression.
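The daily push itself doesn’t need to be fancy. Something like the following is the general idea: a scheduled rsync from each acquisition machine, here sketched with hypothetical paths rather than our actual setup:

```python
import subprocess

# Hypothetical paths; schedule this via cron / Task Scheduler on each acquisition machine.
LOCAL_DIR = "/data/acquisition/"      # where the rig writes raw data
REMOTE_DIR = "labserver:/raw/rig01/"  # per-rig directory on the lab server

# -a preserves timestamps and permissions, -v is verbose,
# --partial lets interrupted transfers resume instead of starting over.
subprocess.run(["rsync", "-av", "--partial", LOCAL_DIR, REMOTE_DIR], check=True)
```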
We also agree that it is good to weigh the cost versus value of organization and curation. Of course it’s easy to argue that everything should adhere to a strict data format, be meticulously documented with extensive metadata, and be submitted to a database. However, there is a cost to this, no matter how slick your system might be. Instead, the level of curation is best evaluated on a case-by-case basis and reviewed as you go. That one-off side project that lasted two weeks and never went anywhere doesn’t need to be as carefully curated as the massive, 5-year, multinational collaboration dataset you’re collecting.
That said, one should be prepared to scale up the level of curation for that one-off project. Plan ahead so that when you do need to step it up and make it NWB-compliant, you’re not making extra work for yourself. Everything needs to be carefully documented (http://labrigger.com/blog/2014/01/08/writing-it-down/), but if it is in a different format at first, that’s fine, and may actually be the most efficient way to proceed.
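One cheap way to plan ahead is to drop a small, human-readable metadata sidecar next to every raw file from day one. A minimal sketch; the fields and filenames here are hypothetical, and the point is only that some structured record exists to convert from later:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_sidecar(raw_path, **metadata):
    """Write a JSON sidecar (raw_file.ext.json) next to a raw data file."""
    record = {
        "raw_file": Path(raw_path).name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        **metadata,
    }
    Path(str(raw_path) + ".json").write_text(json.dumps(record, indent=2))

# Hypothetical example fields; record whatever your experiments actually need.
write_sidecar("session_042.tif", experimenter="XY", subject="mouse_17",
              rig="2p-rig-1", notes="pilot session, 10x objective")
```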
I agree with all of this.
After some pretty bad data losses, I went all-in on storage and backups. The free software available on the Synology NAS wasn’t cutting it (maybe because I bought one without an Intel chip, but mostly because of connection issues). CloudBerry Ultimate plus Amazon S3 Infrequent Access was the fastest, most reliable option for our data. And at ~$300 per month for 27 TB of compressed data, it had better be. It took about a month to upload everything. Backups were great, although for big restores you just have to wait; there was no Snowball option for restores last time I checked.
After about a year of that, I bought 40 TB of server space at the university CS center for $8K (it comes with a 5-year warranty) and dropped the S3. The downside is that the data is still here in Miami. But our university has unlimited Box cloud storage, so I have the backup NAS throwing everything into my personal Box account. Cloud performance isn’t really an issue for us; we just need to know that if 3 independent horrible things happen, the data is still recoverable.