At the eResearch Australasia conference in Melbourne yesterday, AARNet’s David Jericho presented a work-in-progress update for our CloudStor+ storage solution for researchers.
In 2012, AARNet embarked on the CloudStor+ project to build a cloud-based file synchronisation and sharing (“sync&share”) system using solely open source components.
18 months on, the deployment team is now wiser in the ways of large scale file systems, long-distance TCP acceleration, and a lot of the anycast and proxy trickery that comes with them.
We’ve also learnt a thing or two about what researchers actually want to use the storage capability for: how useful it would be to bind together different blocks of allocations on different systems, and how useful it would be if a user could not just store data but also easily execute science workflows on the storage nodes, preferably as easily as getting a web preview of a .pdf or .jpg stored there.
As we’re finding out about these additional opportunities, we’re stretching the original scope for CloudStor+ and inevitably in doing so we’re stumbling across limitations in the intended packages and products (limitations that the software vendors at times weren’t even aware of themselves). Some were easily solved, others we’re still solving.
David’s presentation focused on three topics:
The Internet has a problem.
Many of the protocols the Internet is built on were designed in the 1970s, when high speed links didn’t exist. Given this history, these protocols have served us remarkably well up to now, and in the vast majority of cases they continue to do so. However, when we start to talk about using the Internet to move very large volumes of data from place to place over 10 Gigabit per second (Gbps), 40Gbps and 100Gbps links, these protocols, particularly the Transmission Control Protocol (TCP), start to cause some issues.
The graph below shows how most TCP implementations behave when you use them to try to move a single large data stream at high speed over a long distance, with a relatively small loss of packets along the way.
Image Source: ESNet http://fasterdata.es.net/network-tuning/tcp-issues-explained/packet-loss/
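To give a rough feel for why distance matters so much, the sketch below uses the well-known Mathis et al. approximation for the upper bound on single-stream TCP throughput under random packet loss: rate ≤ (MSS / RTT) × (C / sqrt(p)). This is a simplified model, not a description of how CloudStor+ measures anything; the segment size, round-trip times, loss rate and the function name are illustrative assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on single-stream TCP throughput in bits/s.

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)).
    c defaults to sqrt(3/2), the constant from the original derivation;
    pass c=1.0 for the simpler rule-of-thumb form.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed values for illustration only: 1460-byte segments and
    # one packet lost in every 10,000 (p = 0.0001).
    mss = 1460
    loss = 0.0001
    for label, rtt_ms in [("same city", 1), ("interstate", 20), ("intercontinental", 200)]:
        rate = mathis_throughput_bps(mss, rtt_ms / 1000, loss)
        print(f"{label:>16} (RTT {rtt_ms:>3} ms): ~{rate / 1e6:,.0f} Mbps")
```

Because the bound is inversely proportional to the round-trip time, the same loss rate that is barely noticeable within a city becomes crippling across a continent. That is the intuition behind moving data closer to the researcher.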
While we can overcome these problems through careful tuning, it is not practical to apply this type of tuning to every researcher’s computer around the country! The next best solution is to move the data physically closer to those researchers and avoid these issues altogether.
So that’s precisely what CloudStor+ seeks to do: put the data closer to the researcher, using highly tuned “proxy” nodes near researchers’ computers (often in the same city or state), which can overcome the issues normally associated with long, high speed links.
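One common part of such tuning is sizing TCP buffers to the bandwidth-delay product (BDP), the amount of data that must be "in flight" to keep a link full. The sketch below is a back-of-the-envelope calculation only; the link speeds, round-trip times and function names are assumptions chosen for illustration, not CloudStor+ settings.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes, rtt_s):
    """Best-case throughput when the TCP window, not the link, is the limit."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    # Assumed example: a 10 Gbps path with a 100 ms round-trip time.
    needed = bdp_bytes(10e9, 0.100)
    print(f"Buffer needed to fill 10 Gbps at 100 ms RTT: {needed / 1e6:.0f} MB")

    # An untuned host limited to a 64 KiB window on the same path.
    untuned = window_limited_bps(64 * 1024, 0.100)
    print(f"64 KiB window at 100 ms RTT: ~{untuned / 1e6:.2f} Mbps")

    # The same 64 KiB window over an assumed 1 ms path to a nearby node.
    nearby = window_limited_bps(64 * 1024, 0.001)
    print(f"64 KiB window at 1 ms RTT:   ~{nearby / 1e6:.0f} Mbps")
```

The required buffer grows linearly with the round-trip time, so a short path to a nearby node is far more forgiving of an untuned desktop than a long path to a distant data centre.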
The number of sources of data a researcher is likely to encounter over their career is rapidly growing. This might be because we, as a community, are getting better at sharing our research data, something which has to be hugely valuable for researchers and research institutions as a whole – and should be encouraged and supported.
Again, we think CloudStor+ can help here. We are working with data providers around the sector to make it easier for researchers to access those many and varied data sets through a common interface which, as a happy side-effect, can help with some of the “long distance, high speed link” problems we talk about above as well.
Work is underway to test performance and smooth the user experience of such “storage side mountings”; at the time of writing, experiments are being conducted together with the NeCTAR VicNode at UniMelb to work out the best way to integrate compute, data and storage together to make the experience as easy as possible for researchers.
Another interesting trend we are observing across the sector is that research groups are developing reusable automated workflows to help them to ingest, process, and distribute their data and metadata in a more structured and repeatable way. We think there is value in sharing and reusing these workflows, as well as integrating them more closely with the underlying storage and compute platforms researchers are using in their day-to-day work.
Interestingly, while it is certainly possible to write plugins for some commercial and closed-source platforms, this is often more onerous than doing so for open-source platforms: hence the “open” in “open source”. The “sync&share” component of CloudStor+ is based on an open source platform called “ownCloud” and happens to have its own open APIs, which can be used to develop such plugins.
Because CloudStor+ is based on a number of open source software packages, we are finding prospective collaborations with research groups popping up in the discussions we have around the sector. One example is the Cr8it metadata packager, a development being led by the University of Western Sydney. We always look for ways to make such initiatives work well with CloudStor+ for the benefit of the sector.
Apr 15, 2020