Research IT


How to: Store and Process Sensor Data Part II

In this second part, we present some excellent options available to UoM researchers for storing and wrangling sensor data, focusing on our Research Data Storage (RDS) and our high-performance compute resource (the Computational Shared Facility; CSF), along with some simple guidelines and links to our extensive documentation.


Part 2 – What are the alternatives to running the data-wrangling code on my laptop?

Part 1 Recap

We showed in the previous article that, even with a conservative estimate of requirements, the data for just 50 sensors (e.g. 50 smart watches), recording four measurements (e.g. NO2, heart rate, temperature) every second over 6 months, will add up to 50GB of data. An equally moderate estimate of the data-wrangling processes that might be required resulted in somewhere between 22 and 28 minutes of running time on your laptop, possibly locking it up, or even crashing it in the process. As we imagine you have plenty of other things to do with your laptop or workstation, we present some other options from Research IT.
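As a sanity check on that 50GB figure, a quick back-of-envelope calculation reproduces the same order of magnitude. The 16 bytes per reading used below is an assumption (e.g. a timestamp plus a value), since Part 1's exact storage format is not restated here:

```shell
# Rough size estimate: 50 sensors x 4 measurements x 1 reading/second
# over ~6 months (182.5 days), assuming ~16 bytes stored per reading.
SENSORS=50
MEASUREMENTS=4
DURATION_SECONDS=$((182 * 24 * 3600 + 12 * 3600))   # ~6 months in seconds
BYTES_PER_READING=16
TOTAL_GB=$(( SENSORS * MEASUREMENTS * DURATION_SECONDS * BYTES_PER_READING / 1000000000 ))
echo "${TOTAL_GB} GB"
```

With these assumptions the total comes out at roughly 50GB, matching the Part 1 estimate.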

Data Storage and HPC Options at the University

UoM Storage Options

IT Services provide three types of centrally hosted and administered data storage for use by research staff and research students: one is the SharePoint drive, and the other two are hosted and managed by the UoM’s Research Data Storage Service (RDS). RDS provides 8 TB of storage at no charge for every funded project. This storage is regularly backed up, and any files can be recovered for up to 35 days after deletion.

The two types of RDS data storage are Common Internet File System / Server Message Block (CIFS/SMB) and Network File System (NFS). While CIFS/SMB storage (like SharePoint) can be mapped directly as a network drive on your laptop, it cannot be accessed from the CSF. NFS storage, on the other hand, can be accessed from the CSF directly, so that is what you will need for running jobs on the CSF.

If you need additional storage space, you can use the Connect Portal form to ask for additional space.

University HPC Options: The CSF

The UoM Computationally Intensive Research (CIR) Ecosystem provides the CSF3, a batch-based HPC cluster, among many other services, for our researchers. Some Faculties also provide access for their postgraduate researchers (PGRs) and undergraduates (UGs). There is also some limited "free at the point of use" resource available in the CSF, funded by the University.

CSF3 has 21,600+ CPU cores and 200 Nvidia GPUs, which are used for a wide variety of work, including:

  • serial and parallel (SMP) computation using one or more CPU cores;
  • high-throughput work (running many copies of a job to process many datasets);
  • work requiring large amounts of memory (RAM) or access to high-capacity, dedicated local (disk) storage with fast I/O;
  • work that requires GPUs.

Full details of different types of storage available in CSF3 can be found on the facility website.

How Do I Use the CSF?

Fortunately, the RIT Platforms team have written extensive and detailed documentation on using the CSF and run training courses periodically. They can be contacted using the options on the HPC Help webpage:
https://ri.itservices.manchester.ac.uk/help/

We provide an overview of the main steps, along with links to the key documents that explain these tasks in more detail.

Obtain a CSF Account

The full UoM RIT instructions for obtaining an account in CSF3 can be found on the Research Platforms webpage.

In a nutshell, accounts are available to all members (PIs, postdoctoral research staff and postgraduate students) of research groups. If your supervisor or the PI of your project contributes funds to the CSF, your jobs will be given a higher priority and your maximum permitted job size will be larger. If not, don’t worry: RIT still provide a more limited ‘free at the point of use’ service. Use the HPC Request/Help form to request an account, making sure to provide all the necessary details as suggested on the above webpage.

Log Into the CSF

Instructions are available for Windows, macOS and Linux users. You will log in and communicate with the CSF via a terminal using an SSH (secure shell) application. If you are using Windows, it is currently recommended that you use the (free) MobaXterm application, which has a friendly GUI as well as a terminal where you can type your CSF commands.
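On macOS and Linux (or within MobaXterm's session settings on Windows), an entry in your `~/.ssh/config` file saves typing the full address each time. The hostname below is an assumption for illustration — check the login instructions for the current address:

```
# ~/.ssh/config (hostname and username are placeholders - check the CSF docs)
Host csf3
    HostName csf3.itservices.manchester.ac.uk
    User your-uom-username
```

With this in place you can connect with just `ssh csf3`.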

Port Your Code into Your CSF ‘home’ Folder

Your home folder is the default location after you have logged into the CSF and is where you should store important files that you wish to keep. This folder provides you with around 250GB to 500GB of storage space.

If your code is hosted in an online version control system such as GitHub or GitLab, you can simply ‘clone’ your code repository into your home folder, as you would when obtaining local copies of repositories on your local machine.

If you don’t use online version control, then you will need to transfer your code files/scripts into your ‘home’ directory.
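If you are transferring files by hand, `scp` or `rsync` from your local terminal is the usual route. The sketch below demonstrates the copy step between two local temporary folders so it can be run anywhere; against the CSF, the destination would instead be something like `username@csf3.itservices.manchester.ac.uk:~/myproject/` (hostname assumed — check the file-transfer documentation for the correct address):

```shell
# Illustrative copy of a project folder. SRC stands in for your local
# project and DEST for your CSF home directory.
SRC=$(mktemp -d)
DEST=$(mktemp -d)
echo 'print("wrangling...")' > "$SRC/dummy_data_wrangling.py"
# On the CSF this would be e.g.: scp -r "$SRC" username@<csf-login-node>:~/
cp -r "$SRC/." "$DEST/"
ls "$DEST"
```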

Copy Your Code into Your CSF ‘scratch’ Folder

Each CSF user is also allocated a directory within the ‘scratch’ filesystem, which serves as your main workspace. This folder is the area where you will temporarily store copies of the code/scripts to be run, alongside any input and output data files used when running your application/script. You should create a separate folder inside your scratch folder and keep copies of your code/scripts (and the jobscript file that you will create shortly) inside that folder.

The scratch filesystem provides much greater storage capacity and is much faster than your home directory, making it the most suitable place for running jobs and storing files temporarily. However, scratch space is not suitable for long-term storage of important files: no back-ups are made of files in your scratch directory, and an automated scratch cleanup system deletes any files that have not been read or written in the last three months.
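Concretely, the layout described above might look like the following. A temporary directory stands in for your scratch area so the sketch runs anywhere; on CSF3 itself you would use your real scratch path (check the storage documentation for its exact location):

```shell
# Stand-in for your scratch area (on the CSF, use your real scratch path).
SCRATCH=$(mktemp -d)
# A dedicated folder for this job, holding code, jobscript and data together.
JOBDIR="$SCRATCH/dummy_data_wrangling"
mkdir -p "$JOBDIR"
# Copy your code and jobscript from home into the job folder
# (created here with touch purely for illustration).
touch "$JOBDIR/dummy_data_wrangling.py" "$JOBDIR/my_dummy_data_wrangling_job.txt"
ls "$JOBDIR"
```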

Ensure That the Required Software is Installed

The CSF has many widely used software applications and tools installed; the documentation explains which are available, how to request other software, and how to install your own software packages.
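As an example of installing your own packages, the jobscript in the next section assumes a conda environment called dummy_data_test already exists. A one-off setup on a login node might look like the sketch below (the requirements.txt file and the Python version are hypothetical — adapt them to your own project):

```
# One-off environment setup on a CSF login node (sketch, not a definitive recipe)
module purge
module load apps/binapps/conda/miniforge3/25.3.0
conda create -n dummy_data_test python=3.11 -y
source activate dummy_data_test
pip install -r requirements.txt
```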

Write Your ‘jobscript’ and Run It From the ‘scratch’ Area

To run your code on the CSF, you need to create a ‘jobscript’ file. The RI website has a 10-minute tutorial to get you started. This easy-to-follow tutorial teaches you how to create a very simple jobscript which does not call any external applications or scripts.

You will see from the tutorial that at the end of each jobscript file is a list of the commands to be executed for that job. These commands are similar to those you might run in a terminal on your local machine, to run your code scripts locally. Once you have mastered the simple example, you can write your own jobscript file that includes commands which call your own code scripts. For example, here is a CSF jobscript for an example Python application. It is a text file: ‘my_dummy_data_wrangling_job.txt’ and contains the following text:

#!/bin/bash --login
#SBATCH -p serial    # run in the serial partition (a single CPU core)
#SBATCH -t 2-0       # wallclock time limit of 2 days

# Commands
# Clean environment and set up to use the centrally installed modulefiles for anaconda
module purge
module load apps/binapps/conda/miniforge3/25.3.0

# Activate the dummy_data_test environment
# (This must already be created and have requirements.txt modules loaded into it.)
source activate dummy_data_test

# Run the app
python -m dummy_data_wrangling

Once you are happy with your jobscript file, check in the terminal that you are in the same folder as the jobscript and submit it using the sbatch command. For our example jobscript, the terminal command looks like this:

sbatch my_dummy_data_wrangling_job.txt

Should I Run Single-core or Multiple-core (parallel) Jobs?

In Part 1 we showed an example of how jobs that only use 1 CPU core can slow down or even crash your laptop (or a workstation without enough RAM installed), but it ran without any issues on the CSF. Also, if you run the job on the CSF, there’s no need to worry about using your laptop for normal, everyday tasks and experiencing things like performance lags, accidental termination of jobs, running out of battery power, etc. So even if you don’t plan to make use of multiple CPU cores, you will still see a huge benefit from using the CSF.

If you intend to make use of multiple CPU cores running in parallel, then you have two choices on the CSF, and you can find out more information in the links provided. You can:

  • Create a jobscript and submit a single parallel job that runs your code (e.g. your Python code or R Script) using the number of CPU cores and the amount of memory specified in your jobscript.
  • Create a jobscript and submit a ‘Job Array’, where the same jobscript runs multiple copies of your job, each using a different file as input.
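For the job-array option, the earlier jobscript needs only small changes. The sketch below assumes (hypothetically) that the dummy_data_wrangling module accepts an input filename as an argument and that the inputs are named input_1.csv to input_10.csv:

```
#!/bin/bash --login
#SBATCH -p serial    # serial partition (one CPU core per task)
#SBATCH -t 2-0       # wallclock time limit of 2 days
#SBATCH -a 1-10      # job array: run 10 tasks, with IDs 1 to 10

module purge
module load apps/binapps/conda/miniforge3/25.3.0
source activate dummy_data_test

# Each task picks its own input file via the task ID that Slurm provides
python -m dummy_data_wrangling "input_${SLURM_ARRAY_TASK_ID}.csv"
```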

More detailed information on both approaches is available in the CSF documentation.

Collect Your Wrangled Output Data

The above tutorial explains how to run the jobscript and shows how the output files are saved into the directory from which you ran it. Any files saved as outputs from your data-wrangling script will also be saved there, with relative paths resolved against that run folder.

Remember to copy any results to a more stable location, as they will be deleted from your scratch folder after 3 months and no back-ups are ever made. For example, you might save them to your CSF ‘home’ folder if they are not too large, to your (or your research team’s) RDS folder to which you have access, or download them to your local machine (refer again to the file transfer instructions if required).

Summary

If you find yourself trying to pre-process datasets that are larger than expected, and/or your data-wrangling/analysis scripts are locking up your machine, then, unless you have a lot of patience and nothing much else to do on your laptop, you need to consider the alternatives.

Research IT can advise you on the alternatives. We provide an RDS service to store and share large datasets and the CSF service to run jobs on those datasets, freeing up your own machine for other tasks.

If you have never used HPC facilities before, then there is a small initial learning curve, but the RIT Platforms team have provided extensive documentation and tutorials to help you on your way. They also run training courses and it is possible to contact them via a Connect ticket to ask for assistance with things like setting up storage or running specific scripts/applications.