Job - dataLoader_{site}.py

This page describes the dataLoader file in detail.

Users can define a custom data loader for each client site separately, which helps with problems such as data heterogeneity and class imbalance. Write each site's data loader according to the data description we provide for that site on the upload job page.

Function definitions

The functions below must exist in this file, and their names and input variables must not be changed. You can modify the body of each function as required.

  1. reSampler - This function defines how data is sampled from the site. You can choose a subset according to custom class distributions or a selection of class labels. It takes two inputs, data and data_size, where data is a pandas DataFrame with the columns Image Index and Finding Labels. The function returns the chosen subset as a DataFrame.

  2. imgReader - This function defines how each data point at the site is read. It takes two inputs, streamedFile and transform. streamedFile is the loaded file, whose contents are of type bytes, and transform is the transformation function you defined in nnMetrics.py. You can define a custom image loader inside this function; it must return the image with float values.
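As an illustration of choosing a subset by class label, here is a minimal sketch of an alternative reSampler. The label names in KEEP_LABELS are hypothetical placeholders, not taken from any site's data description, and the sketch assumes that Finding Labels holds '|'-separated label strings, as the sample file on this page suggests.

```python
import pandas as pd

# Hypothetical labels to keep; replace with labels from the site's data description.
KEEP_LABELS = {"Atelectasis", "Effusion"}

def reSampler(data, data_size):
    # Keep rows whose '|'-separated labels include at least one wanted label.
    has_label = data["Finding Labels"].map(
        lambda labels: bool(KEEP_LABELS & set(labels.split("|")))
    ).astype(bool)
    subset = data[has_label]
    # Never request more rows than remain after filtering.
    n = min(data_size, len(subset))
    return subset.sample(n, random_state=0)

# Toy data with the two required columns, for a local check only.
toy = pd.DataFrame({
    "Image Index": ["a.png", "b.png", "c.png", "d.png"],
    "Finding Labels": ["Atelectasis", "No Finding", "Effusion|Mass", "Mass"],
})
picked = reSampler(toy, data_size=10)
print(sorted(picked["Image Index"]))
```

Capping the sample size at the number of filtered rows is a design choice of this sketch; it avoids the error pandas raises when more rows are requested than exist without replacement.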

A data loader file must be defined for each site listed in the site variable of your input.json. Name each file dataLoader_{site-id}.py, where site-id is the id of the site. For example, if the site variable in input.json is "site-1,site-2", you need two data loader files named dataLoader_1.py and dataLoader_2.py.
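The naming rule can be sketched as a small helper. loader_filenames is a hypothetical function written for illustration only; it is not part of the platform, and it assumes every entry in the site variable has the form site-<id>.

```python
def loader_filenames(site_value):
    """Map a site string such as "site-1,site-2" to the expected file names."""
    names = []
    for site in site_value.split(","):
        # Assumption: each entry looks like "site-<id>"; keep what follows the first dash.
        site_id = site.strip().split("-", 1)[1]
        names.append(f"dataLoader_{site_id}.py")
    return names

print(loader_filenames("site-1,site-2"))
# ['dataLoader_1.py', 'dataLoader_2.py']
```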

Sample file

from PIL import Image
import cv2
from torchvision import transforms
import pandas as pd
import numpy as np
import random

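def reSampler(data, data_size):
    # Weight each row by its number of '|'-separated labels; the small offset
    # keeps rows with an empty label string selectable.
    sample_weights1 = data['Finding Labels'].map(lambda x: len(x.split('|')) if len(x)>0 else 0).values + 4e-2
    sample_weights1 /= sample_weights1.sum()
    # Draw data_size rows without replacement; the fixed seed makes the subset reproducible.
    data = data.sample(data_size, weights=sample_weights1, random_state=0)
    return data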

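def imgReader(streamedFile, transform):
    # Read the raw bytes and decode them with OpenCV into a BGR uint8 array.
    file_byte_string = streamedFile.read()
    image = cv2.imdecode(np.frombuffer(file_byte_string, dtype=np.uint8), cv2.IMREAD_COLOR)
    # imdecode already returns 0-255 uint8 pixels, so no rescaling is needed.
    # Convert BGR to RGB, the channel order PIL expects.
    image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if transform is not None:
        image = transform(image)
    else:
        # Without a transform, fall back to a plain tensor conversion.
        image = transforms.ToTensor()(image)
    return image.float()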
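To see what the weighting in the sample reSampler does, the following sketch recomputes the weights on a toy DataFrame. The rows are invented for illustration; only the column name and the weighting expression come from the sample file.

```python
import pandas as pd

# Invented rows: no label, one label, and two labels.
toy = pd.DataFrame({
    "Image Index": ["a.png", "b.png", "c.png"],
    "Finding Labels": ["", "Mass", "Effusion|Mass"],
})

# Same expression as in the sample file: label count plus a small offset.
counts = toy["Finding Labels"].map(
    lambda x: len(x.split("|")) if len(x) > 0 else 0
).values
weights = counts + 4e-2
weights = weights / weights.sum()

# Rows with more labels receive a larger share of the sampling probability.
print(weights)
```

Under this scheme, images with several findings are drawn more often than images with one or none, which shifts the sampled subset toward multi-label cases.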
