

#Np random permutation code#

Please consider the following code snippet (with `import numpy as np` and `import dask.array as da`), where I create an array mask and use this mask to extract the indices of items matching my query. These indices are then permuted with `da.random.permutation` to form a random order. According to the documentation of `np.random.permutation` (which the dask version mirrors), it will "randomly permute a sequence, or return a permuted range"; if x is a multi-dimensional array, it is only shuffled along its first index.

```python
raw = np.arange(4, dtype=np.int32).repeat(4)
arr = da.from_array(raw, chunks=4)
masked_array = da.ma.masked_equal(arr, 1)
permutation = da.random.permutation(masked_array.nonzero())
```

Running this code results in the following error:

```
TypeError                                 Traceback (most recent call last)
...
      3 masked_array = da.ma.masked_equal(arr, 1)
----> 4 permutation = da.random.permutation(masked_array.nonzero())

~\Miniconda3\envs\venv\lib\site-packages\dask\array\random.py in permutation(self, x)
...
TypeError: 'float' object cannot be interpreted as an integer
```

The issue is that in my original code the array has ~3 million entries, but only ~300-500 of them are valid. I could apply `da.random.permutation` to the masked array without calling `.nonzero()`, but then the entire array is permuted; permuting only the small subset of ~300 valid samples is therefore more efficient.

Also, I do not really understand the resulting error. My initial guess is that dask cannot perform the operation because the resulting size/type etc. of the selection is unknown (see the first sketch at the end of this post).

In my use case I have a set of IDs, where each ID corresponds to a person, and a set of rows, where each row entry is the row/unique-id of a sample in the dataset. For each sample still in my dataset I want to match n other samples of the same ID and m samples of other IDs.

First I used `da.random.permutation` on my masked rows of the same ID (roughly 50-300 samples) and then take the first n elements:

```python
def permutation_example(ID, ids, rows, n):
    ...

ids = da.from_array(dataset.ids, chunks=10000)
rows = da.from_array(dataset.rows, chunks=10000)
data = permutation_example(2, ids, rows, 10).compute()
```

As this did not work I had a look at the inputs/outputs of a delayed function and at the documentation, and saw that I get the input as a numpy array anyway. So now I am using my toolbox, which uses numpy and numba, inside a delayed function and split the work over several workers. To reduce the overhead, the computation is not split up for each of the ~8000 IDs in my dataset, but for a batch of IDs whose optimal size I still have to tweak (a second, simplified sketch of this setup is at the end of this post):

```python
def get_pairs_for_id_batch(id_batch: np.array, ids: np.array, rows: np.array, n: int, m: int):
    ...
    id_data = toolbox.sampling_function(ID, ids, rows, n, m)
    ...

batches = ...
```

Any help is appreciated, thanks in advance.
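To illustrate my guess about the error, here is a first minimal sketch. It reuses the toy array from the snippet at the top (not my real ~3 million entry array, and the chunk size of 4 is arbitrary) and only inspects the chunk metadata without computing anything:

```python
import numpy as np
import dask.array as da

raw = np.arange(4, dtype=np.int32).repeat(4)
arr = da.from_array(raw, chunks=4)
masked_array = da.ma.masked_equal(arr, 1)

# indices of the entries matching the query
idx = masked_array.nonzero()[0]

print(arr.chunks)  # ((4, 4, 4, 4),) -> chunk sizes known up front
print(idx.chunks)  # chunk sizes are nan, i.e. unknown until computed, which is
                   # presumably what da.random.permutation trips over
```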

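And here is a second, simplified sketch of the batched dask.delayed setup described above. Everything in it is a stand-in: the dummy dataset replaces my real ~8000-person data, the plain-numpy `sample_for_id` replaces my numba-compiled `toolbox.sampling_function`, and the batch size and n/m values are arbitrary.

```python
import numpy as np
import dask

rng = np.random.default_rng(0)

def sample_for_id(ID, ids, rows, n, m):
    # stand-in for toolbox.sampling_function: permute the candidate rows and
    # keep n rows with the same ID plus m rows with a different ID
    same = rng.permutation(rows[ids == ID])[:n]
    other = rng.permutation(rows[ids != ID])[:m]
    return np.concatenate([same, other])

def get_pairs_for_id_batch(id_batch, ids, rows, n, m):
    # inside the delayed call everything arrives as plain numpy arrays,
    # so the numpy/numba code can be used directly
    return np.stack([sample_for_id(ID, ids, rows, n, m) for ID in id_batch])

# dummy dataset: 8 persons with 5 samples each (the real one has ~8000 IDs)
ids = np.repeat(np.arange(8), 5)
rows = np.arange(ids.size)

unique_ids = np.unique(ids)
batch_size = 2                                   # optimal size still to be tweaked
batches = [unique_ids[i:i + batch_size]
           for i in range(0, unique_ids.size, batch_size)]

tasks = [dask.delayed(get_pairs_for_id_batch)(batch, ids, rows, 2, 3)
         for batch in batches]
results = dask.compute(*tasks)                   # batches run in parallel on the workers
print(np.concatenate(results).shape)             # (8, 5): n + m = 5 partner rows per ID
```

The open questions for me are mainly the batch size and whether this batching pattern is the intended way to combine dask.delayed with my own numpy/numba code, or whether the masked permutation from the first snippet can be made to work directly.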