Graph Neural Networking
Challenge 2020

Dataset description

This repository contains datasets with simulation results of delay, jitter and packet loss for three different network topologies: NSFNET (14 nodes), GBN (17 nodes) and GEANT2 (24 nodes). The dataset is divided into two compressed files (.tar.gz) corresponding to the training and validation datasets. The training dataset contains only samples simulated in the NSFNET and GEANT2 network topologies, while the validation dataset contains samples simulated in GBN. Likewise, the test dataset that we will release during the evaluation phase will contain only samples simulated in another topology, with distributions very similar to those of the samples in the validation dataset.

In particular, the dataset focuses on queue scheduling policies: each node is configured with a scheduling policy according to the scenarios explained below. The following policies are used: Strict Priority (SP), where packets in higher-priority queues are transmitted first, Weighted Fair Queueing (WFQ) and Deficit Round Robin (DRR). For all the queue scheduling policies implemented, we always consider three queues on each node’s output port. Thus, each sample of the dataset includes a traffic matrix where flows may have 3 different Types of Service (ToS=[0,1,2]), each associated with one of the three queues with different priority (e.g., ToS=0 is associated with the first queue of the nodes). At the beginning of each simulation, a ToS is assigned to the flows generated in each source-destination path. All the packets of the path have the same ToS.

Packets are generated on each flow following a Poisson distribution. To this end, we use an exponential time distribution (TimeDist.EXPONENTIAL_T) to model inter-packet arrival times. Likewise, we use a binomial distribution (SizeDist.BINOMIAL_S) to model packet size. The maximum bitrate that paths may have (maxAvgLambda) is selected randomly for each simulation, between 400 and 2000 bits per time unit.
Note that packet delay is limited to 20 time units for all the simulation scenarios.
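
As a rough illustration of the traffic model described above, the following sketch draws exponential inter-arrival times for one flow and picks each packet size from two options. It is not the simulator's code: the function name, the packet rate, the two packet sizes and the probability of choosing the first size are all assumptions made for the example, since the description does not state how the simulator chooses between the two sizes.

```python
import random

def generate_packets(pkts_per_time_unit, size_1, size_2, p_size_1, duration, seed=None):
    """Return (arrival_time, size_bits) tuples for one flow.

    Hypothetical helper: inter-arrival times are exponential with rate
    pkts_per_time_unit, and each packet takes size_1 with probability
    p_size_1 (an assumed parameter), otherwise size_2.
    """
    rng = random.Random(seed)
    packets = []
    t = rng.expovariate(pkts_per_time_unit)
    while t < duration:
        size = size_1 if rng.random() < p_size_1 else size_2
        packets.append((t, size))
        t += rng.expovariate(pkts_per_time_unit)
    return packets

# Made-up values: 2 packets per time unit, sizes of 300 and 1700 bits.
packets = generate_packets(pkts_per_time_unit=2.0, size_1=300, size_2=1700,
                           p_size_1=0.5, duration=1000.0, seed=0)
print(len(packets))  # roughly 2.0 * 1000 packets on average
```

With exponential inter-arrival times, the number of packets in a fixed time window follows a Poisson distribution, which is why the two descriptions are equivalent.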

Queue scheduling

The datasets contain randomly ordered samples from 4 different queue scheduling scenarios, with the same proportion of samples from each scenario (i.e., 25% per scenario). These 4 scenarios are described below:

Scenario 1: In this scenario all nodes are configured with a WFQ policy and the weights assigned to each queue are 60% for flows with ToS=0, 30% for ToS=1 and 10% for ToS=2. ToS are assigned to flows with the following probability: 10% for ToS=0, 30% for ToS=1 and 60% for ToS=2.

Scenario 2: As in scenario 1, all nodes of the topology are configured with a WFQ policy but, in this case, we define 5 different profiles with different queue weights. Each node is randomly configured with one of these profiles. A total of 100 scheduling configurations (i.e., nodes + weight profiles) are generated with this criterion for each topology, and each simulation randomly selects one of these configurations.

Weight profile   ToS = 0   ToS = 1   ToS = 2
1                90%       5%        5%
2                33.3%     33.3%     33.3%
3                60%       30%       10%
4                50%       40%       10%
5                75%       25%       5%

ToS are assigned to flows with the following probability: 10% for ToS=0, 30% for ToS=1 and 60% for ToS=2.

Scenario 3: In this scenario each node is randomly configured with one of the queue scheduling policies (SP, WFQ or DRR). For WFQ and DRR, we also define two sets of 5 queue weight profiles, one set for each policy. A total of 100 scheduling configurations (i.e., nodes + policies + weight profiles) are generated with this criterion for each topology, and each simulation randomly selects one of these configurations.

WFQ Profiles

Weight profile   ToS = 0   ToS = 1   ToS = 2
1                90%       5%        5%
2                33.3%     33.3%     33.3%
3                60%       30%       10%
4                50%       40%       10%
5                75%       25%       5%

DRR Profiles

Weight profile   ToS = 0   ToS = 1   ToS = 2
1                80%       10%       10%
2                33.3%     33.3%     33.3%
3                60%       30%       10%
4                70%       20%       10%
5                65%       25%       10%
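
The way a scenario 3 scheduling configuration is put together can be sketched as follows. This is only an illustration under assumptions: the function and variable names are hypothetical, and policies and profiles are drawn uniformly at random, which the description does not state explicitly. The weight values are copied from the WFQ and DRR profile tables above.

```python
import random

# Weight profiles copied from the WFQ and DRR tables (ToS = 0, 1, 2).
WFQ_PROFILES = [(90, 5, 5), (33.3, 33.3, 33.3), (60, 30, 10), (50, 40, 10), (75, 25, 5)]
DRR_PROFILES = [(80, 10, 10), (33.3, 33.3, 33.3), (60, 30, 10), (70, 20, 10), (65, 25, 10)]

def random_scheduling_config(num_nodes, rng):
    """Pick a policy (and a weight profile, where applicable) for each node."""
    config = []
    for _ in range(num_nodes):
        policy = rng.choice(["SP", "WFQ", "DRR"])  # uniform choice is an assumption
        if policy == "WFQ":
            weights = rng.choice(WFQ_PROFILES)
        elif policy == "DRR":
            weights = rng.choice(DRR_PROFILES)
        else:
            weights = None  # Strict Priority uses no weights
        config.append({"policy": policy, "weights": weights})
    return config

rng = random.Random(0)
# 100 candidate configurations for a 17-node topology such as GBN.
configs = [random_scheduling_config(17, rng) for _ in range(100)]
chosen = rng.choice(configs)  # each simulation selects one configuration
```
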
ToS are assigned to flows with the following probability: 10% for ToS=0, 30% for ToS=1 and 60% for ToS=2.

Scenario 4: This scenario is like scenario 3, but ToS are assigned to flows equiprobably (i.e., 33.3% each).

Routing configuration

For each topology, we define 100 routing configurations which are variations of shortest-path routing. For each simulation, one routing configuration is selected randomly among the 100 candidates.

Processing the dataset

It is highly recommended to use the API provided [here] to easily read and process samples from the dataset. However, if you prefer to work directly with the raw data, you can find the description of the dataset format below.

The root directory of the compressed files contains the ‘routings’ directory, where routing configuration files are stored. These files include a matrix describing the destination-based Routing Information Base (RIB) at each node: Routing_matrix(src node, dst node) = output port to reach the dst node from the src node.

The ‘graphs’ directory contains the topologies and their associated features. Each topology file describes the nodes and links of the topology in Graph Modeling Language (GML), including the queue scheduling policy used for each node. This file can be processed with the networkx library using the read_gml method:
import networkx
G = networkx.read_gml(topology_file, destringizer=int)
Finally, we have a set of compressed files, each with 100 simulation samples. Each of these files contains the following data:
  • input_files.txt: Each line of this file contains the simulation number, the topology file, and the routing file used for that simulation.
  • traffic.txt: Contains the traffic parameters used by the simulator to generate the traffic for each iteration. Each line begins with the maxAvgLambda selected for this iteration, separated from the rest of the information by the ‘|’ character. The rest of the line contains the parameters used to generate the traffic for each path. Path entries are separated by semicolons (;) and the parameters of each path are separated by commas (,). The parameters of each path depend on the time and packet size distributions used and are structured as follows:

< time_distribution >, < equivalent_lambda >, < time_dist_param_1 >, .., < time_dist_param_n >, < size_distribution >, < avg_pkt_size >, < size_dist_param_1 >, .., < size_dist_param_n >, < ToS >

For this dataset the time distribution is always Poisson (i.e., time_distribution = 0) and the size distribution is always binomial (i.e., size_distribution = 2).

The remaining parameters for the Poisson/Exponential distribution are:
      • Avg number of packets per time unit considering packets of avg_pkt_size.
      • Exponential max factor: Factor used to define an upper bound for exponential distributions. The upper bound is defined as: ‘ExpMaxFactor’ * equivalent_lambda.

The remaining parameters for the binomial distribution are:
      • Packet size 1: First packet size option (bits).
      • Packet size 2: Second packet size option (bits).
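
The traffic.txt layout described above can be read with a few lines of Python. This is a minimal sketch, not the official API: the function name and dictionary keys are hypothetical, and it assumes that every path entry holds exactly nine comma-separated fields (Poisson time distribution with two parameters, binomial size distribution with two parameters, and the ToS as the last field). The example line uses made-up values.

```python
def parse_traffic_line(line):
    """Split one traffic.txt line into maxAvgLambda and per-path parameters."""
    max_avg_lambda, _, paths_part = line.strip().partition("|")
    paths = []
    for entry in paths_part.split(";"):
        f = [float(x) for x in entry.split(",")]
        paths.append({
            "time_dist": int(f[0]),      # 0 = Poisson
            "equivalent_lambda": f[1],
            "avg_pkts_lambda": f[2],     # avg packets per time unit
            "exp_max_factor": f[3],
            "size_dist": int(f[4]),      # 2 = binomial
            "avg_pkt_size": f[5],
            "pkt_size_1": f[6],          # bits
            "pkt_size_2": f[7],          # bits
            "tos": int(f[8]),
        })
    return float(max_avg_lambda), paths

# Made-up line with two path entries, for illustration only.
example = "1000|0,500.0,0.5,10,2,1000,300,1700,0;0,800.0,0.8,10,2,1000,300,1700,2"
max_avg_lambda, paths = parse_traffic_line(example)
print(max_avg_lambda, paths[1]["tos"])
```

Entries are kept in file order so that their positions in the list are preserved.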

  • simulationResults.txt: Contains the measurements obtained by our network simulator for every sample. Each line in ‘simulationResults.txt’ corresponds to a simulation using the topology, routing and queue scheduling configuration specified in the ‘input_files.txt’, and the input traffic matrices specified in the ‘traffic.txt’ file. At the beginning of the line, and separated by “|”, there are global network statistics separated by commas (,). These global parameters are:
      1. global_packets: Number of packets transmitted in the network per time unit (packets/time unit).
      2. global_losses: Packets lost in the network per time unit (packets/time unit).
      3. global_delay: Average per-packet delay over all the packets transmitted in the network (time units).

After the “|” we have the list of all paths, separated by semicolons (;). The different measurements (e.g., delay, jitter) for each path are separated by commas (,). So, to obtain a pointer to the metrics of a specific path from ‘src_node’ to ‘dst_node’, one can split the line using the semicolon (;) as separator:

list_metrics[src_node*n+dst_node] = path_metrics (from src to dst)

Where ‘list_metrics’ is the list obtained by splitting the part of the line after the “|” character. Note that in a topology with ‘n’ nodes, nodes are enumerated in the range [0, n-1]. This pointer returns the list of measurements for a particular src-dst path. In this list, measurements are separated by commas (,) and appear in the following order:

    1. Bandwidth (in kbits/time unit) transmitted in each source-destination pair in the network (in both directions).
    2. Absolute number of packets transmitted in each src-dst pair (in both directions).
    3. Absolute number of packets dropped in each src-dst pair.
    4. Average per-packet delay over the packets transmitted in each src-dst pair (in time units).
    5. Average neperian logarithm of the per-packet delay over the packets transmitted in each src-dst pair (in time units). This is avg(Ln(packet_delay)).
    6. Percentile 10 of the per-packet delay over the packets transmitted in each src-dst pair (in time units).
    7. Percentile 20 of the per-packet delay over the packets transmitted in each src-dst pair (in time units).
    8. Percentile 50 (median) of the per-packet delay over the packets transmitted in each src-dst pair (in time units).
    9. Percentile 80 of the per-packet delay over the packets transmitted in each src-dst pair (in time units).
    10. Percentile 90 of the per-packet delay over the packets transmitted in each src-dst pair (in time units).
    11. Variance of the per-packet delay (jitter) over the packets transmitted in each source-destination pair.
  • stability.txt: Contains some extra information used to evaluate the status of the dataset. The most relevant parameter in this file is the simulation time required to reach the stability condition which, for each simulation, is the first element of the line.
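
The simulationResults.txt layout described above, including the src_node*n+dst_node indexing, can be sketched as follows. The function names and metric keys are hypothetical labels chosen for this example, and the sketch assumes that every path entry holds the eleven measurements listed above. The sample line is synthetic.

```python
# Hypothetical short names for the eleven per-path measurements, in file order.
METRIC_NAMES = [
    "bandwidth", "pkts_transmitted", "pkts_dropped", "avg_delay", "avg_ln_delay",
    "p10", "p20", "p50", "p80", "p90", "jitter",
]

def parse_results_line(line):
    """Return the global statistics and the list of raw per-path entries."""
    global_part, _, paths_part = line.strip().partition("|")
    packets, losses, delay = (float(x) for x in global_part.split(","))
    stats = {"global_packets": packets, "global_losses": losses, "global_delay": delay}
    return stats, paths_part.split(";")

def path_metrics(list_metrics, src_node, dst_node, n):
    """Look up the measurements of the src-dst path in an n-node topology."""
    values = list_metrics[src_node * n + dst_node].split(",")
    return dict(zip(METRIC_NAMES, (float(v) for v in values)))

# Synthetic line for a 2-node topology: entry k holds the values 10*k .. 10*k+10.
n = 2
entries = [",".join(str(10 * k + i) for i in range(11)) for k in range(n * n)]
line = "5.0,0.1,0.3|" + ";".join(entries)

stats, list_metrics = parse_results_line(line)
m = path_metrics(list_metrics, src_node=1, dst_node=0, n=n)
print(stats["global_delay"], m["avg_delay"], m["jitter"])
```
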
The datasets can be downloaded from these links:

File                         Size     MD5SUM
Training Dataset             5.97GB   f0060fba6b4ac9f761d58799d5b555d0
Validation Dataset           1.12GB   ab0a215e18577f7964cd1431387a3f68
Test Dataset                 289MB    13f2224276d6a05c3bd7e499e8e2dac1
Test Dataset (with labels)   689MB    da183826f9274c310564b9e1616e6f43

If you have any doubts about how to process the datasets, please do not hesitate to register and send an email to the mailing list challenge-kdn@knowledgedefinednetworking.org.