hvn-network


Friday, August 26, 2016

Condition variable

Created by Thomas.

Condition variables: waiting and signaling

A mutex is for locking and a condition variable is for waiting.

A condition variable is always associated with a mutex. In general, some threads wait on a condition variable for a change in the status of a shared resource, which is protected by a mutex.
When a thread acquires the mutex and changes the status of the shared resource, it releases the mutex and 'signals' the waiting threads via the condition variable.

Note that waiting threads sleep until they are woken up.

The first purpose of using a condition variable is to let threads wait until some condition is satisfied before acting. The second purpose is to use the CPU efficiently. In some situations, a thread needs to periodically check whether the status of a shared resource has changed before doing its job, and the CPU must allocate time for this polling. With a condition variable, the thread can go to sleep while the status of the shared resource is unchanged; it is woken up when another thread changes the status and notifies/signals the sleeping thread to handle the job.

The following functions are used.
int pthread_cond_wait(pthread_cond_t *cptr, pthread_mutex_t *mptr);
int pthread_cond_signal(pthread_cond_t *cptr);
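
Before the full examples, here is the canonical wait-side pattern in sketch form ('predicate' is a placeholder for whatever shared-state condition the thread is waiting on, not a real variable):

pthread_mutex_lock(&mtx);
while (!predicate)                      /* recheck: wakeups may be spurious */
    pthread_cond_wait(&cond, &mtx);     /* atomically unlocks mtx and sleeps;
                                           relocks mtx before returning */
/* ... act on the shared state ... */
pthread_mutex_unlock(&mtx);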

We explain how a condition variable works with the example below, based on the producer-consumer problem.

This first version does not use a condition variable.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static int avail = 0;

static void *
threadFunc(void *arg)
{
    int cnt = atoi((char *) arg);
    int j;

    for (j = 0; j < cnt; j++) {
        sleep(1);

        /* Code to produce a unit omitted */

        pthread_mutex_lock(&mtx);   
        avail++;        /* Let consumer know another unit is available */
        printf("thread:%d, #items=%d\n", (int) pthread_self(), avail);
        pthread_mutex_unlock(&mtx);
              
    }

    return NULL;
}

int
main(int argc, char *argv[])
{
    pthread_t tid;
    int j;
    int totRequired;            /* Total number of units that all threads
                                   will produce */
    int numConsumed;            /* Total units so far consumed */
    int done;
    time_t t;
    t = time(NULL);
    /* Create all threads */
    totRequired = 0;
    for (j = 1; j < argc; j++) {
        totRequired += atoi(argv[j]);

        pthread_create(&tid, NULL, threadFunc, argv[j]);
      
    }
    /* Loop to consume available units */
    numConsumed = 0;
    done = 0;
    for (;;) {
        pthread_mutex_lock(&mtx);               
        /* At this point, 'mtx' is locked... */

        while (avail > 0) {             /* Consume all available units */

            /* Do something with produced unit */

            numConsumed ++;
            avail--;
            printf("T=%ld: numConsumed=%d\n", (long) (time(NULL) - t),
                    numConsumed);

            done = numConsumed >= totRequired;
        }
        pthread_mutex_unlock(&mtx);      
        if (done)
            break;
        /* Perhaps do other work here that does not require mutex lock */
    }
    exit(0);
}

The output is below.

[output screenshot]

The code below uses a condition variable (the same header files as in the previous listing are required).


static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int avail = 0;

static void *
threadFunc(void *arg)
{
    int cnt = atoi((char *) arg);
    int j;

    for (j = 0; j < cnt; j++) {
        sleep(1);
        /* Code to produce a unit omitted */
        pthread_mutex_lock(&mtx);        

        avail++;        /* Let consumer know another unit is available */
  printf("thread:%d, #items=%d\n", (int) pthread_self(), avail);
        pthread_mutex_unlock(&mtx);
       
        pthread_cond_signal(&cond);         /* Wake sleeping consumer */      
    }
    return NULL;
}

int
main(int argc, char *argv[])
{
    pthread_t tid;
    int j;
    int totRequired;            /* Total number of units that all threads
                                   will produce */
    int numConsumed;            /* Total units so far consumed */
    int done;
    time_t t;

    t = time(NULL);

    /* Create all threads */

    totRequired = 0;
    for (j = 1; j < argc; j++) {
        totRequired += atoi(argv[j]);

        pthread_create(&tid, NULL, threadFunc, argv[j]);       
    }
    /* Loop to consume available units */

    numConsumed = 0;
    done = 0;
    for (;;) {
        pthread_mutex_lock(&mtx);
       
        while (avail == 0) {            /* Wait for something to consume */
            /* pthread_cond_wait() works as follows:
             * 1. having locked the mutex above, we check the predicate;
             * 2. if there is nothing to consume yet, the call atomically
             *    releases the mutex and puts this thread to sleep;
             * 3. when the condition variable is signaled, the thread (the
             *    consumer) wakes up and reacquires the mutex before the
             *    call returns (it may block until it actually owns the
             *    mutex), then acts on the shared resource and finally
             *    unlocks the mutex itself.
             * The loop rechecks 'avail' because wakeups can be spurious. */
            printf("consumer: I am waiting\n");
            pthread_cond_wait(&cond, &mtx);
        }
  printf("consumer: I got signal\n");
        /* At this point, 'mtx' is locked... */
        while (avail > 0) {             /* Consume all available units */
            /* Do something with produced unit */
            numConsumed ++;
            avail--;
            printf("T=%ld: numConsumed=%d\n", (long) (time(NULL) - t),
                    numConsumed);
            done = numConsumed >= totRequired;
        }
        pthread_mutex_unlock(&mtx);       
        if (done)
            break;
        /* Perhaps do other work here that does not require mutex lock */
    }
    exit(0);
}

Output is below.

[output screenshot]

Here is another example.

The main thread must wait until 'count' exceeds a threshold ('start') before it begins to work on 'count' itself; the condition it waits for is count > start, so until then it goes to sleep. Another thread increments 'count' from the beginning and signals the main thread once the threshold is passed. (The same header files as before are required.)

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int count = 0;
int start = 20;
static void *
threadFunc(void *arg)
{

    int s;

    for (;;) {
        sleep(1);
      
        s = pthread_mutex_lock(&mtx);

        count++;        /* Let main know another unit is countable */
        printf("Sub-thread:%d, count=%d\n", (int) pthread_self(), count);
        if (count > start)
            s = pthread_cond_signal(&cond);     /* Wake sleeping main */
        s = pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int
main(int argc, char *argv[])
{
    pthread_t tid;
    int s;
    int totRequired = 100;
    int done;
  
    s = pthread_create(&tid, NULL, threadFunc, NULL);
    done = 0;

    pthread_mutex_lock(&mtx);

 while (count <= start) {            
 // steps on pthread_cond_wait()
 // 1. after get lock mutex above, check condition variable
 // 2. if condition variable is not signaled yet, auto release mutex and go to sleep status
 // 3. When time that condition variable is signaled, wake up current thread (consumer) and auto lock mutex and do next steps on shared resource 
 // after finish action on shared resource, release mutex. 
 printf("Main thread: waiting here until count=%d\n", start);  
 s = pthread_cond_wait(&cond, &mtx);         
 }
    /* At this point, 'mtx' is locked... */
    printf("Main thread: start running, count=%d\n", count);
    pthread_mutex_unlock(&mtx);
    while (count > 0) {                 /* Consume all countable units */
        /* Do something with produced unit */
        pthread_mutex_lock(&mtx);
        count++;
        printf("Main thread: count=%d\n", count);
        done = count >= totRequired;
        s = pthread_mutex_unlock(&mtx);
        sleep(2);
        if (done)
            break;
    }

    exit(0);
}

Output is below.

[output screenshot]

Resources
[1] The Linux Programming Interface, Michael Kerrisk
[2] UNIX Network Programming, Volume 2, W. Richard Stevens


Mutex

Created by Thomas

Mutexes: Locking and unlocking

A mutex (mutual exclusion) is the most basic form of synchronization.
A mutex protects a critical region, making sure that only one thread (or only one process, if the mutex is shared between processes) can access the critical region at a given time.

The normal outline of the code is:
lock_the_mutex(..);
critical region
unlock_the_mutex(..);

We can allocate mutex statically.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

or we can allocate it dynamically, in which case we must initialize it with pthread_mutex_init().
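
For example, a minimal sketch of the dynamic case (to be placed inside a function):

#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t *mtx = malloc(sizeof(pthread_mutex_t));
if (mtx == NULL)
    exit(EXIT_FAILURE);
pthread_mutex_init(mtx, NULL);    /* NULL = default mutex attributes */
/* ... pthread_mutex_lock(mtx) / pthread_mutex_unlock(mtx) ... */
pthread_mutex_destroy(mtx);       /* destroy before freeing */
free(mtx);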

The following functions lock and unlock a mutex:

int pthread_mutex_lock(pthread_mutex_t *mptr);
int pthread_mutex_unlock(pthread_mutex_t *mptr);


We use the example below to show how a mutex works.

In the first program, two threads increment a shared global variable without using a mutex.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int glob = 0;
static void * threadFunc_a(void *arg) {
    int loops = *((int *) arg);
    int loc, j;

    for (j = 0; j < loops; j++) {
        loc = glob;
        loc++;
        glob = loc;  
    }

    return NULL;
}

static void * threadFunc_b(void *arg) {
    int loops = *((int *) arg);
    int loc, j;

    for (j = 0; j < loops; j++) {
        loc = glob;
        loc++;
        glob = loc;  
    }

    return NULL;
}

int
main(int argc, char *argv[])
{
    pthread_t t1, t2;
    int loops, s;

    if (argc != 2) {
        printf("usage: a.out <#loops>\n");
        exit(0);
    }
    loops = atoi(argv[1]);

    s = pthread_create(&t1, NULL, threadFunc_a, &loops);
    if (s != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(s));
    s = pthread_create(&t2, NULL, threadFunc_b, &loops);
    if (s != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(s));

    s = pthread_join(t1, NULL);
    if (s != 0)
        fprintf(stderr, "pthread_join: %s\n", strerror(s));
    s = pthread_join(t2, NULL);
    if (s != 0)
        fprintf(stderr, "pthread_join: %s\n", strerror(s));

    printf("glob = %d\n", glob);
    exit(0);
}


Each thread increments the global variable 'loops' times.
Each increment first copies the shared global variable into a local variable on the thread's own stack, then increments the local variable, and finally copies the local variable back to the shared global variable. (Note that pthread_create() and pthread_join() return an error number rather than setting errno, hence the use of strerror() above instead of perror().)

The output is below.

[output screenshot]

When the 'loops' value is small (e.g., 200), the result is as expected. Remember that each thread independently increments the shared global value 'loops' times; therefore, the final result should be double that value.

We note that when the 'loops' value is large, the count comes out wrong. The reason is explained shortly below.

Assume thread 1 copies glob (value 2000) into its local variable, and then its CPU time slice expires, giving thread 2 the chance to run. Suppose thread 2 gets enough CPU time to increment glob up to 3000. When thread 1 obtains the CPU again, it increments its stale local copy and writes it back to the global variable, so glob drops from 3000 back to 2001. All of thread 2's increments are discarded.

To resolve this problem, we just make sure that at any given time only one thread can update the glob variable. The code is modified as below (the same header files as before are required).
int glob = 0;   
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void * threadFunc_a(void *arg) {
    int loops = *((int *) arg);
    int loc, j;

    for (j = 0; j < loops; j++) {
  pthread_mutex_lock(&mtx);
        loc = glob;
        loc++;
        glob = loc; 
  pthread_mutex_unlock(&mtx);
    }

    return NULL;
}

static void * threadFunc_b(void *arg) {
    int loops = *((int *) arg);
    int loc, j;

    for (j = 0; j < loops; j++) {
  pthread_mutex_lock(&mtx);
        loc = glob;
        loc++;
        glob = loc; 
  pthread_mutex_unlock(&mtx);
    }

    return NULL;
}

int
main(int argc, char *argv[])
{
    pthread_t t1, t2;
    int loops, s;
    if (argc != 2) {
        printf("usage: a.out <#loops>\n");
        exit(0);
    }
    loops = atoi(argv[1]);

    s = pthread_create(&t1, NULL, threadFunc_a, &loops);
    if (s != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(s));
    s = pthread_create(&t2, NULL, threadFunc_b, &loops);
    if (s != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(s));

    s = pthread_join(t1, NULL);
    if (s != 0)
        fprintf(stderr, "pthread_join: %s\n", strerror(s));
    s = pthread_join(t2, NULL);
    if (s != 0)
        fprintf(stderr, "pthread_join: %s\n", strerror(s));

    printf("glob = %d\n", glob);
    exit(0);
}


The correct result follows.

[output screenshot]

Resources
[1] The Linux Programming Interface, Michael Kerrisk
[2] UNIX Network Programming, Volume 2, W. Richard Stevens

Wednesday, August 24, 2016

Broadcast and Local Multicast in Networking

I. Introduction
    This article discusses multicast, including how link-layer addressing can be used to send multicast or broadcast traffic efficiently from one computer to several others. It also examines the Internet Group Management Protocol (IGMP) [RFC3376] used in IPv4 and Multicast Listener Discovery (MLD) [RFC3810] used in IPv6, which inform IPv4 and IPv6 multicast routers which multicast addresses are in use on a subnetwork. This article does not cover how multicast routing is implemented in wide area networks such as the global Internet.
    There are four kinds of IP addresses used on the Internet: unicast, anycast, multicast, and broadcast.
    Broadcasting and multicasting provide two services for applications: delivery of packets to multiple destinations, and solicitation/discovery of servers by clients.
  • Delivery to multiple destinations
There are many applications that deliver information to multiple recipients: interactive conferencing and dissemination of mail or news to multiple recipients, for example. Without broadcasting or multicasting, these types of services tend to use TCP today (delivering a separate copy to each destination, which can be very inefficient).
  • Solicitation of servers by client
Using broadcasting or multicasting, an application can send a request for a server without knowing any particular server's IP address. This capability is very useful during configuration, when little is known about the local networking environment. A laptop, for example, might need to get its initial IP address and find its nearest router using DHCP.
    Although both broadcasting and multicasting can provide these important capabilities, multicasting is generally preferable to broadcasting because multicasting involves only those systems that support or use a particular service or protocol, and broadcasting does not. Thus, a broadcast request affects all hosts that are reachable within the scope of the broadcast, whereas multicast affects only those hosts that are likely to be interested in the request.  There is a trade-off between the higher overhead and simplicity of broadcast and the improved efficiency but greater complexity associated with multicast.
    Generally, only user applications that use the UDP transport protocol take advantage of broadcasting and multicasting, where it makes sense for an application to send a single message to multiple recipients. TCP is a connection-oriented protocol that implies a connection between two hosts (specified by IP addresses) and one process on each host (specified by port numbers). TCP can use unicast and anycast addresses (recall that anycast addresses behave like unicast addresses), but not broadcast or multicast addresses.

II. Broadcasting

   Broadcasting refers to sending a message to all possible receivers in a network. In principle, this is simple: a router simply forwards a copy of any message it receives out of every interface other than the one on which the message arrived.
On an Ethernet network, a multicast MAC address has the low-order bit of the high-order byte turned on; in hexadecimal this looks like 01:00:00:00:00:00. We may consider the Ethernet broadcast address ff:ff:ff:ff:ff:ff a special case of the Ethernet multicast address. The IPv4 address 255.255.255.255 corresponds to a local network (also called "limited") broadcast.

III. Multicast
To reduce the amount of overhead involved in broadcasting, it is possible to send traffic only to those receivers that are interested in it. This is called multicasting. Fundamentally, this is accomplished by either having the sender indicate the receivers, or instead having the receivers independently indicate their interest. The network then becomes responsible for sending traffic only to intended/interested recipients. Implementing multicast is considerably more challenging than broadcast because multicast state (information) must be maintained by hosts and routers as to what traffic is of interest to what receivers.
1. Converting IP Multicast Addresses to 802 MAC/Ethernet Address
To carry IP multicast efficiently on a link-layer network, there should be a one-to-one mapping between packets and addresses at the IP layer and frames at the link layer. IANA assigns multicast group MAC addresses in the range 01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff. All IPv4 multicast addresses are contained within the address space from 224.0.0.0 to 239.255.255.255 (formerly known as class D address space). All such addresses share a common 4-bit sequence of 1110 in the high-order bits. Thus, there are 32 - 4 = 28 bits available to encode the entire space of 2^28 = 268,435,456 multicast IPv4 addresses (also called group IDs). All 268,435,456 IPv4 multicast group IDs need to be mapped into a link-layer address space containing only 2^23 = 8,388,608 unique entries, so the mapping is not unique. That is, more than one IPv4 group ID is mapped to the same MAC-layer group address. Specifically, 2^28/2^23 = 2^5 = 32 distinct IPv4 multicast group IDs are mapped to each group address. For example, both the multicast addresses 224.128.64.32 (hexadecimal e0.80.40.20) and 224.0.64.32 (hexadecimal e0.00.40.20) are mapped to the Ethernet address 01:00:5e:00:40:20.

The IPv4-to-IEEE-802 MAC multicast address mapping uses the low-order 23 bits of the IPv4 group address as the suffix of a MAC address starting with 01:00:5e. Because only 23 of the 28 group address bits are used, 32 groups are mapped to the same MAC-layer address.


For IPv6, the 16-bit hexadecimal prefix is 33:33, and the low-order 32 bits of the IPv6 address are used to form the link-layer address. Thus, any address ending with the same 32 bits maps to the same MAC address. Given that all IPv6 multicast addresses begin with ff, and the subsequent 8 bits are used for flags and scope information, this leaves 128 - 16 = 112 bits for representing 2^112 groups. Thus, with only 32 bits of MAC-layer address available to encode these groups, as many as 2^112/2^32 = 2^80 groups can map to the same MAC address!

The IPv6-to-IEEE-802 MAC multicast address mapping uses the low-order 32 bits of the IPv6 multicast address as the suffix of a MAC address starting with 33:33. Because only 32 of the 112 multicast address bits are used, 2^80 groups are mapped to the same MAC-layer address.
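
To make the arithmetic concrete, here is a small sketch of both mappings (the function names are ours, not from any library):

#include <stdio.h>
#include <stdint.h>

/* IPv4: MAC = 01:00:5e + low-order 23 bits of the group address.
   'group' is the IPv4 multicast address in host byte order. */
static void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;   /* only 23 of the 28 group bits used */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}

/* IPv6: MAC = 33:33 + low-order 32 bits of the group address. */
static void ipv6_mcast_to_mac(const uint8_t group[16], uint8_t mac[6])
{
    mac[0] = 0x33;
    mac[1] = 0x33;
    mac[2] = group[12];
    mac[3] = group[13];
    mac[4] = group[14];
    mac[5] = group[15];
}

int main(void)
{
    /* 224.128.64.32 and 224.0.64.32 differ only above the low 23 bits,
       so both print 01:00:5e:00:40:20. */
    uint32_t groups[] = { 0xe0804020, 0xe0004020 };
    uint8_t mac[6];
    for (int i = 0; i < 2; i++) {
        ipv4_mcast_to_mac(groups[i], mac);
        printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    }

    /* ff02::1 (All Nodes) maps to 33:33:00:00:00:01 */
    uint8_t g6[16] = { 0xff, 0x02, 0, 0, 0, 0, 0, 0,
                       0, 0, 0, 0, 0, 0, 0, 0x01 };
    ipv6_mcast_to_mac(g6, mac);
    printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return 0;
}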
2. Receiving Multicast Datagrams
Fundamental to multicasting is the concept of a process joining or leaving one or more multicast groups on a given interface on a host. (We use the term process to mean a program being executed by the operating system, often on behalf of a user.) Membership in a multicast group on a given interface is dynamic: it changes over time as processes join and leave groups. In addition to joining or leaving groups, additional methods are needed if a process wishes to specify sources it cares to hear from or exclude. These are required parts of any API on a host that supports multicasting. For more information about the API a host is required to support, refer to RFC 3376. We use the qualifier "interface" because membership in a group is associated with an interface. A process can join the same group on multiple interfaces, multiple groups on the same interface, or any combination thereof.
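
On Berkeley-socket systems, joining a group boils down to a setsockopt() call; a minimal sketch (the group address 239.1.2.3 is just an example):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ip_mreq mreq;

    /* Join group 239.1.2.3 on an interface chosen by the kernel */
    memset(&mreq, 0, sizeof(mreq));
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.2.3");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) < 0)
        perror("setsockopt(IP_ADD_MEMBERSHIP)");
    return 0;
}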
3. Host Address Filtering
To understand how the operating system processes received multicast datagrams for multicast groups that programs have joined, remember that filtering takes place on each host's network interface card (NIC) each time a frame is presented to it (e.g., by a bridge or switch) for possible reception.


Each layer implements filtering on some portion of the received message. MAC address filtering can take place in either software or hardware. Cheaper NICs tend to impose a larger processing burden on software because they perform fewer functions in hardware.

In a typical switched Ethernet environment, broadcast and multicast frames are replicated on all segments within a VLAN, along a spanning tree formed among the switches. Such frames are delivered to the NIC on each host, which checks the correctness of the frame (using the CRC) and decides whether to receive the frame and deliver it to the device driver and network stack. Normally the NIC receives only those frames whose destination address is either the hardware address of the interface or the broadcast address.

However, when multicast frames are involved, the situation is somewhat more complicated.
NICs tend to come in two varieties. One type performs filtering based on the hash values of the multicast hardware addresses in which the host software has expressed interest, which means that some unwanted frames can always get through because of hash collisions. The other type listens for a finite table of multicast addresses, meaning that if the host needs to receive frames destined for more multicast addresses than can fit in the table, the NIC is put into a "multicast-promiscuous" mode, in which case all multicast traffic is given to the host software. Hence, both types of interfaces require that the device driver or higher-layer software check that the received frame is really wanted. Even if the interface performs perfect multicast filtering (based on the 48-bit hardware address), filtering is still required, because the mapping from a multicast IPv4 or IPv6 address to a 48-bit hardware address is not unique. Despite this imperfect address mapping and hardware filtering, multicasting is still more efficient than broadcasting.

For NICs that support a multi-entry address table, the destination address on each received frame is compared against this table, and if the address is found in the table, the frame is received and processed by the device driver. The entries of this table are managed by the device driver software in combination with other layers of the protocol stack (such as the IPv4 and IPv6 implementations). Once the NIC hardware has verified a frame as acceptable (i.e., the CRC is correct, any VLAN tags match, and the destination MAC address matches an address entry in one or more of the NIC's tables), the frame is passed to the device driver, where additional filtering is performed. First, the frame type must specify a protocol that is supported (e.g., IPv4, IPv6, ARP, etc.). Second, additional multicast filtering may be performed to check whether the host belongs to the addressed multicast group (indicated by the destination IP address). This is necessary for NICs that may generate false positives.

The device driver then passes the frame to the next layer, such as IP, if the frame type specifies an IP datagram. IP performs more filtering, based on the source and destination IP addresses, and passes the datagram up to the next layer (such as TCP or UDP) if all is well. Each time UDP receives a datagram from IP, it performs filtering based on the destination port number, and sometimes the source port number, too. If no process is currently using the destination port number, the datagram is discarded and an ICMPv4 or ICMPv6 Port Unreachable message is normally generated. (TCP performs similar filtering based on its port numbers.) If the UDP datagram has a checksum error, UDP silently discards it.

One of the primary motivations behind the development of the multicast addressing features was to avoid the overhead of broadcasting. Consider an application that is designed to use UDP broadcasts. If there are 50 hosts on the network (or VLAN), but only 20 are participating in the application, every time one of the 20 sends a UDP broadcast, the other 30 nonparticipating hosts have to process the broadcast, all the way up through the UDP layer, before the UDP datagram is discarded. The UDP datagram is discarded by these 30 hosts because the destination port number is not in use. The intent of multicasting is to reduce this load on hosts with no interest in the application. With multicasting, a host specifically joins one or more multicast groups. If possible, the NIC is told which multicast groups the host belongs to, and only those multicast frames associated with the IP-layer multicast groups are allowed through the filter in the NIC. All of this machinery reduces the overhead imposed on hosts, in exchange for additional complexity in managing multicast addresses and group memberships.

4. The Internet Group Management Protocol (IGMP) and Multicast Listener Discovery Protocol (MLD)
Two major protocols are used to allow multicast routers to learn the groups in which nearby hosts are interested: the Internet Group Management Protocol (IGMP), used by IPv4, and the Multicast Listener Discovery (MLD) protocol, used by IPv6. Both are used by hosts and routers that support multicasting, and the protocols are very similar. These protocols let the multicast routers on a LAN (VLAN) know which hosts currently belong to which multicast groups. This information is required by the routers so that they know which multicast datagrams to forward to which interfaces. In most cases, a multicast router only requires knowledge that at least one listening host is reachable through a particular interface, as link-layer multicast addressing (assuming it is supported) permits the multicast router to send link-layer multicast frames that will be received by all interested listeners. This allows a multicast router to do its job without keeping track of every individual host on each interface that might be interested in multicast traffic for a particular group.

IGMP has evolved over time, and [RFC3376] defines version 3 (the most current one at the time of writing). MLD has evolved in parallel, and its current version (2) is defined in [RFC3810]. IGMPv3 and/or MLDv2 are required for supporting SSM (Source-Specific Multicast). See [RFC4604] for more details on how these protocols are restricted when using only a single source per multicast group. Version 1 of IGMP was the first commonly used version. Version 2 added the ability to leave groups more quickly (also supported by MLDv1). IGMPv3 and MLDv2 add the ability to select the sources of multicast traffic and are required for deployment of SSM. While IGMP is a separate protocol used with IPv4, MLD is really part of ICMPv6.

Multicast routers send IGMP (MLD) requests to each attached subnet periodically to determine which groups and sources are of interest to the attached hosts. Hosts respond with reports indicating which groups and sources are of interest. Hosts may also send unsolicited reports if membership changes occur.
Such routers are interested in ascertaining which multicast groups are of interest on each of their attached interfaces.

These routers require this information in order to avoid simply broadcasting all traffic out of every interface.
In Figure above, we can see how IGMP (MLD) queries are sent by multicast routers. These are sent to the All Hosts multicast address, 224.0.0.1 (IGMP), or the All Nodes link-scope multicast address, ff02::1 (MLD), and processed by every host implementing IP multicast. Membership report messages are sent by group members (hosts) in response to the queries but may also be sent in an unsolicited way from hosts that wish to inform multicast routers that group membership(s) and/or interest in particular sources has changed. IGMPv3 reports are sent to the IGMPv3-capable multicast router address 224.0.0.22. MLDv2 reports are sent to the corresponding MLDv2 Listeners IPv6 multicast address ff02::16. Note that multicast routers themselves may also act as members when they join multicast groups.
The encapsulations for IGMP and MLD are shown in the figure below.

IGMP is encapsulated as a separate protocol in IPv4. MLD is a type of ICMPv6 message.

IGMP and MLD define two sets of protocol processing rules: those performed by hosts that are group members and those performed by multicast routers. Generally speaking, the job of the member hosts (which we will call "group members") is to spontaneously report changes in interest in multicast groups and sources and to respond to periodic queries. Multicast routers send queries to ascertain whether any interest is present on an attached link for any groups, or for a specific multicast group and source. Routers also interact with wide area multicast protocols to bring the desired traffic to the interested hosts or prohibit traffic from flowing to uninterested hosts.
4.1 IGMP and MLD Processing by Group Members (“Group Member Part”)

The group members’ portion of IGMP and MLD is designed to allow hosts to specify what groups they are interested in and whether traffic sent from particular sources should be accepted or filtered out. This is accomplished by sending reports to one or more multicast routers (and participating hosts) attached to the same subnet. Reports may be sent as a result of receiving a query, or spontaneously (unsolicited) because of a local change in reception state (e.g., an application joins or leaves a group). IGMP reports take the form shown in Figure below


The IGMPv3 membership report contains group records for N groups. Each group record indicates a multicast address and optional list of sources.
Report messages are fairly simple. They contain a vector of group records, each of which provides information about a particular multicast group, including the address of the subject group and an optional list of sources used for establishing filters.

An IGMPv3 group record includes a multicast address (group) and an optional list of sources. Groups of sources are either allowed as senders (include mode) or filtered out (exclude mode). Previous versions of IGMP reports did not include a list of sources.
Each group record contains a type, the address of the subject group, and a list of source addresses to either include or exclude. There is also support for including auxiliary data, but this feature is not used by IGMPv3. The table below reveals the significant flexibility that can be achieved using IGMPv3 report record types.


Type values for IGMP and MLD source lists indicate the filtering mode (include or exclude) and whether the source list has changed.
MLD uses the same values. A list of sources is said to be in include mode or exclude mode. In include mode, the sources in the list are the only sources from which traffic should be accepted. In exclude mode, the sources in the list are the ones to be filtered out (all others are allowed). Leaving a group can be expressed as an include-mode filter with no sources, and a simple join of a group (i.e., for any source) can be expressed as an exclude-mode filter with no sources.

The first two message types (0x01, 0x02) are known as current-state records and are used to report the current filter state in response to a query. The next two (0x03, 0x04) are known as filter-mode-change records, which indicate a change from include to exclude mode or vice versa. The last two (0x05, 0x06) are known as source-list-change records and indicate a change to the sources being handled in either exclude or include mode. The last four types are also described more generally as state-change records or state-change reports. These are sent as a result of some local state change, such as a new application being started or stopped, or a running application changing its group/source interests. Note that IGMP and MLD queries/reports themselves are never filtered. MLD reports use a structure similar to IGMP reports but accommodate larger addresses and use an ICMPv6 type code of 143.
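
For reference, the six record types with their RFC 3376 names:

/* IGMPv3 group record types, per RFC 3376 */
enum igmpv3_record_type {
    MODE_IS_INCLUDE        = 1,  /* current-state */
    MODE_IS_EXCLUDE        = 2,  /* current-state */
    CHANGE_TO_INCLUDE_MODE = 3,  /* filter-mode-change */
    CHANGE_TO_EXCLUDE_MODE = 4,  /* filter-mode-change */
    ALLOW_NEW_SOURCES      = 5,  /* source-list-change */
    BLOCK_OLD_SOURCES      = 6   /* source-list-change */
};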
When receiving a query, group members do not respond immediately. Instead, they set a random (bounded) timer to determine when to respond. During this delay interval, processes may alter their group/source interests. Any such modifications can be processed together before a timer expires to trigger the report. In this way, once the timer does expire, the status of multiple groups can more likely be merged into a single report, saving overhead. The source address used for IGMP is the primary or preferred IPv4 address of the sending interface. For MLD, the source address is a link-local IPv6 address. 

4.2 IGMP and MLD Processing by Multicast Routers (“Multicast Router Part”)

In IGMP and MLD, the job of the multicast router is to determine, for each multicast group, interface, and source list, whether at least one group member is present to receive corresponding traffic. This is accomplished by sending queries and building state describing the existence of such members based on the reports they send. This state is soft state, meaning that it is cleared after a certain amount of time if not refreshed. To build this state, multicast routers send IGMPv3 queries of the form depicted in the figure below.


The IGMPv3 query includes the multicast group address and optional list of sources. General queries use a group address of 0 and are sent to the All Hosts multicast address, 224.0.0.1. The QRV value encodes the maximum number of retransmissions the sender will use, and the QQIC field encodes the periodic query interval. Specific queries are used before terminating traffic flow for a group or source/group combination. In this case (and all cases with IGMPv2 or IGMPv1), the query is sent to the address of the subject group.
The IGMP query message is very similar to the ICMPv6 MLD query. In this case, the group (multicast) address is 32 bits in length and the Max Resp Code field is 8 bits instead of 16. The Max Resp Code field encodes the maximum amount of time the receiver of the query should delay before sending a report, encoded in 100ms units for values below 128. For values above 127, the field is encoded as shown in the figure below.

The Max Resp Code field encodes the maximum time to delay responses in 100ms units. For values above 127, an exponential value can be used to accommodate larger values.
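
Following the RFC 3376 encoding (for codes of 128 or more, the value holds a 3-bit exponent and a 4-bit mantissa), a small decoding sketch (the function name is ours):

#include <stdint.h>

/* Decode an IGMPv3 Max Resp Code into 100ms units (RFC 3376, 4.1.1). */
static unsigned decode_max_resp_code(uint8_t code)
{
    if (code < 128)
        return code;                      /* value used directly */
    unsigned mant = code & 0x0f;          /* low 4 bits */
    unsigned exp  = (code >> 4) & 0x07;   /* next 3 bits */
    return (mant | 0x10) << (exp + 3);    /* range 128 .. 31,744 */
}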

This encoding provides a possible range of 16 × 8 = 128 to 31 × 1024 = 31,744 (i.e., about 13s to 53 minutes). Using smaller values for the Max Resp Code field allows for tuning the leave latency (the elapsed time from when the last group member leaves to the time corresponding traffic ceases to be forwarded). Larger values of this field reduce the traffic load of the IGMP messages generated by members by increasing the likelihood of longer periods for reporting. The remaining fields in a query include an Internet-style checksum across the whole message, the address of the subject group, a list of sources, and the S, QRV, and QQIC fields (also present in MLD). In cases where the multicast router wishes to know about interest in all multicast groups, the Group Address field is set to 0 (such queries are called "general queries"). The S and QRV fields are used for fault tolerance and retransmission of reports. The QQIC field is the Querier's Query Interval Code. This value is the query sending period, in units of seconds and encoded using the same method as the Max Resp Code field (i.e., a range from 0 to 31,744). There are three variants of the query message that can be sent by a multicast router: general query, group-specific query, and group-and-source-specific query. The first form is used by the multicast router to update information regarding any multicast group, and for such queries the group list is empty.
Group-specific queries are similar to general queries but are specific to the identified group. The last type is essentially a group-specific query with a set of sources included. The specific queries are sent to the destination IP address of the subject group, as opposed to general queries that are sent to the All Systems multicast address (for IPv4) or the link-scope All Nodes multicast address for IPv6 (ff02::1).
The specific queries are sent in response to state-change reports in order to verify that it is appropriate for the router to take some action (e.g., to ensure that no interest remains in a particular group before constructing a filter). When receiving either filter-mode-change records or source-list-change records, the multicast router arranges to add new traffic sources and may be able to filter out traffic from certain sources. In cases where the multicast router is prepared to begin filtering out traffic that was flowing previously, it uses the group-specific query and group-and-source-specific query first. If these queries elicit no reports, the router is free to begin filtering out the corresponding traffic. Because such changes can significantly affect the flow of multicast traffic, state-change reports and specific queries are retransmitted.


Reference
1. TCP/IP Illustrated, Volume 1
2. RFC 3376, IGMPv3

Tuesday, August 23, 2016

Socket

1. Socket and fundamental concepts
1.1 Definition
  • A socket is a communication mechanism that allows client/server systems to be developed either locally, on a single machine, or across networks. Linux functions such as printing, connecting to databases, and serving web pages, as well as network utilities such as rlogin for remote login and ftp for file transfer, usually use sockets to communicate.
  • Another definition: a socket is just a logical endpoint for communication. Sockets exist at the transport layer. You can send and receive things on a socket, and you can bind and listen to a socket. A socket is specific to a protocol, machine, and port, and is addressed as such in the header of a packet.
  • A simple definition: A network socket is one endpoint in a communication flow between two programs running over a network
A simple socket model

1.2. Types of Socket
There are two types of socket in wide use: stream sockets and datagram sockets.
  • Stream sockets (based on TCP): stream-oriented. Delivery in a networked environment is guaranteed, and ordering is preserved: if you send three items "A, B, C" through a stream socket, they arrive in the same order, "A, B, C". These sockets use TCP (Transmission Control Protocol) for data transmission. If delivery is impossible, the sender receives an error indicator. Data records do not have any boundaries.
  • Datagram sockets (based on UDP): message-oriented. Delivery in a networked environment is not guaranteed. They are connectionless: you do not need an open connection as with stream sockets; you build a packet with the destination information and send it out. They use UDP (User Datagram Protocol).
1.3 Client-server model
Most network applications use the client-server architecture, which refers to two processes or applications that communicate with each other to exchange information. One of the two processes acts as the client and the other acts as the server.
  • Client process: the process that typically makes a request for information. After getting the response, this process may terminate or do some other processing. Example: an Internet browser (Chrome, Firefox, ...) works as a client application that sends a request to a web server to get one HTML web page.
  • Server process: the process that takes requests from clients. After getting a request from a client, this process performs the required processing, gathers the requested information, and sends it to the requesting client. Once done, it becomes ready to serve another client. Server processes are always alert and ready to serve incoming requests. Example: a web server waiting for requests from Internet browsers.
There are two types of server: iterative and concurrent.
  • Iterative server: the simplest form of server, where the server process serves one client and, after completing that request, takes the request of another client. Meanwhile, other clients keep waiting.
  • Concurrent server: this type of server runs multiple concurrent processes to serve many requests at a time, because one request may take long and other clients cannot wait that long. The simplest way to write a concurrent server under Unix is to fork a child process to handle each client separately.
1.4 Ports and Services
When a client process wants to connect to a server, the client must have a way of identifying the server. If the client knows the 32-bit Internet address (IPv4, or the 128-bit Internet address of IPv6) of the host on which the server resides, it can contact that host. But how does the client identify the particular server process running on that host?


How to request a specific service

To resolve the problem of identifying a particular server process running on a host, both TCP and UDP define a group of well-known ports in the range 0-1023.
For other purposes, ports are defined as integer numbers in the range 1024-65535.
The port assignments of network services can be found in the file /etc/services. If you are writing your own server, care must be taken to assign a port to it: make sure the port is not already assigned to another server.
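
The services database can also be queried programmatically; a small sketch using getservbyname():

#include <stdio.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void)
{
    /* Look up the well-known TCP port of the ftp service
       in the services database (/etc/services). */
    struct servent *se = getservbyname("ftp", "tcp");
    if (se != NULL)
        printf("ftp/tcp port: %d\n", ntohs(se->s_port));
    return 0;
}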

2. Socket connection
First, a server process gives the socket a name. Local sockets are given a filename in the Linux file system, often in /tmp or /usr/tmp. For network sockets, the name is a service identifier (a port number) that allows Linux to route incoming connections for a particular port to the correct server process.
2.1 Socket parameter
  • The protocol (TCP, UDP, ...)
  • The local address
  • The remote address
  • The local port number
  • The remote port number
Socket Parameter

2.2 Steps to establish a socket pair
The usual sequence: the server calls socket(), bind(), listen(), and accept(); the client calls socket() and connect(); both sides then exchange data with read()/write() and finally close().

Establish a socket connection

2.3 Core functions
Creating a socket:
#include <sys/socket.h>
int socket(int domain, int type, int protocol);

Socket addresses: each socket domain requires its own address format. For the UNIX domain:
struct sockaddr_un {
    sa_family_t sun_family;     /* AF_UNIX */
    char        sun_path[108];  /* pathname; array size varies by system */
};
For the Internet (IPv4) domain:
struct sockaddr_in {
    short int          sin_family;  /* AF_INET */
    unsigned short int sin_port;    /* port number */
    struct in_addr     sin_addr;    /* Internet address */
};
The IP address structure, in_addr, is defined as follows:
struct in_addr {
    unsigned long int s_addr;
};
Naming a Socket
To make a socket available for use by other processes, a server program needs to give the socket a name
      #include <sys/socket.h>
      int bind(int socket, const struct sockaddr *address, size_t address_len);
Creating a Socket Queue
      #include <sys/socket.h>
      int listen(int socket, int backlog);
Accepting Connections
Once a server program has created and named a socket, it can wait for connections to be made to the socket by using the accept call
      #include <sys/socket.h>
      int accept(int socket, struct sockaddr *address, size_t *address_len);
Requesting connections
Client programs connect to servers by establishing a connection between an unnamed socket and the server's listening socket.
      #include <sys/socket.h>
      int connect(int socket, const struct sockaddr *address, size_t address_len);
Close a socket
You should always close the socket at both ends. 

3. Unix internal Socket and Unix network Socket

3.1 Unix internal Socket - Creating local Socket
You can download source here
3.2 Unix network Socket - Creating network Socket
You can download source here
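
Since the full sources are only linked, here is a minimal sketch of a network (TCP) server along the lines described above (port 9734 is an arbitrary example; error checking is omitted for brevity):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int server_sockfd, client_sockfd;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);
    char ch;

    /* Create an unnamed stream socket */
    server_sockfd = socket(AF_INET, SOCK_STREAM, 0);

    /* Name the socket: any local interface, port 9734 */
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    server_addr.sin_port = htons(9734);
    bind(server_sockfd, (struct sockaddr *) &server_addr,
         sizeof(server_addr));

    /* Create a connection queue and serve clients one by one */
    listen(server_sockfd, 5);
    for (;;) {
        client_sockfd = accept(server_sockfd,
                               (struct sockaddr *) &client_addr,
                               &client_len);
        if (read(client_sockfd, &ch, 1) == 1) {
            ch++;                 /* trivial service: increment one byte */
            write(client_sockfd, &ch, 1);
        }
        close(client_sockfd);
    }
}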

4. Multiple client using fork()
4.1 Concurrent server
A server can serve more than one client at a time. One way to do this is to fork() a child process to handle each client, as shown below.
Client connects to child server

Two clients connect to two child servers

4.2 Status of client/server before the call to accept returns

Status of client/server when the client requests a connection

Status of client/server after the return from accept

Status of client/server after fork returns

Status of client/server after parent and child close the appropriate sockets

4.3 Outline for a typical concurrent server (Socket() here is an error-checking wrapper around socket())
pid_t pid;
int listenfd, connfd;

listenfd = Socket( ... );
/* fill in sockaddr_in{} with server's well-known port */
bind(listenfd, ... );
listen(listenfd, LISTENQ);

for ( ; ; )
{
    connfd = accept(listenfd, ... );    /* probably blocks */
    if ( (pid = fork()) == 0)
    {
        close(listenfd);    /* child closes listening socket */
        doit(connfd);       /* process the request */
        close(connfd);      /* done with this client */
        exit(0);            /* child terminates */
    }
    close(connfd);          /* parent closes connected socket */
}


4.4 Prevent zombie process
Using fork() to create a child process to handle each connection from clients can leave "zombie processes" behind. Obviously we do not want to leave zombies around: they take up space in the kernel, and eventually we can run out of processes. We prevent this by handling the SIGCHLD signal. Whenever we fork children, we must wait for them so that they do not become zombies. To do this, we establish a signal handler to catch SIGCHLD and, within the handler, we call wait. We establish the signal handler by adding the function call
Signal(SIGCHLD, sig_chld);
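
A sketch of such a handler (Signal() above is Stevens' error-checking wrapper around the signal-installation call; using waitpid() with WNOHANG, rather than wait(), reaps all terminated children without blocking):

#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

/* SIGCHLD handler: reap every terminated child so no zombies remain. */
static void sig_chld(int signo)
{
    int saved_errno = errno;            /* waitpid() may clobber errno */

    (void) signo;                       /* unused */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        continue;                       /* reap all available children */

    errno = saved_errno;
}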

References
Beginning Linux Programming
Unix network programming