From 99bbc70741903c063b3ccad90a3e06fc55df9245 Mon Sep 17 00:00:00 2001
From: Willem de Bruijn
Date: Mon, 20 May 2013 04:02:32 +0000
Subject: rps: selective flow shedding during softnet overflow

A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.

This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drop while above the threshold.

Tested:
Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.

The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is an extreme, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).

Signed-off-by: Willem de Bruijn
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a94a5a0ab122..7dd535d4b41e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1778,6 +1778,19 @@ static inline int unregister_gifconf(unsigned int family)
 	return register_gifconf(family, NULL);
 }
 
+#ifdef CONFIG_NET_FLOW_LIMIT
+#define FLOW_LIMIT_HISTORY	(1 << 8)	/* must be ^2 */
+struct sd_flow_limit {
+	u64			count;
+	unsigned int		num_buckets;
+	unsigned int		history_head;
+	u16			history[FLOW_LIMIT_HISTORY];
+	u8			buckets[];
+};
+
+extern int netdev_flow_limit_table_len;
+#endif /* CONFIG_NET_FLOW_LIMIT */
+
 /*
  * Incoming packets are placed on per-cpu queues
  */
@@ -1807,6 +1820,10 @@ struct softnet_data {
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
+
+#ifdef CONFIG_NET_FLOW_LIMIT
+	struct sd_flow_limit	*flow_limit;
+#endif
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
-- 
cgit v1.2.3

From 42e52bf9e3ae80fd44b21ddfcd64c54e6db2ff76 Mon Sep 17 00:00:00 2001
From: Jiri Pirko
Date: Sat, 25 May 2013 04:12:10 +0000
Subject: net: add netnotifier event for upper device change

Now when upper device is changed, event is not propagated via RT Netlink
to userspace. Userspace might never know about the change. Fix this by
adding upper-device-change notifier event.

Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0ebd63ae2cc8..ea7b6bce9ea0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1593,6 +1593,7 @@ struct packet_offload {
 #define NETDEV_RELEASE		0x0012
 #define NETDEV_NOTIFY_PEERS	0x0013
 #define NETDEV_JOIN		0x0014
+#define NETDEV_CHANGEUPPER	0x0015
 
 extern int register_netdevice_notifier(struct notifier_block *nb);
 extern int unregister_netdevice_notifier(struct notifier_block *nb);
-- 
cgit v1.2.3

From 0d89d2035fe063461a5ddb609b2c12e7fb006e44 Mon Sep 17 00:00:00 2001
From: Simon Horman
Date: Thu, 23 May 2013 21:02:52 +0000
Subject: MPLS: Add limited GSO support

In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should
do the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
  skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
  end of the L3 header, including the MPLS label stack.

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwitch tree but it is intended that it be added to the Open vSwitch code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_pull() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross
Cc: Pravin B Shelar
Signed-off-by: Simon Horman
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ea7b6bce9ea0..6b2bb460d1d7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1088,6 +1088,8 @@ struct net_device {
 	 * need to set them appropriately.
 	 */
 	netdev_features_t	hw_enc_features;
+	/* mask of fetures inheritable by MPLS */
+	netdev_features_t	mpls_features;
 
 	/* Interface index. Unique device identifier	*/
 	int			ifindex;
-- 
cgit v1.2.3

From 351638e7deeed2ec8ce451b53d33921b3da68f83 Mon Sep 17 00:00:00 2001
From: Jiri Pirko
Date: Tue, 28 May 2013 01:30:21 +0000
Subject: net: pass info struct via netdevice notifier

So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6b2bb460d1d7..13a34848b5e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,19 @@ struct packet_offload {
 
 extern int register_netdevice_notifier(struct notifier_block *nb);
 extern int unregister_netdevice_notifier(struct notifier_block *nb);
+
+struct netdev_notifier_info {
+	struct net_device *dev;
+};
+
+static inline struct net_device *
+netdev_notifier_info_to_dev(const struct netdev_notifier_info *info)
+{
+	return info->dev;
+}
+
+extern int call_netdevice_notifiers_info(unsigned long val, struct net_device *dev,
+					 struct netdev_notifier_info *info);
 extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 
 
-- 
cgit v1.2.3

From be9efd3653284f2827fd82861e8e9db9a8f726e1 Mon Sep 17 00:00:00 2001
From: Jiri Pirko
Date: Tue, 28 May 2013 01:30:22 +0000
Subject: net: pass changed flags along with NETDEV_CHANGE event
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Use new netdevice notifier infrastructure to pass along changed flags.

Signed-off-by: Timo Teräs
Signed-off-by: Jiri Pirko

v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 13a34848b5e1..850271809a9e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1604,6 +1604,11 @@ struct netdev_notifier_info {
 	struct net_device *dev;
 };
 
+struct netdev_notifier_change_info {
+	struct netdev_notifier_info info; /* must be first */
+	unsigned int flags_changed;
+};
+
 static inline struct net_device *
 netdev_notifier_info_to_dev(const struct netdev_notifier_info *info)
 {
-- 
cgit v1.2.3

From 75538c2b85cf22eb9af6adfaf26ed7219025adeb Mon Sep 17 00:00:00 2001
From: Cong Wang
Date: Wed, 29 May 2013 11:30:50 +0800
Subject: net: always pass struct netdev_notifier_info to netdevice notifiers

commit 351638e7deeed2ec8ce451b53d3 (net: pass info struct via netdevice
notifier) breaks booting of my KVM guest, because we still fail to pass
struct netdev_notifier_info in several places. This patch completes it.

Cc: Jiri Pirko
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 850271809a9e..8f967e34142b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1609,6 +1609,12 @@ struct netdev_notifier_change_info {
 	unsigned int flags_changed;
 };
 
+static inline void netdev_notifier_info_init(struct netdev_notifier_info *info,
+					     struct net_device *dev)
+{
+	info->dev = dev;
+}
+
 static inline struct net_device *
 netdev_notifier_info_to_dev(const struct netdev_notifier_info *info)
 {
-- 
cgit v1.2.3

From af12fa6e46aa651e7b86a4c4117b562518fef184 Mon Sep 17 00:00:00 2001
From: Eliezer Tamir
Date: Mon, 10 Jun 2013 11:39:41 +0300
Subject: net: add napi_id and hash

Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.

Signed-off-by: Eliezer Tamir
Signed-off-by: Eric Dumazet
Tested-by: Willem de Bruijn
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8f967e34142b..39bbd462d68e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,12 +324,15 @@ struct napi_struct {
 	struct sk_buff		*gro_list;
 	struct sk_buff		*skb;
 	struct list_head	dev_list;
+	struct hlist_node	napi_hash_node;
+	unsigned int		napi_id;
 };
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
+	NAPI_STATE_HASHED,	/* In NAPI hash */
 };
 
 enum gro_result {
@@ -445,6 +448,32 @@ static inline bool napi_reschedule(struct napi_struct *napi)
 extern void __napi_complete(struct napi_struct *n);
 extern void napi_complete(struct napi_struct *n);
 
+/**
+ *	napi_by_id - lookup a NAPI by napi_id
+ *	@napi_id: hashed napi_id
+ *
+ * lookup @napi_id in napi_hash table
+ * must be called under rcu_read_lock()
+ */
+extern struct napi_struct *napi_by_id(unsigned int napi_id);
+
+/**
+ *	napi_hash_add - add a NAPI to global hashtable
+ *	@napi: napi context
+ *
+ * generate a new napi_id and store a @napi under it in napi_hash
+ */
+extern void napi_hash_add(struct napi_struct *napi);
+
+/**
+ *	napi_hash_del - remove a NAPI from global table
+ *	@napi: napi context
+ *
+ * Warning: caller must observe rcu grace period
+ * before freeing memory containing @napi
+ */
+extern void napi_hash_del(struct napi_struct *napi);
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: napi context
-- 
cgit v1.2.3

From 060212928670593fb89243640bf05cf89560b023 Mon Sep 17 00:00:00 2001
From: Eliezer Tamir
Date: Mon, 10 Jun 2013 11:39:50 +0300
Subject: net: add low latency socket poll

Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck
Signed-off-by: Jesse Brandeburg
Signed-off-by: Eliezer Tamir
Acked-by: Eric Dumazet
Tested-by: Willem de Bruijn
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 39bbd462d68e..2ecb96d9a1e5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -971,6 +971,9 @@ struct net_device_ops {
 						     struct netpoll_info *info,
 						     gfp_t gfp);
 	void			(*ndo_netpoll_cleanup)(struct net_device *dev);
+#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+	int			(*ndo_ll_poll)(struct napi_struct *dev);
 #endif
 	int			(*ndo_set_vf_mac)(struct net_device *dev,
 						  int queue, u8 *mac);
-- 
cgit v1.2.3

From dffebd2c5cd528a136b276a2a75c56222312d7a4 Mon Sep 17 00:00:00 2001
From: Narendra K
Date: Mon, 10 Jun 2013 19:34:03 +0530
Subject: doc:networking: Update comment for dev_id field in netdevice.h

This patch updates the comment for 'dev_id' field in
'include/linux/netdevice.h' to reflect the intended usage of 'dev_id'.

References: http://marc.info/?l=linux-netdev&m=136992115300526&w=2
References: http://marc.info/?l=linux-netdev&m=137062569014612&w=2

Signed-off-by: Narendra K
Reviewed-by: Ben Hutchings
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2ecb96d9a1e5..e5d65573b4d6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1174,8 +1174,10 @@ struct net_device {
 	unsigned char		addr_assign_type; /* hw address assignment type */
 	unsigned char		addr_len;	/* hardware address length	*/
 	unsigned char		neigh_priv_len;
-	unsigned short          dev_id;		/* for shared network cards */
-
+	unsigned short          dev_id;		/* Used to differentiate devices
+						 * that share the same link
+						 * layer address
+						 */
 	spinlock_t		addr_list_lock;
 	struct netdev_hw_addr_list	uc;	/* Unicast mac addresses */
 	struct netdev_hw_addr_list	mc;	/* Multicast mac addresses */
-- 
cgit v1.2.3

From 5f121b9a83b499a61ed44e5ba619c7de8f7271ad Mon Sep 17 00:00:00 2001
From: Willem de Bruijn
Date: Thu, 13 Jun 2013 15:29:38 -0400
Subject: net-rps: fixes for rps flow limit

Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf

Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY

The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
buffer size (u8).

As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement for
old_flow: a steady state.

This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears.
But, because the 256th packet overflowed the bucket to 0: no packets
are dropped.

Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.

Reported-by: Fengguang Wu (first item)
Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e5d65573b4d6..8c9fcc42502a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1840,7 +1840,7 @@ static inline int unregister_gifconf(unsigned int family)
 }
 
 #ifdef CONFIG_NET_FLOW_LIMIT
-#define FLOW_LIMIT_HISTORY	(1 << 8)	/* must be ^2 */
+#define FLOW_LIMIT_HISTORY	(1 << 7)  /* must be ^2 and !overflow buckets */
 struct sd_flow_limit {
 	u64			count;
 	unsigned int		num_buckets;
@@ -1883,7 +1883,7 @@ struct softnet_data {
 	struct napi_struct	backlog;
 
 #ifdef CONFIG_NET_FLOW_LIMIT
-	struct sd_flow_limit	*flow_limit;
+	struct sd_flow_limit __rcu *flow_limit;
 #endif
 };
 
-- 
cgit v1.2.3

From 1d8faf48c74b8329a0322dc4b2a2030ae5003c86 Mon Sep 17 00:00:00 2001
From: Rony Efraim
Date: Thu, 13 Jun 2013 13:19:10 +0300
Subject: net/core: Add VF link state control

Add netlink directives and ndo entry to allow for controlling
VF link, which can be in one of three states:

Auto - VF link state reflects the PF link state (default)

Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down

Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF

Signed-off-by: Rony Efraim
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8c9fcc42502a..09b4188c1ea7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -829,6 +829,7 @@ struct netdev_fcoe_hbainfo {
  * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
  * int (*ndo_get_vf_config)(struct net_device *dev,
  *			    int vf, struct ifla_vf_info *ivf);
+ * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
@@ -986,6 +987,8 @@ struct net_device_ops {
 	int			(*ndo_get_vf_config)(struct net_device *dev,
 						     int vf,
 						     struct ifla_vf_info *ivf);
+	int			(*ndo_set_vf_link_state)(struct net_device *dev,
+							 int vf, int link_state);
 	int			(*ndo_set_vf_port)(struct net_device *dev,
 						   int vf,
 						   struct nlattr *port[]);
-- 
cgit v1.2.3

From 8b80cda536ea9bceec0364e897868a30ee13b992 Mon Sep 17 00:00:00 2001
From: Eliezer Tamir
Date: Wed, 10 Jul 2013 17:13:26 +0300
Subject: net: rename ll methods to busy-poll

Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all users of these functions.
Update comments and defines in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bb82871b8494..0741a1e919a5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -974,7 +974,7 @@ struct net_device_ops {
 	void			(*ndo_netpoll_cleanup)(struct net_device *dev);
 #endif
 #ifdef CONFIG_NET_LL_RX_POLL
-	int			(*ndo_ll_poll)(struct napi_struct *dev);
+	int			(*ndo_busy_poll)(struct napi_struct *dev);
 #endif
 	int			(*ndo_set_vf_mac)(struct net_device *dev,
 						  int queue, u8 *mac);
-- 
cgit v1.2.3

From e0d1095ae3405404d247afb00233ef837d58da83 Mon Sep 17 00:00:00 2001
From: Cong Wang
Date: Thu, 1 Aug 2013 11:10:25 +0800
Subject: net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL

Eliezer renamed several *ll_poll to *busy_poll, but missed
CONFIG_NET_LL_RX_POLL, so to avoid confusion, rename it too.

Cc: Eliezer Tamir
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0741a1e919a5..9a4156845e93 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -973,7 +973,7 @@ struct net_device_ops {
 						     gfp_t gfp);
 	void			(*ndo_netpoll_cleanup)(struct net_device *dev);
 #endif
-#ifdef CONFIG_NET_LL_RX_POLL
+#ifdef CONFIG_NET_RX_BUSY_POLL
 	int			(*ndo_busy_poll)(struct napi_struct *dev);
 #endif
 	int			(*ndo_set_vf_mac)(struct net_device *dev,
-- 
cgit v1.2.3
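
Editorial note on the two rps flow-limit patches in this series: the following is a user-space sketch of the per-flow accounting they describe, not the kernel implementation. The names flow_limit_sim, sim_init, sim_packet and sim_single_flow, the 16-bucket table and the fixed hash value 7 are assumptions made for illustration only. The sketch keeps the u8 bucket counters and the history ring from struct sd_flow_limit and makes the history length a parameter, so that the bucket overflow described in "net-rps: fixes for rps flow limit" (history of 1 << 8) can be compared with the reduced history of 1 << 7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_HISTORY 256u	/* largest history length exercised here */
#define NUM_BUCKETS 16u		/* hypothetical small table; must be a power of two */

/* Hypothetical stand-in for struct sd_flow_limit with a runtime history length. */
struct flow_limit_sim {
	unsigned int history_len;	/* power of two, <= MAX_HISTORY */
	unsigned int history_head;
	uint64_t drops;			/* plays the role of the count field */
	uint16_t history[MAX_HISTORY];
	uint8_t buckets[NUM_BUCKETS];	/* u8 counters, as in the patch */
};

static void sim_init(struct flow_limit_sim *fl, unsigned int history_len)
{
	assert(history_len != 0 && history_len <= MAX_HISTORY);
	assert((history_len & (history_len - 1)) == 0);	/* must be ^2 */
	memset(fl, 0, sizeof(*fl));
	fl->history_len = history_len;
}

/* Account one packet of the given flow hash; return true if it would be shed. */
static bool sim_packet(struct flow_limit_sim *fl, uint32_t hash)
{
	unsigned int new_flow = hash & (NUM_BUCKETS - 1);
	unsigned int old_flow = fl->history[fl->history_head];

	/* Overwrite the oldest history slot and advance the ring. */
	fl->history[fl->history_head] = (uint16_t)new_flow;
	fl->history_head = (fl->history_head + 1) & (fl->history_len - 1);

	/* The flow leaving the window gives back one count. */
	if (fl->buckets[old_flow])
		fl->buckets[old_flow]--;

	/* The u8 increment wraps to 0 once the counter would reach 256. */
	if (++fl->buckets[new_flow] > (fl->history_len >> 1)) {
		fl->drops++;
		return true;
	}
	return false;
}

/* Feed n packets of one flow and return how many were shed. */
static uint64_t sim_single_flow(unsigned int history_len, unsigned int n)
{
	struct flow_limit_sim fl;

	sim_init(&fl, history_len);
	for (unsigned int i = 0; i < n; i++)
		sim_packet(&fl, 7);	/* arbitrary fixed hash: a single flow */
	return fl.drops;
}
```

Under these assumptions, sim_single_flow(256, n) stops shedding once the u8 counter wraps to 0, whereas sim_single_flow(128, n) keeps shedding the dominant flow for as long as it persists, which is the behavior the second patch aims for.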