From e04c5d76b0cfb66cadd900cf147526f2271884b8 Mon Sep 17 00:00:00 2001
From: Marcelo Tosatti <mtosatti@redhat.com>
Date: Wed, 10 Jul 2013 22:21:57 -0300
Subject: remove sched notifier for cross-cpu migrations

Linux as a guest on KVM hypervisor, the only user of the pvclock
vsyscall interface, does not require notification on task migration
because:

1. cpu ID number maps 1:1 to per-CPU pvclock time info.
2. per-CPU pvclock time info is updated if the
   underlying CPU changes.
3. that version is increased whenever underlying CPU
   changes.

Which is sufficient to guarantee nanoseconds counter
is calculated properly.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/sched.h | 8 --------
 1 file changed, 8 deletions(-)

(limited to 'include/linux/sched.h')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50d04b92ceda..bfc809d51745 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -107,14 +107,6 @@ extern unsigned long this_cpu_load(void);
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
-/* Notifier for when a task gets migrated to a new CPU */
-struct task_migration_notifier {
-	struct task_struct *task;
-	int from_cpu;
-	int to_cpu;
-};
-extern void register_task_migration_notifier(struct notifier_block *n);
-
 extern unsigned long get_parent_ip(unsigned long addr);
 
 extern void dump_cpu_task(int cpu);
-- 
cgit v1.2.3


From 62470419e993f8d9d93db0effd3af4296ecb79a5 Mon Sep 17 00:00:00 2001
From: Michael Wang <wangyun@linux.vnet.ibm.com>
Date: Thu, 4 Jul 2013 12:55:51 +0800
Subject: sched: Implement smarter wake-affine logic

The wake-affine scheduler feature is currently always trying to pull
the wakee close to the waker. In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.

Testing shows it can benefit hackbench up to 15%.

However, the feature is somewhat blind, from which some workloads
such as pgbench suffer. It's also time-consuming algorithmically.

Testing shows it can damage pgbench up to 50% - far more than the
benefit it brings in the best case.

So wake-affine should be smarter and it should realize when to
stop its thankless effort at trying to find a suitable CPU to wake on.

This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.

So a high 'wakee_flips' value means the task has more than one
wakee, and the bigger the number, the higher the wakeup frequency.

Now when making the decision on whether to pull or not, pay attention to
the wakee with a high 'wakee_flips', pulling such a task may benefit
the wakee. Also imply that the waker will face cruel competition later,
it could be very cruel or very fast depends on the story behind
'wakee_flips', waker therefore suffers.

Furthermore, if waker also has a high 'wakee_flips', that implies that
multiple tasks rely on it, then waker's higher latency will damage all
of them, so pulling wakee seems to be a bad deal.

Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
higher and higher, the cost of pulling seems to be worse and worse.

The patch therefore helps the wake-affine feature to stop its pulling
work when:

	wakee->wakee_flips > factor &&
	waker->wakee_flips > (factor * wakee->wakee_flips)

The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node will lead to more pulling since the trial becomes more
severe.

After applying the patch, pgbench shows up to 40% improvements and no regressions.

Tested with 12 cpu x86 server and tip 3.10.0-rc7.

The percentages in the final column highlight the areas with the biggest wins,
all other areas improved as well:

	pgbench		    base	smart

	| db_size | clients |  tps  |	|  tps  |
	+---------+---------+-------+   +-------+
	| 22 MB   |       1 | 10598 |   | 10796 |
	| 22 MB   |       2 | 21257 |   | 21336 |
	| 22 MB   |       4 | 41386 |   | 41622 |
	| 22 MB   |       8 | 51253 |   | 57932 |
	| 22 MB   |      12 | 48570 |   | 54000 |
	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
	| 7484 MB |       1 |  8951 |   |  9193 |
	| 7484 MB |       2 | 19233 |   | 19240 |
	| 7484 MB |       4 | 37239 |   | 37302 |
	| 7484 MB |       8 | 46087 |   | 50018 |
	| 7484 MB |      12 | 42054 |   | 48763 |
	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
	| 15 GB   |       1 |  8845 |   |  9104 |
	| 15 GB   |       2 | 19094 |   | 19162 |
	| 15 GB   |       4 | 36979 |   | 36983 |
	| 15 GB   |       8 | 46087 |   | 49977 |
	| 15 GB   |      12 | 41901 |   | 48591 |
	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'include/linux/sched.h')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50d04b92ceda..4f163a8ffabf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1034,6 +1034,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	struct task_struct *last_wakee;
+	unsigned long wakee_flips;
+	unsigned long wakee_flip_decay_ts;
 #endif
 	int on_rq;
 
-- 
cgit v1.2.3


From e1403b8edf669ff49bbdf602cc97fefa2760cb15 Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Wed, 11 Sep 2013 14:20:06 -0700
Subject: include/linux/sched.h: don't use task->pid/tgid in
 same_thread_group/has_group_leader_pid

task_struct->pid/tgid should go away.

1. Change same_thread_group() to use task->signal for comparison.

2. Change has_group_leader_pid(task) to compare task_pid(task) with
   signal->leader_pid.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/sched.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'include/linux/sched.h')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ce1e1c0aaa33..45f254dddafc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2169,15 +2169,15 @@ static inline bool thread_group_leader(struct task_struct *p)
  * all we care about is that we have a task with the appropriate
  * pid, we don't actually care if we have the right task.
  */
-static inline int has_group_leader_pid(struct task_struct *p)
+static inline bool has_group_leader_pid(struct task_struct *p)
 {
-	return p->pid == p->tgid;
+	return task_pid(p) == p->signal->leader_pid;
 }
 
 static inline
-int same_thread_group(struct task_struct *p1, struct task_struct *p2)
+bool same_thread_group(struct task_struct *p1, struct task_struct *p2)
 {
-	return p1->tgid == p2->tgid;
+	return p1->signal == p2->signal;
 }
 
 static inline struct task_struct *next_thread(const struct task_struct *p)
-- 
cgit v1.2.3


From 519e52473ebe9db5cdef44670d5a97f1fd53d721 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Sep 2013 15:13:42 -0700
Subject: mm: memcg: enable memcg OOM killer only for user faults

System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/sched.h | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'include/linux/sched.h')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45f254dddafc..9ce1fa53031f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1393,6 +1393,9 @@ struct task_struct {
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	} memcg_batch;
 	unsigned int memcg_kmem_skip_account;
+	struct memcg_oom_info {
+		unsigned int may_oom:1;
+	} memcg_oom;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
-- 
cgit v1.2.3


From 3812c8c8f3953921ef18544110dafc3505c1ac62 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Sep 2013 15:13:44 -0700
Subject: mm: memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/sched.h | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'include/linux/sched.h')

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ce1fa53031f..6682da36b293 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1395,6 +1395,10 @@ struct task_struct {
 	unsigned int memcg_kmem_skip_account;
 	struct memcg_oom_info {
 		unsigned int may_oom:1;
+		unsigned int in_memcg_oom:1;
+		unsigned int oom_locked:1;
+		int wakeups;
+		struct mem_cgroup *wait_on_memcg;
 	} memcg_oom;
 #endif
 #ifdef CONFIG_UPROBES
-- 
cgit v1.2.3