summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2018-01-19mm: larger stack guard gap, between vmasSri Krishna chowdary
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Bug 1946430 Change-Id: I9a66aabc34b687996fb971e01bb0ef30a3d4de7d Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] [wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com> Reviewed-on: https://git-master.nvidia.com/r/1509390 GVS: Gerrit_Virtual_Submit Tested-by: Bibek Basu <bbasu@nvidia.com> Reviewed-by: Bibek Basu <bbasu@nvidia.com>
2016-10-26mm: remove gup_flags FOLL_WRITE games from __get_user_pages()Linus Torvalds
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask; s/faultin_page/__get_user_page] Signed-off-by: Willy Tarreau <w@1wt.eu> Change-Id: I6fbb1abf656ff7e05ec4c65f07dbbdd694546fb4 Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> Signed-off-by: Sumit Gupta <sumitg@nvidia.com> Reviewed-on: http://git-master/r/1241321 GVS: Gerrit_Virtual_Submit Reviewed-by: Bibek Basu <bbasu@nvidia.com> Tested-by: Bibek Basu <bbasu@nvidia.com>
2014-11-05mm: compaction: don't restrict page isolation during CMA page migrationKrishna Reddy
don't limit the number of pages isolated during CMA page migration. Bug 1550455 Change-Id: Ib6edcb090b30212302543098a05b85e669ade45d Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-on: http://git-master/r/495283 (cherry picked from commit ec9ed2b5c4418c658fe2a3b00b0baf6179b3b452) Reviewed-on: http://git-master/r/592898 GVS: Gerrit_Virtual_Submit
2014-11-05mm: avoid allocating CMA memory for stackKrishna Reddy
allocation of CMA memory for user stack would cause permanent migration failures and result in CMA allocation failures. Bug 1550455 Change-Id: I75ac13416dbcf1810c89641cefdd0d56726cc36a Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-on: http://git-master/r/495322 (cherry picked from commit 3c59fb443605d5975b9d8e250c4ca52ae1650fe5) Reviewed-on: http://git-master/r/592897 GVS: Gerrit_Virtual_Submit
2014-06-26mm: Add NULL check before de-referencing vmaBharat Nihalani
This prevents the following crash seen during boot-up: Unable to handle kernel NULL pointer dereference at virtual address 00000041 pgd = ffffffc037f68000 [00000041] *pgd=0000000000000000 Internal error: Oops: 96000005 [#1] PREEMPT SMP Modules linked in: inv_bmp180 inv_ak8975 inv_mpu CPU: 0 PID: 1457 Comm: IntentService[F Not tainted 3.10.40-g4abcb3f #1 task: ffffffc0393e0080 ti: ffffffc0393e8000 task.ti: ffffffc0393e8000 PC is at follow_page_mask+0x1c/0x378 LR is at __get_user_pages.part.88+0x124/0x700 pc : [<ffffffc00016c314>] lr : [<ffffffc00016e1d4>] pstate: 40000045 Bug 1525355 Change-Id: Ieed6942d7beb32964484f97d5cc671b42c4b60cb Signed-off-by: Bharat Nihalani <bnihalani@nvidia.com> Reviewed-on: http://git-master/r/427723 Tested-by: Vandana Salve <vsalve@nvidia.com> Reviewed-by: Krishna Reddy <vdumpa@nvidia.com>
2014-06-18oom_kill: add rcu_read_lock() into find_lock_task_mm()Oleg Nesterov
find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Sergey Dyasly <dserrg@gmail.com> Cc: Sameer Nanda <snanda@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: "Ma, Xindong" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I5f214dec13b34e05c4b5fc8bc29df3ab7400efa1 Reviewed-on: http://git-master/r/421705 Reviewed-by: Sri Krishna Chowdary <schowdary@nvidia.com> GVS: Gerrit_Virtual_Submit Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com> Tested-by: Kerwin Wan <kerwinw@nvidia.com>
2014-06-18oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()Oleg Nesterov
At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need "bool ret" and "break". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: "Ma, Xindong" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Iee512a4e3446b1ec8fb7fcc434f1cf18a13a5645 Reviewed-on: http://git-master/r/421704 Reviewed-by: Sri Krishna Chowdary <schowdary@nvidia.com> GVS: Gerrit_Virtual_Submit Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com> Tested-by: Kerwin Wan <kerwinw@nvidia.com>
2014-06-11mm: get_user_pages: migrate out CMA pages when FOLL_DURABLE flag is setVandana Salve
When __get_user_pages() is called with FOLL_DURABLE flag, ensure that no page in CMA pageblocks gets locked. This workarounds the permanent migration failures caused by locking the pages by get_user_pages() call for a long period of time. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> bug 1517584 Change-Id: I11b7c87e78f1022d6fded85a1ed6bac73c5f0a7c Signed-off-by: Vandana Salve <vsalve@nvidia.com> Reviewed-on: http://git-master/r/421678 Reviewed-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-06-11mm: get_user_pages: use NON-MOVABLE pages when FOLL_DURABLE flag is setVandana Salve
Ensure that newly allocated pages, which are faulted in in FOLL_DURABLE mode comes from non-movalbe pageblocks, to workaround migration failures with CMA Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> bug 1517584 Change-Id: I76d2185cc7e77992db585a71efaa06a5c0105a76 Signed-off-by: Vandana Salve <vsalve@nvidia.com> Reviewed-on: http://git-master/r/421677 Reviewed-by: Riham Haidar <rhaidar@nvidia.com> Tested-by: Riham Haidar <rhaidar@nvidia.com>
2014-06-11mm: get_user_pages: use static inlineVandana Salve
__get_user_pages() is already exported function, so get_user_pages() can be easily inlined to the caller functions. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> bug 1517584 Change-Id: If700fa3c6ead133299fa99a702887584b76e5ffb Signed-off-by: Vandana Salve <vsalve@nvidia.com> Reviewed-on: http://git-master/r/421676 Reviewed-by: Riham Haidar <rhaidar@nvidia.com> Tested-by: Riham Haidar <rhaidar@nvidia.com>
2014-06-11mm: introduce migrate_replace_page() for migrating page to the given targetVandana Salve
introduce migrate_replace_page for migrating page to the given target Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> bug 1517584 Change-Id: I5f1d3bcb19ca7d9c9cf7234e8d3472a42c4f40af Signed-off-by: Vandana Salve <vsalve@nvidia.com> Reviewed-on: http://git-master/r/421675 Reviewed-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-06-10Merge commit 'refs/changes/82/419382/1' of ssh://git-master:12001/linux-3.10 ↵Mandar padmawar
into promotion_build Change-Id: I9418a05ad5c56b2e902249218bac2fa594d99f56
2014-06-05oom_kill: change oom_kill.c to use for_each_thread()Oleg Nesterov
Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Bug 200004307 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: "Ma, Xindong" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1da4db0cd5c8a31d4468ec906b413e75e604b465) Change-Id: Iecf2804ab94f17adf1bad6d0b2ddc560e66544e6 Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com> Reviewed-on: http://git-master/r/418860 Reviewed-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-06-05Merge branch 'linux-3.10.40' into rel-21Ishan Mittal
Bug 200004122 Conflicts: drivers/cpufreq/cpufreq.c drivers/regulator/core.c sound/soc/codecs/max98090.c Change-Id: I9418a05ad5c56b2e902249218bac2fa594d99f56 Signed-off-by: Ishan Mittal <imittal@nvidia.com>
2014-05-16mm: vmalloc: skip unneeded TLB flushes during lazy vfreeRich Wiley
currently, __purge_vmap_area_lazy will do a TLBI on the smallest single range that includes all purgable vmap_areas. This patch will do TLBIs per vmap_area, reducing the number of invalidated pages. Change-Id: I40b9c8c533e6814be076d454ef7ebdbe408c08e0 Signed-off-by: Rich Wiley <rwiley@nvidia.com> Reviewed-on: http://git-master/r/407748 Reviewed-by: Alexander Van Brunt <avanbrunt@nvidia.com>
2014-05-06mm: hugetlb: fix softlockup when a large number of hugepages are freed.Mizuma, Masayoshi
commit 55f67141a8927b2be3e51840da37b8a2320143ed upstream. When I decrease the value of nr_hugepage in procfs a lot, softlockup happens. It is because there is no chance of context switch during this process. On the other hand, when I allocate a large number of hugepages, there is some chance of context switch. Hence softlockup doesn't happen during this process. So it's necessary to add the context switch in the freeing process as same as allocating process to avoid softlockup. When I freed 12 TB hugapages with kernel-2.6.32-358.el6, the freeing process occupied a CPU over 150 seconds and following softlockup message appeared twice or more. $ echo 6000000 > /proc/sys/vm/nr_hugepages $ cat /proc/sys/vm/nr_hugepages 6000000 $ grep ^Huge /proc/meminfo HugePages_Total: 6000000 HugePages_Free: 6000000 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB $ echo 0 > /proc/sys/vm/nr_hugepages BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ... Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1 Call Trace: free_pool_huge_page+0xb8/0xd0 set_max_huge_pages+0x128/0x190 hugetlb_sysctl_handler_common+0x113/0x140 hugetlb_sysctl_handler+0x1e/0x20 proc_sys_call_handler+0x97/0xd0 proc_sys_write+0x14/0x20 vfs_write+0xb8/0x1a0 sys_write+0x51/0x90 __audit_syscall_exit+0x265/0x290 system_call_fastpath+0x16/0x1b I have not confirmed this problem with upstream kernels because I am not able to prepare the machine equipped with 12TB memory now. However I confirmed that the amount of decreasing hugepages was directly proportional to the amount of required time. I measured required times on a smaller machine. It showed 130-145 hugepages decreased in a millisecond. Amount of decreasing Required time Decreasing rate hugepages (msec) (pages/msec) ------------------------------------------------------------ 10,000 pages == 20GB 70 - 74 135-142 30,000 pages == 60GB 208 - 229 131-144 It means decrement of 6TB hugepages will trigger softlockup with the default threshold 20sec, in this decreasing rate. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06mm: try_to_unmap_cluster() should lock_page() before mlockingVlastimil Babka
commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream. A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing with trinity. The call site try_to_unmap_cluster() does not lock the pages other than its check_page parameter (which is already locked). The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat unclear, but apparently it serializes against page migration, which could otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the page would be eventually encountered again, but NR_MLOCK accounting would become distorted nevertheless. This patch adds a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that effect. The call site try_to_unmap_cluster() is fixed so that for page != check_page, trylock_page() is attempted (to avoid possible deadlocks as we already have check_page locked) and mlock_vma_page() is performed only upon success. If the page lock cannot be obtained, the page is left without PG_mlocked, which is again not a problem in the whole unevictable memory design. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26bdi: avoid oops on device removalJan Kara
commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream. After commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") when device is removed while we are writing to it we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing BDI_registered bit in bdi_unregister() and checking it before scheduling of any flushing work. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26backing_dev: fix hung task on syncDerek Basehore
commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream. bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to writeback dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a work_queue. This fixes the problem by using queue_delayed_work() instead. This is a regression caused by commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue"). The reason that this causes a problem is that laptop-mode will change the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. We fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in future. This function doesn't change the timer if it is already armed. For the same reason, we also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()] Signed-off-by: Derek Basehore <dbasehore@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zento.linux.org.uk> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Benson Leung <bleung@chromium.org> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Luigi Semenzato <semenzato@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03mm: close PageTail raceDavid Rientjes
commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream. Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-31mmap: arch_get_unmapped_area(): use proper mmap base for bottom up directionHeiko Carstens
This is more or less the generic variant of commit 41aacc1eea64 ("x86 get_unmapped_area: Access mmap_legacy_base through mm_struct member"). So effectively architectures which use an own arch_pick_mmap_layout() implementation but call the generic arch_get_unmapped_area() now can also randomize their mmap_base. All architectures which have an own arch_pick_mmap_layout() and call the generic arch_get_unmapped_area() (arm64, s390, tile) currently set mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic arch_pick_mmap_layout() function. So this change is a no-op currently. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Radu Caragea <sinaelgl@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug 1486057 Cherry-pick upstream patch 4e99b02131b280b064d30a5926ef1c4763f3097b Change-Id: I027ce31f1b9e87c2953bfa0a9d82288d92ba11fd Signed-off-by: Li Li <lli5@nvidia.com> Reviewed-on: http://git-master/r/387939 Reviewed-by: Eric Miao <emiao@nvidia.com> Reviewed-by: Peter Zu <pzu@nvidia.com> Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-03-25mm: mprotect: prevent unneeded TLB flushes from mprotect syscallRich Wiley
change_protection_range will only flush the tlb if change_pte_range reports that it has actually changed the permissions of at least one page. This patch prevents change_pte_range from counting pages that it doesn't actually modify. Change-Id: I53b9b2c7a635ba1395200b9ff70f6f40a053f987 Signed-off-by: Rich Wiley <rwiley@nvidia.com> Reviewed-on: http://git-master/r/385239 Reviewed-by: Alexander Van Brunt <avanbrunt@nvidia.com>
2014-03-23memcg: reparent charges of children before processing parentFilipe Brandenburger
commit 4fb1a86fb5e4209a7d4426d4e586c58e9edc74ac upstream. Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Further testing showed that an ordered workqueue for cgroup_destroy_wq is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched stage on the way can mess up the order before reaching the workqueue. Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on all its children (and grandchildren, in the correct order) to have their charges reparented first. [The version for 3.10.34 (or perhaps now 3.10.35) is this below. Yes, more differences, and the old mem_cgroup_reparent_charges line is intentionally left in for 3.10 whereas it was removed for 3.12+: that's because the css/cgroup iterator changed in between, it used not to supply the root of the subtree, but nowadays it does - Hugh] Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23mm/compaction: break out of loop on !PageBuddy in isolate_freepages_blockLaura Abbott
commit 2af120bc040c5ebcda156df6be6a66610ab6957f upstream. We received several reports of bad page state when freeing CMA pages previously allocated with alloc_contig_range: BUG: Bad page state in process Binder_A pfn:63202 page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf page flags: 0x40080068(uptodate|lru|active|swapbacked) Based on the page state, it looks like the page was still in use. The page flags do not make sense for the use case though. Further debugging showed that despite alloc_contig_range returning success, at least one page in the range still remained in the buddy allocator. There is an issue with isolate_freepages_block. In strict mode (which CMA uses), if any pages in the range cannot be isolated, isolate_freepages_block should return failure 0. The current check keeps track of the total number of isolated pages and compares against the size of the range: if (strict && nr_strict_required > total_isolated) total_isolated = 0; After taking the zone lock, if one of the pages in the range is not in the buddy allocator, we continue through the loop and do not increment total_isolated. If in the last iteration of the loop we isolate more than one page (e.g. last page needed is a higher order page), the check for total_isolated may pass and we fail to detect that a page was skipped. The fix is to bail out if the loop immediately if we are in strict mode. There's no benfit to continuing anyway since we need all pages to be isolated. Additionally, drop the error checking based on nr_strict_required and just check the pfn ranges. This matches with what isolate_freepages_range does. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-13Merge branch 'linux-3.10.33' into dev-kernel-3.10Deepak Nibade
Bug 1456092 Change-Id: I3021247ec68a3c2dddd9e98cde13d70a45191d53 Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
2014-03-06memcg: fix endless loop caused by mem_cgroup_iterMichal Hocko
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream. Hugh has reported an endless loop when the hardlimit reclaim sees the same group all the time. This might happen when the reclaim races with the memcg removal. shrink_zone [rmdir root] mem_cgroup_iter(root, NULL, reclaim) // prev = NULL rcu_read_lock() mem_cgroup_iter_load last_visited = iter->last_visited // gets root || NULL css_tryget(last_visited) // failed last_visited = NULL [1] memcg = root = __mem_cgroup_iter_next(root, NULL) mem_cgroup_iter_update iter->last_visited = root; reclaim->generation = iter->generation mem_cgroup_iter(root, root, reclaim) // prev = root rcu_read_lock mem_cgroup_iter_load last_visited = iter->last_visited // gets root css_tryget(last_visited) // failed [1] The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax memcg iter caching") which has replaced unconditional css_get/css_put by css_tryget/css_put for the cached iterator. This patch fixes the issue by skipping css_tryget on the root of the tree walk in mem_cgroup_iter_load and symmetrically doesn't release it in mem_cgroup_iter_update. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASEDNaoya Horiguchi
commit 8d547ff4ac5927245e0833ac18528f939da0ee0e upstream. mce-test detected a test failure when injecting error to a thp tail page. This is because we take page refcount of the tail page in madvise_hwpoison() while the fix in commit a3e0f9e47d5e ("mm/memory-failure.c: transfer page count from head page to tail page after split thp") assumes that we always take refcount on the head page. When a real memory error happens we take refcount on the head page where memory_failure() is called without MF_COUNT_INCREASED set, so it seems to me that testing memory error on thp tail page using madvise makes little sense. This patch cancels moving refcount in !MF_COUNT_INCREASED for valid testing. [akpm@linux-foundation.org: s/&&/&/] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm: fix process accidentally killed by mce because of huge page migrationXishi Qiu
Based on c8721bbbdd36382de51cd6b7a56322e0acca2414 upstream, but only the bugfix portion pulled out. Hi Naoya or Greg, We found a bug in 3.10.x. The problem is that we accidentally have a hwpoisoned hugepage in free hugepage list. It could happend in the the following scenario: process A process B migrate_huge_page put_page (old hugepage) linked to free hugepage list hugetlb_fault hugetlb_no_page alloc_huge_page dequeue_huge_page_vma dequeue_huge_page_node (steal hwpoisoned hugepage) set_page_hwpoison_huge_page dequeue_hwpoisoned_huge_page (fail to dequeue) I tested this bug, one process keeps allocating huge page, and I use sysfs interface to soft offline a huge page, then received: "MCE: Killing UCP:2717 due to hardware memory corruption fault at 8200034" Upstream kernel is free from this bug because of these two commits: f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35 mm/memory-failure.c: fix memory leak in successful soft offlining c8721bbbdd36382de51cd6b7a56322e0acca2414 mm: memory-hotplug: enable memory hotplug to handle hugepage The first one, although the problem is about memory leak, this patch moves unset_migratetype_isolate(), which is important to avoid the race. The latter is not a bug fix and it's too big, so I rewrite a small one. The following patch can fix this bug.(please apply f15bdfa802bf first) Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm/memory-failure.c: fix memory leak in successful soft offliningNaoya Horiguchi
commit f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35 upstream. After a successful page migration by soft offlining, the source page is not properly freed and it's never reusable even if we unpoison it afterward. This is caused by the race between freeing page and setting PG_hwpoison. In successful soft offlining, the source page is put (and the refcount becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked to pagevec and actual freeing back to buddy is delayed. So if PG_hwpoison is set for the page before freeing, the freeing does not functions as expected (in such case freeing aborts in free_pages_prepare() check.) This patch tries to make sure to free the source page before setting PG_hwpoison on it. To avoid reallocating, the page keeps MIGRATE_ISOLATE until after setting PG_hwpoison. This patch also removes obsolete comments about "keeping elevated refcount" because what they say is not true. Unlike memory_failure(), soft_offline_page() uses no special page isolation code, and the soft-offlined pages have no elevated. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of ↵KOSAKI Motohiro
spin_lock_irq() commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream. During aio stress test, we observed the following lockdep warning. This mean AIO+numa_balancing is currently deadlockable. The problem is, aio_migratepage disable interrupt, but __set_page_dirty_nobuffers unintentionally enable it again. Generally, all helper function should use spin_lock_irqsave() instead of spin_lock_irq() because they don't know caller at all. other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** dump_stack+0x19/0x1b print_usage_bug+0x1f7/0x208 mark_lock+0x21d/0x2a0 mark_held_locks+0xb9/0x140 trace_hardirqs_on_caller+0x105/0x1d0 trace_hardirqs_on+0xd/0x10 _raw_spin_unlock_irq+0x2c/0x50 __set_page_dirty_nobuffers+0x8c/0xf0 migrate_page_copy+0x434/0x540 aio_migratepage+0xb1/0x140 move_to_new_page+0x7d/0x230 migrate_pages+0x5e5/0x700 migrate_misplaced_page+0xbc/0xf0 do_numa_page+0x102/0x190 handle_pte_fault+0x241/0x970 handle_mm_fault+0x265/0x370 __do_page_fault+0x172/0x5a0 do_page_fault+0x1a/0x70 page_fault+0x28/0x30 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm, oom: base root bonus on current usageDavid Rientjes
commit 778c14affaf94a9e4953179d3e13a544ccce7707 upstream. A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13slub: Fix calculation of cpu slabsLi Zefan
commit 8afb1474db4701d1ab80cd8251137a3260e6913e upstream. /sys/kernel/slab/:t-0000048 # cat cpu_slabs 231 N0=16 N1=215 /sys/kernel/slab/:t-0000048 # cat slabs 145 N0=36 N1=109 See, the number of slabs is smaller than that of cpu slabs. The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700 ("slub: per cpu cache for partial pages"). We should use page->pages instead of page->pobjects when calculating the number of cpu partial slabs. This also fixes the mapping of slabs and nodes. As there's no variable storing the number of total/active objects in cpu partial slabs, and we don't have user interfaces requiring those statistics, I just add WARN_ON for those cases. Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/page-writeback.c: do not count anon pages as dirtyable memoryJohannes Weiner
commit a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14 upstream. The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we can not consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in an relatively large portion of the cache pages to be dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memoryJohannes Weiner
commit a804552b9a15c931cfc2a92a2e0aed1add8b580a upstream. Tejun reported stuttering and latency spikes on a system where random tasks would enter direct reclaim and get stuck on dirty pages. Around 50% of memory was occupied by tmpfs backed by an SSD, and another disk (rotating) was reading and writing at max speed to shrink a partition. : The problem was pretty ridiculous. It's a 8gig machine w/ one ssd and 10k : rpm harddrive and I could reliably reproduce constant stuttering every : several seconds for as long as buffered IO was going on on the hard drive : either with tmpfs occupying somewhere above 4gig or a test program which : allocates about the same amount of anon memory. Although swap usage was : zero, turning off swap also made the problem go away too. : : The trigger conditions seem quite plausible - high anon memory usage w/ : heavy buffered IO and swap configured - and it's highly likely that this : is happening in the wild too. (this can happen with copying large files : to usb sticks too, right?) This patch (of 2): The dirty_balance_reserve is an approximation of the fraction of free pages that the page allocator does not make available for page cache allocations. As a result, it has to be taken into account when calculating the amount of "dirtyable memory", the baseline to which dirty_background_ratio and dirty_ratio are applied. However, currently the reserve is subtracted from the sum of free and reclaimable pages, which is non-sensical and leads to erroneous results when the system is dominated by unreclaimable pages and the dirty_balance_reserve is bigger than free+reclaimable. In that case, at least the already allocated cache should be considered dirtyable. Fix the calculation by subtracting the reserve from the amount of free pages, then adding the reclaimable pages on top. [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/memory-failure.c: shift page lock from head page to tail page after thp splitNaoya Horiguchi
commit 54b9dd14d09f24927285359a227aa363ce46089e upstream. After thp split in hwpoison_user_mappings(), we hold page lock on the raw error page only between try_to_unmap, hence we are in danger of race condition. I found in the RHEL7 MCE-relay testing that we have "bad page" error when a memory error happens on a thp tail page used by qemu-kvm: Triggering MCE exception on CPU 10 mce: [Hardware Error]: Machine check events logged MCE exception done on CPU 10 MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption MCE 0x38c535: dirty LRU page recovery: Recovered qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000] BUG: Bad page state in process qemu-kvm pfn:38c400 page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00 page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked) Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ... CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1 Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011 Call Trace: dump_stack+0x19/0x1b bad_page.part.59+0xcf/0xe8 free_pages_prepare+0x148/0x160 free_hot_cold_page+0x31/0x140 free_hot_cold_page_list+0x46/0xa0 release_pages+0x1c1/0x200 free_pages_and_swap_cache+0xad/0xd0 tlb_flush_mmu.part.46+0x4c/0x90 tlb_finish_mmu+0x55/0x60 exit_mmap+0xcb/0x170 mmput+0x67/0xf0 vhost_dev_cleanup+0x231/0x260 [vhost_net] vhost_net_release+0x3f/0x90 [vhost_net] __fput+0xe9/0x270 ____fput+0xe/0x10 task_work_run+0xc4/0xe0 do_exit+0x2bb/0xa40 do_group_exit+0x3f/0xa0 get_signal_to_deliver+0x1d0/0x6e0 do_signal+0x48/0x5e0 do_notify_resume+0x71/0xc0 retint_signal+0x48/0x8c The reason of this bug is that a page fault happens before unlocking the head page at the end of memory_failure(). This strange page fault is trying to access to address 0x20 and I'm not sure why qemu-kvm does this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the way we catch the bad page bug/warning because we try to free a locked page (which was the former head page.) To fix this, this patch suggests to shift page lock from head page to tail page just after thp split. SIGSEGV still happens, but it affects only error affected VMs, not a whole system. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07misc: tegra-profiler: use mmap callsIgor Nabirushkin
Tegra Profiler: add mmap calls Bug 1447662 Change-Id: I96614ab3c320fd028cf861ea970b5199bdcae1c7 Signed-off-by: Igor Nabirushkin <inabirushkin@nvidia.com> Reviewed-on: http://git-master/r/360152 Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com> Tested-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-02-06mm/mempolicy.c: fix mempolicy printing in numa_mapsDavid Rientjes
commit 8790c71a18e5d2d93532ae250bcf5eddbba729cd upstream. As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference policy"), /proc/<pid>/numa_maps prints the mempolicy for any <pid> as "prefer:N" for the local node, N, of the process reading the file. This should only be printed when the mempolicy of <pid> is MPOL_PREFERRED for node N. If the process is actually only using the default mempolicy for local node allocation, make sure "default" is printed as expected. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Robert Lippert <rlippert@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06mm: hugetlbfs: fix hugetlbfs optimizationAndrea Arcangeli
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream. Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database caused by THP") can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field. Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases. The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption. A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too). By re-establishing the _mapcount tail refcounting for all compound pages, this also fixes the below problem: echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages BUG: Bad page state in process bash pfn:59a01 page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0 page flags: 0x1c00000000008000(tail) Modules linked in: CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x55/0x76 bad_page+0xd5/0x130 free_pages_prepare+0x213/0x280 __free_pages+0x36/0x80 update_and_free_page+0xc1/0xd0 free_pool_huge_page+0xc2/0xe0 set_max_huge_pages.part.58+0x14c/0x220 nr_hugepages_store_common.isra.60+0xd0/0xf0 nr_hugepages_store+0x13/0x20 kobj_attr_store+0xf/0x20 sysfs_write_file+0x189/0x1e0 vfs_write+0xc5/0x1f0 SyS_write+0x55/0xb0 system_call_fastpath+0x16/0x1b Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guillaume Morin <guillaume@morinfr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once onlyHugh Dickins
commit eecc1e426d681351a6026a7d3e7d225f38955b6c upstream. We see General Protection Fault on RSI in copy_page_rep: that RSI is what you get from a NULL struct page pointer. RIP: 0010:[<ffffffff81154955>] [<ffffffff81154955>] copy_page_rep+0x5/0x10 RSP: 0000:ffff880136e15c00 EFLAGS: 00010286 RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200 RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000 RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001 R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0 Call Trace: copy_user_huge_page+0x93/0xab do_huge_pmd_wp_page+0x710/0x815 handle_mm_fault+0x15d8/0x1d70 __do_page_fault+0x14d/0x840 do_page_fault+0x2f/0x90 page_fault+0x22/0x30 do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but since shrink_huge_zero_page() can free the huge_zero_page, and we have no hold of our own on it here (except where the fourth test holds page_table_lock and has checked pmd_same), it's possible for it to answer yes the first time, but no to the second or third test. Change all those last three to tests for NULL page. (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin in https://lkml.org/lkml/2013/3/29/103. I believe that one is due to the source page being split, and a tail page freed, while copy is in progress; and not a problem without DEBUG_PAGEALLOC, since the pmd_same check will prevent a miscopy from being made visible.) Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfullyJianguo Wu
commit a49ecbcd7b0d5a1cda7d60e03df402dd0ef76ac8 upstream. After a successful hugetlb page migration by soft offline, the source page will either be freed into hugepage_freelists or buddy(over-commit page). If page is in buddy, page_hstate(page) will be NULL. It will hit a NULL pointer dereference in dequeue_hwpoisoned_huge_page(). BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff81163761>] dequeue_hwpoisoned_huge_page+0x131/0x1d0 PGD c23762067 PUD c24be2067 PMD 0 Oops: 0000 [#1] SMP So check PageHuge(page) after call migrate_pages() successfully. [wujg: backport to 3.10: - adjust context] Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-22mm: Add TLB flush all thresholdTerje Bergstrom
With arm64 the range passed to flush_tlb_kernel_range() can be huge. Add a parameter for how large area will be flushed by area. Beyond that the whole TLB is flushed. Set the threshold to 512MB. Bug 1432908 Change-Id: I685fc2c3ffaad8979bfa8fde27e8c2948a2104cc Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com> Reviewed-on: http://git-master/r/351026 Reviewed-by: Juha Tukkinen <jtukkinen@nvidia.com>
2014-01-22mm: Fix printing of address in arm64Terje Bergstrom
Change-Id: I3a09d5dd6abc07ef79c64899053e1de3b4f6a95f Signed-off-by: Terje Bergstrom <tbergstrom@nvidia.com> Reviewed-on: http://git-master/r/351025 Reviewed-by: Juha Tukkinen <jtukkinen@nvidia.com>
2014-01-20mm: Enable Kernel Samepage MergingSri Krishna chowdary
Bug 1400829 Change-Id: I46956125eef8b5bf35e350f65ae5f27fd043f0f2 Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com> Reviewed-on: http://git-master/r/354085 Reviewed-by: Automatic_Commit_Validation_User GVS: Gerrit_Virtual_Submit Reviewed-by: Hiroshi Doyu <hdoyu@nvidia.com> Tested-by: Hiroshi Doyu <hdoyu@nvidia.com>
2014-01-09memcg: fix memcg_size() calculationVladimir Davydov
commit 695c60830764945cf61a2cc623eb1392d137223e upstream. The mem_cgroup structure contains nr_node_ids pointers to mem_cgroup_per_node objects, not the objects themselves. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/memory-failure.c: transfer page count from head page to tail page after ↵Naoya Horiguchi
split thp commit a3e0f9e47d5ef7858a26cc12d90ad5146e802d47 upstream. Memory failures on thp tail pages cause kernel panic like below: mce: [Hardware Error]: Machine check events logged MCE exception done on CPU 7 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0 PGD bae42067 PUD ba47d067 PMD 0 Oops: 0000 [#1] SMP ... CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25 ... Call Trace: me_huge_page+0x3e/0x50 memory_failure+0x4bb/0xc20 mce_process_work+0x3e/0x70 process_one_work+0x171/0x420 worker_thread+0x11b/0x3a0 ? manage_workers.isra.25+0x2b0/0x2b0 kthread+0xe4/0x100 ? kthread_create_on_node+0x190/0x190 ret_from_fork+0x7c/0xb0 ? kthread_create_on_node+0x190/0x190 ... RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0 CR2: 0000000000000058 The reasoning of this problem is shown below: - when we have a memory error on a thp tail page, the memory error handler grabs a refcount of the head page to keep the thp under us. - Before unmapping the error page from processes, we split the thp, where page refcounts of both of head/tail pages don't change. - Then we call try_to_unmap() over the error page (which was a tail page before). We didn't pin the error page to handle the memory error, this error page is freed and removed from LRU list. - We never have the error page on LRU list, so the first page state check returns "unknown page," then we move to the second check with the saved page flag. - The saved page flag have PG_tail set, so the second page state check returns "hugepage." - We call me_huge_page() for freed error page, then we hit the above panic. The root cause is that we didn't move refcount from the head page to the tail page after split thp. So this patch suggests to do this. This panic was introduced by commit 524fca1e73 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages"). Note that we did have the same refcount problem before this commit, but it was just ignored because we had only first page state check which returned "unknown page." The commit changed the refcount problem from "doesn't work" to "kernel panic." Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix use-after-free in sys_remap_file_pagesRik van Riel
commit 4eb919825e6c3c7fb3630d5621f6d11e98a18b3a upstream. remap_file_pages calls mmap_region, which may merge the VMA with other existing VMAs, and free "vma". This can lead to a use-after-free bug. Avoid the bug by remembering vm_flags before calling mmap_region, and not trying to dereference vma later. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Kees Cook <keescook@chromium.org> Cc: Michel Lespinasse <walken@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/hugetlb: check for pte NULL pointer in __page_check_address()Jianguo Wu
commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream. In __page_check_address(), if address's pud is not present, huge_pte_offset() will return NULL, we should check the return value. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: qiuxishi <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/compaction: respect ignore_skip_hint in update_pageblock_skipJoonsoo Kim
commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream. update_pageblock_skip() only fits to compaction which tries to isolate by pageblock unit. If isolate_migratepages_range() is called by CMA, it try to isolate regardless of pageblock unit and it don't reference get_pageblock_skip() by ignore_skip_hint. We should also respect it on update_pageblock_skip() to prevent from setting the wrong information. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix TLB flush race between migration, and change_protection_rangeRik van Riel
commit 20841405940e7be0617612d521e206e4b6b325db upstream. There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed. During that time, a CPU may continue writing to the page. This is fine most of the time, however compaction or the NUMA migration code may come in, and migrate the page away. When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process. This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC). The basic race looks like this: CPU A CPU B CPU C load TLB entry make entry PTE/PMD_NUMA fault on entry read/write old page start migrating page change PTE/PMD to new page read/write old page [*] flush TLB reload TLB from new entry read/write new page lose data [*] the old page may belong to a new user at this point! The obvious fix is to flush remote TLB entries, by making sure that pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm. This should fix both NUMA migration and compaction. [mgorman@suse.de: fix build] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: numa: avoid unnecessary work on the failure pathMel Gorman
commit eb4489f69f224356193364dc2762aa009738ca7f upstream. If a PMD changes during a THP migration then migration aborts but the failure path is doing more work than is necessary. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>