From 586a32ac1d33ce7a7548a27e4087e98842c3a06f Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:27 -0700 Subject: mm: munlock: remove unnecessary call to lru_add_drain() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In munlock_vma_range(), lru_add_drain() is currently called in a loop before each munlock_vma_page() call. This is suboptimal for performance when munlocking many pages. The benefits of per-cpu pagevec for batching the LRU putback are removed since the pagevec only holds at most one page from the previous loop's iteration. The lru_add_drain() call also does not serve any purposes for correctness - it does not even drain pagavecs of all cpu's. The munlock code already expects and handles situations where a page cannot be isolated from the LRU (e.g. because it is on some per-cpu pagevec). The history of the (not commented) call also suggest that it appears there as an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate munlock() treatment of THP pages") the call happened only once upon entering the function. The commit has moved the call into the while loope. So while the other changes in the commit improved munlock performance for THP pages, it introduced the abovementioned suboptimal per-cpu pagevec usage. Further in history, before commit 408e82b7 ("mm: munlock use follow_page"), munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range which performed both mlock and munlock depending on a flag. However, before ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function handled only mlock, not munlock. The lru_add_drain call thus comes from the implementation in commit b291f000 ("mlock: mlocked pages are unevictable" and was intended only for mlocking, not munlocking. The original intention of draining the LRU pagevec at mlock time was to ensure the pages were on the LRU before the lock operation so that they could be placed on the unevictable list immediately. There is very little motivation to do the same in the munlock path this, particularly for every single page. This patch therefore removes the call completely. After removing the call, a 10% speedup was measured for munlock() of a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka Reviewed-by: Jörn Engel Acked-by: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 1 - 1 file changed, 1 deletion(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index 79b7cf7d1bca..b85f1e827610 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -247,7 +247,6 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, &page_mask); if (page && !IS_ERR(page)) { lock_page(page); - lru_add_drain(); /* * Any THP page found by follow_page_mask() may have * gotten split before reaching munlock_vma_page(), -- cgit v1.2.3 From 7225522bb429a2f7dae6667e533e2d735b4882d0 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:29 -0700 Subject: mm: munlock: batch non-THP page isolation and munlock+putback using pagevec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop, which results in repeated taking and releasing of the lru_lock spinlock for isolating pages one by one. This patch batches the munlock operations using an on-stack pagevec, so that isolation is done under single lru_lock. For THP pages, the old behavior is preserved as they might be split while putting them into the pagevec. After this patch, a 9% speedup was measured for munlocking a 56GB large memory area with THP disabled. A new function __munlock_pagevec() is introduced that takes a pagevec and: 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats can be also updated using the variant which assumes disabled interrupts. 2) It finishes the munlock and lru putback on all pages under their lock_page. Note that previously, lock_page covered also the PageMlocked clearing and page isolation, but it is not needed for those operations. Signed-off-by: Vlastimil Babka Reviewed-by: Jörn Engel Acked-by: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 156 insertions(+), 40 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index b85f1e827610..b3b4a78b7802 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -18,6 +19,8 @@ #include #include #include +#include +#include #include "internal.h" @@ -87,6 +90,47 @@ void mlock_vma_page(struct page *page) } } +/* + * Finish munlock after successful page isolation + * + * Page must be locked. This is a wrapper for try_to_munlock() + * and putback_lru_page() with munlock accounting. + */ +static void __munlock_isolated_page(struct page *page) +{ + int ret = SWAP_AGAIN; + + /* + * Optimization: if the page was mapped just once, that's our mapping + * and we don't need to check all the other vmas. + */ + if (page_mapcount(page) > 1) + ret = try_to_munlock(page); + + /* Did try_to_unlock() succeed or punt? */ + if (ret != SWAP_MLOCK) + count_vm_event(UNEVICTABLE_PGMUNLOCKED); + + putback_lru_page(page); +} + +/* + * Accounting for page isolation fail during munlock + * + * Performs accounting when page isolation fails in munlock. There is nothing + * else to do because it means some other task has already removed the page + * from the LRU. putback_lru_page() will take care of removing the page from + * the unevictable list, if necessary. vmscan [page_referenced()] will move + * the page back to the unevictable list if some other vma has it mlocked. + */ +static void __munlock_isolation_failed(struct page *page) +{ + if (PageUnevictable(page)) + count_vm_event(UNEVICTABLE_PGSTRANDED); + else + count_vm_event(UNEVICTABLE_PGMUNLOCKED); +} + /** * munlock_vma_page - munlock a vma page * @page - page to be unlocked @@ -112,37 +156,10 @@ unsigned int munlock_vma_page(struct page *page) unsigned int nr_pages = hpage_nr_pages(page); mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); page_mask = nr_pages - 1; - if (!isolate_lru_page(page)) { - int ret = SWAP_AGAIN; - - /* - * Optimization: if the page was mapped just once, - * that's our mapping and we don't need to check all the - * other vmas. - */ - if (page_mapcount(page) > 1) - ret = try_to_munlock(page); - /* - * did try_to_unlock() succeed or punt? - */ - if (ret != SWAP_MLOCK) - count_vm_event(UNEVICTABLE_PGMUNLOCKED); - - putback_lru_page(page); - } else { - /* - * Some other task has removed the page from the LRU. - * putback_lru_page() will take care of removing the - * page from the unevictable list, if necessary. - * vmscan [page_referenced()] will move the page back - * to the unevictable list if some other vma has it - * mlocked. - */ - if (PageUnevictable(page)) - count_vm_event(UNEVICTABLE_PGSTRANDED); - else - count_vm_event(UNEVICTABLE_PGMUNLOCKED); - } + if (!isolate_lru_page(page)) + __munlock_isolated_page(page); + else + __munlock_isolation_failed(page); } return page_mask; @@ -209,6 +226,73 @@ static int __mlock_posix_error_return(long retval) return retval; } +/* + * Munlock a batch of pages from the same zone + * + * The work is split to two main phases. First phase clears the Mlocked flag + * and attempts to isolate the pages, all under a single zone lru lock. + * The second phase finishes the munlock only for pages where isolation + * succeeded. + * + * Note that pvec is modified during the process. Before returning + * pagevec_reinit() is called on it. + */ +static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) +{ + int i; + int nr = pagevec_count(pvec); + + /* Phase 1: page isolation */ + spin_lock_irq(&zone->lru_lock); + for (i = 0; i < nr; i++) { + struct page *page = pvec->pages[i]; + + if (TestClearPageMlocked(page)) { + struct lruvec *lruvec; + int lru; + + /* we have disabled interrupts */ + __mod_zone_page_state(zone, NR_MLOCK, -1); + + if (PageLRU(page)) { + lruvec = mem_cgroup_page_lruvec(page, zone); + lru = page_lru(page); + + get_page(page); + ClearPageLRU(page); + del_page_from_lru_list(page, lruvec, lru); + } else { + __munlock_isolation_failed(page); + goto skip_munlock; + } + + } else { +skip_munlock: + /* + * We won't be munlocking this page in the next phase + * but we still need to release the follow_page_mask() + * pin. + */ + pvec->pages[i] = NULL; + put_page(page); + } + } + spin_unlock_irq(&zone->lru_lock); + + /* Phase 2: page munlock and putback */ + for (i = 0; i < nr; i++) { + struct page *page = pvec->pages[i]; + + if (page) { + lock_page(page); + __munlock_isolated_page(page); + unlock_page(page); + put_page(page); /* pin from follow_page_mask() */ + } + } + pagevec_reinit(pvec); +} + /* * munlock_vma_pages_range() - munlock all pages in the vma range.' * @vma - vma containing range to be munlock()ed. @@ -230,11 +314,16 @@ static int __mlock_posix_error_return(long retval) void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { + struct pagevec pvec; + struct zone *zone = NULL; + + pagevec_init(&pvec, 0); vma->vm_flags &= ~VM_LOCKED; while (start < end) { struct page *page; unsigned int page_mask, page_increm; + struct zone *pagezone; /* * Although FOLL_DUMP is intended for get_dump_page(), @@ -246,20 +335,47 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP, &page_mask); if (page && !IS_ERR(page)) { - lock_page(page); - /* - * Any THP page found by follow_page_mask() may have - * gotten split before reaching munlock_vma_page(), - * so we need to recompute the page_mask here. - */ - page_mask = munlock_vma_page(page); - unlock_page(page); - put_page(page); + pagezone = page_zone(page); + /* The whole pagevec must be in the same zone */ + if (pagezone != zone) { + if (pagevec_count(&pvec)) + __munlock_pagevec(&pvec, zone); + zone = pagezone; + } + if (PageTransHuge(page)) { + /* + * THP pages are not handled by pagevec due + * to their possible split (see below). + */ + if (pagevec_count(&pvec)) + __munlock_pagevec(&pvec, zone); + lock_page(page); + /* + * Any THP page found by follow_page_mask() may + * have gotten split before reaching + * munlock_vma_page(), so we need to recompute + * the page_mask here. + */ + page_mask = munlock_vma_page(page); + unlock_page(page); + put_page(page); /* follow_page_mask() */ + } else { + /* + * Non-huge pages are handled in batches + * via pagevec. The pin from + * follow_page_mask() prevents them from + * collapsing by THP. + */ + if (pagevec_add(&pvec, page) == 0) + __munlock_pagevec(&pvec, zone); + } } page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); start += page_increm * PAGE_SIZE; cond_resched(); } + if (pagevec_count(&pvec)) + __munlock_pagevec(&pvec, zone); } /* -- cgit v1.2.3 From 1ebb7cc6a58321a4b22c4c9097b4651b0ab859d0 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:30 -0700 Subject: mm: munlock: batch NR_MLOCK zone state updates MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Depending on previous batch which introduced batched isolation in munlock_vma_range(), we can batch also the updates of NR_MLOCK page stats. After the whole pagevec is processed for page isolation, the stats are updated only once with the number of successful isolations. There were however no measurable perfomance gains. Signed-off-by: Vlastimil Babka Reviewed-by: Jörn Engel Acked-by: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index b3b4a78b7802..b1a7c8007c89 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -241,6 +241,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) { int i; int nr = pagevec_count(pvec); + int delta_munlocked = -nr; /* Phase 1: page isolation */ spin_lock_irq(&zone->lru_lock); @@ -251,9 +252,6 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) struct lruvec *lruvec; int lru; - /* we have disabled interrupts */ - __mod_zone_page_state(zone, NR_MLOCK, -1); - if (PageLRU(page)) { lruvec = mem_cgroup_page_lruvec(page, zone); lru = page_lru(page); @@ -275,8 +273,10 @@ skip_munlock: */ pvec->pages[i] = NULL; put_page(page); + delta_munlocked++; } } + __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); spin_unlock_irq(&zone->lru_lock); /* Phase 2: page munlock and putback */ -- cgit v1.2.3 From 56afe477df3cbbcd656682d0355ef7d9eb8bdd81 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:32 -0700 Subject: mm: munlock: bypass per-cpu pvec for putback_lru_page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit After introducing batching by pagevecs into munlock_vma_range(), we can further improve performance by bypassing the copying into per-cpu pagevec and the get_page/put_page pair associated with that. Instead we perform LRU putback directly from our pagevec. However, this is possible only for single-mapped pages that are evictable after munlock. Unevictable pages require rechecking after putting on the unevictable list, so for those we fallback to putback_lru_page(), hich handles that. After this patch, a 13% speedup was measured for munlocking a 56GB large memory area with THP disabled. [akpm@linux-foundation.org:clarify comment] Signed-off-by: Vlastimil Babka Reviewed-by: Jörn Engel Acked-by: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 69 insertions(+), 4 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index b1a7c8007c89..abdc612b042d 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -226,6 +226,52 @@ static int __mlock_posix_error_return(long retval) return retval; } +/* + * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec() + * + * The fast path is available only for evictable pages with single mapping. + * Then we can bypass the per-cpu pvec and get better performance. + * when mapcount > 1 we need try_to_munlock() which can fail. + * when !page_evictable(), we need the full redo logic of putback_lru_page to + * avoid leaving evictable page in unevictable list. + * + * In case of success, @page is added to @pvec and @pgrescued is incremented + * in case that the page was previously unevictable. @page is also unlocked. + */ +static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec, + int *pgrescued) +{ + VM_BUG_ON(PageLRU(page)); + VM_BUG_ON(!PageLocked(page)); + + if (page_mapcount(page) <= 1 && page_evictable(page)) { + pagevec_add(pvec, page); + if (TestClearPageUnevictable(page)) + (*pgrescued)++; + unlock_page(page); + return true; + } + + return false; +} + +/* + * Putback multiple evictable pages to the LRU + * + * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of + * the pages might have meanwhile become unevictable but that is OK. + */ +static void __putback_lru_fast(struct pagevec *pvec, int pgrescued) +{ + count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec)); + /* + *__pagevec_lru_add() calls release_pages() so we don't call + * put_page() explicitly + */ + __pagevec_lru_add(pvec); + count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); +} + /* * Munlock a batch of pages from the same zone * @@ -242,6 +288,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) int i; int nr = pagevec_count(pvec); int delta_munlocked = -nr; + struct pagevec pvec_putback; + int pgrescued = 0; /* Phase 1: page isolation */ spin_lock_irq(&zone->lru_lock); @@ -279,17 +327,34 @@ skip_munlock: __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); spin_unlock_irq(&zone->lru_lock); - /* Phase 2: page munlock and putback */ + /* Phase 2: page munlock */ + pagevec_init(&pvec_putback, 0); for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; if (page) { lock_page(page); - __munlock_isolated_page(page); - unlock_page(page); - put_page(page); /* pin from follow_page_mask() */ + if (!__putback_lru_fast_prepare(page, &pvec_putback, + &pgrescued)) { + /* Slow path */ + __munlock_isolated_page(page); + unlock_page(page); + } } } + + /* Phase 3: page putback for pages that qualified for the fast path */ + if (pagevec_count(&pvec_putback)) + __putback_lru_fast(&pvec_putback, pgrescued); + + /* Phase 4: put_page to return pin from follow_page_mask() */ + for (i = 0; i < nr; i++) { + struct page *page = pvec->pages[i]; + + if (page) + put_page(page); + } + pagevec_reinit(pvec); } -- cgit v1.2.3 From 5b40998ae35cf64561868370e6c9f3d3e94b6bf7 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:33 -0700 Subject: mm: munlock: remove redundant get_page/put_page pair on the fast path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The performance of the fast path in munlock_vma_range() can be further improved by avoiding atomic ops of a redundant get_page()/put_page() pair. When calling get_page() during page isolation, we already have the pin from follow_page_mask(). This pin will be then returned by __pagevec_lru_add(), after which we do not reference the pages anymore. After this patch, an 8% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka Reviewed-by: Jörn Engel Acked-by: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index abdc612b042d..19a934dce5d6 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -303,8 +303,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) if (PageLRU(page)) { lruvec = mem_cgroup_page_lruvec(page, zone); lru = page_lru(page); - - get_page(page); + /* + * We already have pin from follow_page_mask() + * so we can spare the get_page() here. + */ ClearPageLRU(page); del_page_from_lru_list(page, lruvec, lru); } else { @@ -336,25 +338,25 @@ skip_munlock: lock_page(page); if (!__putback_lru_fast_prepare(page, &pvec_putback, &pgrescued)) { - /* Slow path */ + /* + * Slow path. We don't want to lose the last + * pin before unlock_page() + */ + get_page(page); /* for putback_lru_page() */ __munlock_isolated_page(page); unlock_page(page); + put_page(page); /* from follow_page_mask() */ } } } - /* Phase 3: page putback for pages that qualified for the fast path */ + /* + * Phase 3: page putback for pages that qualified for the fast path + * This will also call put_page() to return pin from follow_page_mask() + */ if (pagevec_count(&pvec_putback)) __putback_lru_fast(&pvec_putback, pgrescued); - /* Phase 4: put_page to return pin from follow_page_mask() */ - for (i = 0; i < nr; i++) { - struct page *page = pvec->pages[i]; - - if (page) - put_page(page); - } - pagevec_reinit(pvec); } -- cgit v1.2.3 From 7a8010cd36273ff5f6fea5201ef9232f30cebbd9 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Wed, 11 Sep 2013 14:22:35 -0700 Subject: mm: munlock: manual pte walk in fast path instead of follow_page_mask() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka Cc: Jörn Engel Cc: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 110 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 79 insertions(+), 31 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index 19a934dce5d6..d63802663242 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -280,8 +280,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued) * The second phase finishes the munlock only for pages where isolation * succeeded. * - * Note that pvec is modified during the process. Before returning - * pagevec_reinit() is called on it. + * Note that the pagevec may be modified during the process. */ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) { @@ -356,8 +355,60 @@ skip_munlock: */ if (pagevec_count(&pvec_putback)) __putback_lru_fast(&pvec_putback, pgrescued); +} + +/* + * Fill up pagevec for __munlock_pagevec using pte walk + * + * The function expects that the struct page corresponding to @start address is + * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone. + * + * The rest of @pvec is filled by subsequent pages within the same pmd and same + * zone, as long as the pte's are present and vm_normal_page() succeeds. These + * pages also get pinned. + * + * Returns the address of the next page that should be scanned. This equals + * @start + PAGE_SIZE when no page could be added by the pte walk. + */ +static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, + struct vm_area_struct *vma, int zoneid, unsigned long start, + unsigned long end) +{ + pte_t *pte; + spinlock_t *ptl; + + /* + * Initialize pte walk starting at the already pinned page where we + * are sure that there is a pte. + */ + pte = get_locked_pte(vma->vm_mm, start, &ptl); + end = min(end, pmd_addr_end(start, end)); + + /* The page next to the pinned page is the first we will try to get */ + start += PAGE_SIZE; + while (start < end) { + struct page *page = NULL; + pte++; + if (pte_present(*pte)) + page = vm_normal_page(vma, start, *pte); + /* + * Break if page could not be obtained or the page's node+zone does not + * match + */ + if (!page || page_zone_id(page) != zoneid) + break; - pagevec_reinit(pvec); + get_page(page); + /* + * Increase the address that will be returned *before* the + * eventual break due to pvec becoming full by adding the page + */ + start += PAGE_SIZE; + if (pagevec_add(pvec, page) == 0) + break; + } + pte_unmap_unlock(pte, ptl); + return start; } /* @@ -381,17 +432,16 @@ skip_munlock: void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - struct pagevec pvec; - struct zone *zone = NULL; - - pagevec_init(&pvec, 0); vma->vm_flags &= ~VM_LOCKED; while (start < end) { - struct page *page; + struct page *page = NULL; unsigned int page_mask, page_increm; - struct zone *pagezone; + struct pagevec pvec; + struct zone *zone; + int zoneid; + pagevec_init(&pvec, 0); /* * Although FOLL_DUMP is intended for get_dump_page(), * it just so happens that its special treatment of the @@ -400,22 +450,10 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, * has sneaked into the range, we won't oops here: great). */ page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP, - &page_mask); + &page_mask); + if (page && !IS_ERR(page)) { - pagezone = page_zone(page); - /* The whole pagevec must be in the same zone */ - if (pagezone != zone) { - if (pagevec_count(&pvec)) - __munlock_pagevec(&pvec, zone); - zone = pagezone; - } if (PageTransHuge(page)) { - /* - * THP pages are not handled by pagevec due - * to their possible split (see below). - */ - if (pagevec_count(&pvec)) - __munlock_pagevec(&pvec, zone); lock_page(page); /* * Any THP page found by follow_page_mask() may @@ -428,21 +466,31 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, put_page(page); /* follow_page_mask() */ } else { /* - * Non-huge pages are handled in batches - * via pagevec. The pin from - * follow_page_mask() prevents them from - * collapsing by THP. + * Non-huge pages are handled in batches via + * pagevec. The pin from follow_page_mask() + * prevents them from collapsing by THP. + */ + pagevec_add(&pvec, page); + zone = page_zone(page); + zoneid = page_zone_id(page); + + /* + * Try to fill the rest of pagevec using fast + * pte walk. This will also update start to + * the next page to process. Then munlock the + * pagevec. */ - if (pagevec_add(&pvec, page) == 0) - __munlock_pagevec(&pvec, zone); + start = __munlock_pagevec_fill(&pvec, vma, + zoneid, start, end); + __munlock_pagevec(&pvec, zone); + goto next; } } page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); start += page_increm * PAGE_SIZE; +next: cond_resched(); } - if (pagevec_count(&pvec)) - __munlock_pagevec(&pvec, zone); } /* -- cgit v1.2.3 From 22356f447ceb8d97a4885792e7d9e4607f712e1b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 24 Sep 2013 18:29:11 -0700 Subject: mm: Place preemption point in do_mlockall() loop There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel. Dave Jones reports: "My fuzz tester keeps hitting this. Every instance shows the non-irq stack came in from mlockall. I'm only seeing this on one box, but that has more ram (8gb) than my other machines, which might explain it. INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) sending NMI to all CPUs: NMI backtrace for cpu 3 CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32 Call Trace: lru_add_drain_all+0x15/0x20 SyS_mlockall+0xa5/0x1a0 tracesys+0xdd/0xe2" This commit addresses this problem by inserting the required preemption point. Reported-by: Dave Jones Signed-off-by: Paul E. McKenney Cc: KOSAKI Motohiro Cc: Michel Lespinasse Cc: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index d63802663242..67ba6da7d0e3 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -736,6 +736,7 @@ static int do_mlockall(int flags) /* Ignore errors */ mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags); + cond_resched(); } out: return 0; -- cgit v1.2.3 From eadb41ae82f802105c0601aa8a0a0e7595826497 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Mon, 30 Sep 2013 13:45:18 -0700 Subject: mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627 ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") uses pmd_addr_end() for restricting its operation within current page table. This is insufficient on architectures/configurations where pmd is folded and pmd_addr_end() just returns the end of the full range to be walked. In this case, it allows pte++ to walk off the end of a page table resulting in unpredictable behaviour. This patch fixes the function by using pgd_addr_end() and pud_addr_end() before pmd_addr_end(), which will yield correct page table boundary on all configurations. This is similar to what existing page walkers do when walking each level of the page table. Additionaly, the patch clarifies a comment for get_locked_pte() call in the function. Signed-off-by: Vlastimil Babka Reported-by: Fengguang Wu Reviewed-by: Bob Liu Cc: Jörn Engel Cc: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mlock.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'mm/mlock.c') diff --git a/mm/mlock.c b/mm/mlock.c index 67ba6da7d0e3..d480cd6fc475 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, /* * Initialize pte walk starting at the already pinned page where we - * are sure that there is a pte. + * are sure that there is a pte, as it was pinned under the same + * mmap_sem write op. */ pte = get_locked_pte(vma->vm_mm, start, &ptl); - end = min(end, pmd_addr_end(start, end)); + /* Make sure we do not cross the page table boundary */ + end = pgd_addr_end(start, end); + end = pud_addr_end(start, end); + end = pmd_addr_end(start, end); /* The page next to the pinned page is the first we will try to get */ start += PAGE_SIZE; -- cgit v1.2.3