From 9824cf9753ecbe8f5b47aa9b2f218207defea211 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 11 Sep 2013 14:20:23 -0700 Subject: mm: vmstats: tlb flush counters I was investigating some TLB flush scaling issues and realized that we do not have any good methods for figuring out how many TLB flushes we are doing. It would be nice to be able to do these in generic code, but the arch-independent calls don't explicitly specify whether we actually need to do remote flushes or not. In the end, we really need to know if we actually _did_ global vs. local invalidations, so that leaves us with few options other than to muck with the counters from arch-specific code. Signed-off-by: Dave Hansen Cc: Peter Zijlstra Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index 20c2ef4458fa..00382c53f582 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -817,6 +817,11 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc", "thp_zero_page_alloc_failed", #endif + "nr_tlb_remote_flush", + "nr_tlb_remote_flush_received", + "nr_tlb_local_flush_all", + "nr_tlb_local_flush_one", + "nr_tlb_local_flush_one_kernel", #endif /* CONFIG_VM_EVENTS_COUNTERS */ }; -- cgit v1.2.3 From 6df46865ff8715932e7d42e52cac17e8461758cb Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 11 Sep 2013 14:20:24 -0700 Subject: mm: vmstats: track TLB flush stats on UP too The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Dave Hansen Cc: Peter Zijlstra Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index 00382c53f582..ca06e9653827 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -817,11 +817,12 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc", "thp_zero_page_alloc_failed", #endif +#ifdef CONFIG_SMP "nr_tlb_remote_flush", "nr_tlb_remote_flush_received", +#endif "nr_tlb_local_flush_all", "nr_tlb_local_flush_one", - "nr_tlb_local_flush_one_kernel", #endif /* CONFIG_VM_EVENTS_COUNTERS */ }; -- cgit v1.2.3 From 81c0a2bb515fd4daae8cab64352877480792b515 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 11 Sep 2013 14:20:47 -0700 Subject: mm: page_alloc: fair zone allocator policy Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. 
The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, and even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able to stay in memory long enough to get referenced twice and activated, but activation happens in spades: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 0 nr_active_file 8 nr_inactive_file 1582 nr_active_file 11994 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 70 nr_inactive_file 258753 nr_active_file 443214 nr_inactive_file 149793 nr_active_file 12021 Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim are now fairly spread out to all available/allowable zones: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 174 nr_active_file 4865 nr_inactive_file 53 nr_active_file 860 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 666622 nr_active_file 4988 nr_inactive_file 190969 nr_active_file 937 When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner Acked-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Paul Bolle Cc: Zlatko Calusic Tested-by: Kevin Hilman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index ca06e9653827..8a8da1f9b044 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -703,6 +703,7 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat, const char * const vmstat_text[] = { /* Zoned VM counters */ "nr_free_pages", + "nr_alloc_batch", "nr_inactive_anon", "nr_active_anon", "nr_inactive_file", -- cgit v1.2.3 From 2bb921e526656556e68f99f5f15a4a1bf2691844 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Wed, 11 Sep 2013 14:21:30 -0700 Subject: vmstat: create separate function to fold per cpu diffs into local counters The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics.
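For reference, the mechanism being refined here is the per-cpu delta scheme in mm/vmstat.c: each CPU accumulates small deltas in its vm_stat_diff[] arrays and those deltas are periodically folded into the per-zone and global counters. The snippet below is a minimal user-space sketch of that folding pattern, not kernel code -- the fixed-size arrays, the mod_stat()/fold_cpu_stats() helpers and the C11 atomics are stand-ins for the kernel's per-cpu data, the fold functions touched by these patches, and this_cpu_xchg()/atomic_long_add(); it only illustrates the shape of the cheap-update / periodic-fold split.

/*
 * Minimal user-space analogue of the vmstat folding pattern (illustration
 * only). The hot path bumps a cheap per-"cpu" delta; a folder later
 * exchanges each delta with zero and applies the accumulated sum to the
 * global counters in one final pass.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_STAT_ITEMS 4
#define NR_CPUS 4

static _Atomic long vm_stat_diff[NR_CPUS][NR_STAT_ITEMS]; /* per-cpu deltas */
static _Atomic long vm_stat[NR_STAT_ITEMS];               /* global counters */

/* hot path: adjust a counter by updating only the local delta */
static void mod_stat(int cpu, int item, long delta)
{
        atomic_fetch_add_explicit(&vm_stat_diff[cpu][item], delta,
                                  memory_order_relaxed);
}

/* fold one cpu's deltas into the global counters */
static void fold_cpu_stats(int cpu)
{
        long global_diff[NR_STAT_ITEMS] = { 0 };
        int i;

        for (i = 0; i < NR_STAT_ITEMS; i++) {
                /* fetch-and-zero in one step, like this_cpu_xchg(..., 0) */
                long v = atomic_exchange_explicit(&vm_stat_diff[cpu][i], 0,
                                                  memory_order_relaxed);
                global_diff[i] += v;
        }

        /* one pass over the global array at the end (the fold_diff() helper
         * created later in this series factors out exactly this step) */
        for (i = 0; i < NR_STAT_ITEMS; i++)
                if (global_diff[i])
                        atomic_fetch_add_explicit(&vm_stat[i], global_diff[i],
                                                  memory_order_relaxed);
}

int main(void)
{
        int cpu;

        mod_stat(0, 1, 3);
        mod_stat(1, 1, -1);
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                fold_cpu_stats(cpu);
        printf("item 1 = %ld\n", atomic_load(&vm_stat[1])); /* prints 2 */
        return 0;
}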
This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter Cc: KOSAKI Motohiro CC: Tejun Heo Cc: Joonsoo Kim Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 40 ++++++++++++++++++++++++++++++++++------ 1 file changed, 34 insertions(+), 6 deletions(-) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index 8a8da1f9b044..aaee66330e01 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -415,11 +415,7 @@ EXPORT_SYMBOL(dec_zone_page_state); #endif /* - * Update the zone counters for one cpu. - * - * The cpu specified must be either the current cpu or a processor that - * is not online. If it is the current cpu then the execution thread must - * be pinned to the current cpu. + * Update the zone counters for the current cpu. * * Note that refresh_cpu_vm_stats strives to only access * node local memory. The per cpu pagesets on remote zones are placed @@ -432,7 +428,7 @@ EXPORT_SYMBOL(dec_zone_page_state); * with the global counters. These could cause remote node cache line * bouncing and will have to be only done when necessary. */ -void refresh_cpu_vm_stats(int cpu) +static void refresh_cpu_vm_stats(int cpu) { struct zone *zone; int i; @@ -493,6 +489,38 @@ void refresh_cpu_vm_stats(int cpu) atomic_long_add(global_diff[i], &vm_stat[i]); } +/* + * Fold the data for an offline cpu into the global array. + * There cannot be any access by the offline cpu and therefore + * synchronization is simplified. + */ +void cpu_vm_stats_fold(int cpu) +{ + struct zone *zone; + int i; + int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; + + for_each_populated_zone(zone) { + struct per_cpu_pageset *p; + + p = per_cpu_ptr(zone->pageset, cpu); + + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) + if (p->vm_stat_diff[i]) { + int v; + + v = p->vm_stat_diff[i]; + p->vm_stat_diff[i] = 0; + atomic_long_add(v, &zone->vm_stat[i]); + global_diff[i] += v; + } + } + + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) + if (global_diff[i]) + atomic_long_add(global_diff[i], &vm_stat[i]); +} + /* * this is only called if !populated_zone(zone), which implies no other users of * pset->vm_stat_diff[] exsist. -- cgit v1.2.3 From 4edb0748b23887140578d68f5f4e6e2de337a481 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Wed, 11 Sep 2013 14:21:31 -0700 Subject: vmstat: create fold_diff Both functions that update global counters use the same mechanism. Create a function that contains the common code. 
Signed-off-by: Christoph Lameter Cc: KOSAKI Motohiro CC: Tejun Heo Cc: Joonsoo Kim Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index aaee66330e01..158ca6494bc6 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -414,6 +414,15 @@ void dec_zone_page_state(struct page *page, enum zone_stat_item item) EXPORT_SYMBOL(dec_zone_page_state); #endif +static inline void fold_diff(int *diff) +{ + int i; + + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) + if (diff[i]) + atomic_long_add(diff[i], &vm_stat[i]); +} + /* * Update the zone counters for the current cpu. * @@ -483,10 +492,7 @@ static void refresh_cpu_vm_stats(int cpu) drain_zone_pages(zone, &p->pcp); #endif } - - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) - if (global_diff[i]) - atomic_long_add(global_diff[i], &vm_stat[i]); + fold_diff(global_diff); } /* @@ -516,9 +522,7 @@ void cpu_vm_stats_fold(int cpu) } } - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) - if (global_diff[i]) - atomic_long_add(global_diff[i], &vm_stat[i]); + fold_diff(global_diff); } /* -- cgit v1.2.3 From fbc2edb05354480a88aa39db8a6acb5782fa1a1b Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Wed, 11 Sep 2013 14:21:32 -0700 Subject: vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats Disabling interrupts repeatedly can be avoided in the inner loop if we use a this_cpu operation. Signed-off-by: Christoph Lameter Cc: KOSAKI Motohiro CC: Tejun Heo Cc: Joonsoo Kim Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 35 ++++++++++++++++------------------- 1 file changed, 16 insertions(+), 19 deletions(-) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index 158ca6494bc6..d57a09143bf9 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -437,33 +437,29 @@ static inline void fold_diff(int *diff) * with the global counters. These could cause remote node cache line * bouncing and will have to be only done when necessary. */ -static void refresh_cpu_vm_stats(int cpu) +static void refresh_cpu_vm_stats(void) { struct zone *zone; int i; int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; for_each_populated_zone(zone) { - struct per_cpu_pageset *p; + struct per_cpu_pageset __percpu *p = zone->pageset; - p = per_cpu_ptr(zone->pageset, cpu); + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { + int v; - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) - if (p->vm_stat_diff[i]) { - unsigned long flags; - int v; + v = this_cpu_xchg(p->vm_stat_diff[i], 0); + if (v) { - local_irq_save(flags); - v = p->vm_stat_diff[i]; - p->vm_stat_diff[i] = 0; - local_irq_restore(flags); atomic_long_add(v, &zone->vm_stat[i]); global_diff[i] += v; #ifdef CONFIG_NUMA /* 3 seconds idle till flush */ - p->expire = 3; + __this_cpu_write(p->expire, 3); #endif } + } cond_resched(); #ifdef CONFIG_NUMA /* @@ -473,23 +469,24 @@ static void refresh_cpu_vm_stats(int cpu) * Check if there are pages remaining in this pageset * if not then there is nothing to expire. */ - if (!p->expire || !p->pcp.count) + if (!__this_cpu_read(p->expire) || + !__this_cpu_read(p->pcp.count)) continue; /* * We never drain zones local to this processor. 
*/ if (zone_to_nid(zone) == numa_node_id()) { - p->expire = 0; + __this_cpu_write(p->expire, 0); continue; } - p->expire--; - if (p->expire) + + if (__this_cpu_dec_return(p->expire)) continue; - if (p->pcp.count) - drain_zone_pages(zone, &p->pcp); + if (__this_cpu_read(p->pcp.count)) + drain_zone_pages(zone, __this_cpu_ptr(&p->pcp)); #endif } fold_diff(global_diff); @@ -1216,7 +1213,7 @@ int sysctl_stat_interval __read_mostly = HZ; static void vmstat_update(struct work_struct *w) { - refresh_cpu_vm_stats(smp_processor_id()); + refresh_cpu_vm_stats(); schedule_delayed_work(&__get_cpu_var(vmstat_work), round_jiffies_relative(sysctl_stat_interval)); } -- cgit v1.2.3 From 6e543d5780e36ff5ee56c44d7e2e30db3457a7ed Mon Sep 17 00:00:00 2001 From: Lisa Du Date: Wed, 11 Sep 2013 14:22:36 -0700 Subject: mm: vmscan: fix do_try_to_free_pages() livelock This patch is based on KOSAKI's work; I added a little more description. Please refer to https://lkml.org/lkml/2012/6/14/74. I found that the system can enter a state where a zone has lots of free pages but only order-0 and order-1 pages, which means the zone is heavily fragmented; a high-order allocation can then cause a long stall (e.g. 60 seconds) in the direct reclaim path, especially in a no-swap, no-compaction environment. This problem happened on v3.4, but the issue still seems to exist in the current tree. The reason is that do_try_to_free_pages() enters a livelock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced, since kswapd thinks there is little point in trying all over again and wants to avoid an infinite loop. Instead it changes the order from high-order to 0-order because kswapd thinks order-0 is the most important. See commit 73ce02e9 for the details. If the watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable = 0. It assumes high-order users can still perform direct reclaim if they wish. Direct reclaim keeps reclaiming for a high order which is not a COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on zone->all_unreclaimable; this is to avoid a too-early oom-kill. So direct reclaim depends on kswapd to break this loop. In the worst case, direct reclaim may keep reclaiming pages forever while kswapd sleeps forever, until something like a watchdog detects the stall and finally kills the process. As described in http://thread.gmane.org/gmane.linux.kernel.mm/103737, we can't turn on zone->all_unreclaimable from the direct reclaim path because the direct reclaim path doesn't take any lock, so that would be racy. Thus this patch removes the zone->all_unreclaimable field completely and recalculates the zone's reclaimable state every time. Note: we can't take the approach of letting direct reclaim look at zone->pages_scanned directly while kswapd continues to use zone->all_unreclaimable, because that is also racy; commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the details.
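For context on the hunk below: the mm/vmstat.c change only shows the call site, !zone_reclaimable(zone); the helper itself is defined in mm/vmscan.c by this same patch (and uninlined per the akpm note that follows), which is presumably why the hunk also pulls in "internal.h". The sketch below shows roughly what the recalculated check looks like -- it is reconstructed for illustration rather than quoted from the patch, so treat the exact constant as approximate: a zone counts as unreclaimable once it has been scanned several times over the number of pages it could still reclaim.

/* Sketch of the replacement check (lives in mm/vmscan.c in this series,
 * not in the mm/vmstat.c hunk below; the factor of six is approximate). */
unsigned long zone_reclaimable_pages(struct zone *zone)
{
        unsigned long nr;

        nr = zone_page_state(zone, NR_ACTIVE_FILE) +
             zone_page_state(zone, NR_INACTIVE_FILE);

        /* anon pages only count as reclaimable when swap space is available */
        if (get_nr_swap_pages() > 0)
                nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                      zone_page_state(zone, NR_INACTIVE_ANON);

        return nr;
}

bool zone_reclaimable(struct zone *zone)
{
        /* "unreclaimable": the zone has been scanned several times over
         * its reclaimable pages without freeing anything */
        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}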
[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar Cc: Ying Han Cc: Nick Piggin Acked-by: Rik van Riel Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Christoph Lameter Cc: Bob Liu Cc: Neil Zhang Cc: Russell King - ARM Linux Reviewed-by: Michal Hocko Acked-by: Minchan Kim Acked-by: Johannes Weiner Signed-off-by: KOSAKI Motohiro Signed-off-by: Lisa Du Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmstat.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm/vmstat.c') diff --git a/mm/vmstat.c b/mm/vmstat.c index d57a09143bf9..9bb314577911 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -19,6 +19,9 @@ #include #include #include +#include + +#include "internal.h" #ifdef CONFIG_VM_EVENT_COUNTERS DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; @@ -1088,7 +1091,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n all_unreclaimable: %u" "\n start_pfn: %lu" "\n inactive_ratio: %u", - zone->all_unreclaimable, + !zone_reclaimable(zone), zone->zone_start_pfn, zone->inactive_ratio); seq_putc(m, '\n'); -- cgit v1.2.3
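Taken together, the counters and fields this series touches are all visible from user space. The program below is an illustration only (not part of any patch above): it prints the /proc/vmstat lines for the TLB flush events added in the first two patches and the nr_alloc_batch counter added by the fair zone allocator patch; on a UP kernel the nr_tlb_remote_* lines are simply absent. The all_unreclaimable value from the last patch shows up in /proc/zoneinfo instead, where it is now computed as !zone_reclaimable(zone) rather than read from a stored flag.

/* Illustration only: print the vmstat counters introduced by this series.
 * The names come straight from the vmstat_text[] entries added above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char name[128];
        unsigned long long val;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f) {
                perror("/proc/vmstat");
                return 1;
        }

        while (fscanf(f, "%127s %llu", name, &val) == 2) {
                if (!strncmp(name, "nr_tlb_", 7) ||
                    !strcmp(name, "nr_alloc_batch"))
                        printf("%-32s %llu\n", name, val);
        }

        fclose(f);
        return 0;
}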