summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2018-04-19block: fix use-after-free in seq fileVegard Nossum
I got a KASAN report of use-after-free: ================================================================== BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508 Read of size 8 by task trinity-c1/315 ============================================================================= BUG kmalloc-32 (Not tainted): kasan: bad access detected ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315 ___slab_alloc+0x4f1/0x520 __slab_alloc.isra.58+0x56/0x80 kmem_cache_alloc_trace+0x260/0x2a0 disk_seqf_start+0x66/0x110 traverse+0x176/0x860 seq_read+0x7e3/0x11a0 proc_reg_read+0xbc/0x180 do_loop_readv_writev+0x134/0x210 do_readv_writev+0x565/0x660 vfs_readv+0x67/0xa0 do_preadv+0x126/0x170 SyS_preadv+0xc/0x10 do_syscall_64+0x1a1/0x460 return_from_SYSCALL_64+0x0/0x6a INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315 __slab_free+0x17a/0x2c0 kfree+0x20a/0x220 disk_seqf_stop+0x42/0x50 traverse+0x3b5/0x860 seq_read+0x7e3/0x11a0 proc_reg_read+0xbc/0x180 do_loop_readv_writev+0x134/0x210 do_readv_writev+0x565/0x660 vfs_readv+0x67/0xa0 do_preadv+0x126/0x170 SyS_preadv+0xc/0x10 do_syscall_64+0x1a1/0x460 return_from_SYSCALL_64+0x0/0x6a CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G B 4.7.0+ #62 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480 ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480 ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970 Call Trace: [<ffffffff81d6ce81>] dump_stack+0x65/0x84 [<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0 [<ffffffff814704ff>] object_err+0x2f/0x40 [<ffffffff814754d1>] kasan_report_error+0x221/0x520 [<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40 [<ffffffff83888161>] klist_iter_exit+0x61/0x70 [<ffffffff82404389>] class_dev_iter_exit+0x9/0x10 [<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50 [<ffffffff8151f812>] seq_read+0x4b2/0x11a0 [<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180 [<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210 [<ffffffff814b4c45>] do_readv_writev+0x565/0x660 [<ffffffff814b8a17>] vfs_readv+0x67/0xa0 [<ffffffff814b8de6>] do_preadv+0x126/0x170 [<ffffffff814b92ec>] SyS_preadv+0xc/0x10 This problem can occur in the following situation: open() - pread() - .seq_start() - iter = kmalloc() // succeeds - seqf->private = iter - .seq_stop() - kfree(seqf->private) - pread() - .seq_start() - iter = kmalloc() // fails - .seq_stop() - class_dev_iter_exit(seqf->private) // boom! old pointer As the comment in disk_seqf_stop() says, stop is called even if start failed, so we need to reinitialise the private pointer to NULL when seq iteration stops. An alternative would be to set the private pointer to NULL when the kmalloc() in disk_seqf_start() fails. Bug 1823317 Bug 1935735 Change-Id: Ic3f82ef82c570866b48c5ea8e195d8e504570d80 Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Gagan Grover <ggrover@nvidia.com> Reviewed-on: http://git-master/r/1259961 (cherry picked from commit 9d4a4a8711e9570c3ead013b64ff6e8bad05afbc) Reviewed-on: https://git-master.nvidia.com/r/1690293 GVS: Gerrit_Virtual_Submit Reviewed-by: Bibek Basu <bbasu@nvidia.com> Tested-by: Amulya Yarlagadda <ayarlagadda@nvidia.com> Reviewed-by: Winnie Hsu <whsu@nvidia.com>
2014-05-27fixup! kernel: remove CONFIG_USE_GENERIC_SMP_HELPERSDan Willemsen
This file was added as a result of bad conflict resolution in this backport. Remove it, as it isn't used. Signed-off-by: Dan Willemsen <dwillemsen@nvidia.com> Change-Id: Ide9fc67ca714d62a2fbec47d67922879bf3fa3f4 Reviewed-on: http://git-master/r/409755 Reviewed-by: Ishan Mittal <imittal@nvidia.com> Tested-by: Ishan Mittal <imittal@nvidia.com> Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-04-30kernel: remove CONFIG_USE_GENERIC_SMP_HELPERSChristoph Hellwig
We've switched over every architecture that supports SMP to it, so remove the new useless config variable. Conflicts: arch/arm/Kconfig block/blk-mq.c Change-Id: Ic19c3ac07a38a1636d6aa2fed5e55a58833f9b2c Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 0a06ff068f1255bcd7965ab07bc0f4adc3eb639a) Signed-off-by: Ishan Mittal <imittal@nvidia.com>
2014-03-13Merge branch 'linux-3.10.33' into dev-kernel-3.10Deepak Nibade
Bug 1456092 Change-Id: I3021247ec68a3c2dddd9e98cde13d70a45191d53 Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
2014-02-22block: add cond_resched() to potentially long running ioctl discard loopJens Axboe
commit c8123f8c9cb517403b51aa41c3c46ff5e10b2c17 upstream. When mkfs issues a full device discard and the device only supports discards of a smallish size, we can loop in blkdev_issue_discard() for a long time. If preempt isn't enabled, this can turn into a softlock situation and the kernel will start complaining. Add an explicit cond_resched() at the end of the loop to avoid that. Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22block: __elv_next_request() shouldn't call into the elevator if bypassingTejun Heo
commit 556ee818c06f37b2e583af0363e6b16d0e0270de upstream. request_queue bypassing is used to suppress higher-level function of a request_queue so that they can be switched, reconfigured and shut down. A request_queue does the followings while bypassing. * bypasses elevator and io_cq association and queues requests directly to the FIFO dispatch queue. * bypasses block cgroup request_list lookup and always uses the root request_list. Once confirmed to be bypassing, specific elevator and block cgroup policy implementations can assume that nothing is in flight for them and perform various operations which would be dangerous otherwise. Such confirmation is acheived by short-circuiting all new requests directly to the dispatch queue and waiting for all the requests which were issued before to finish. Unfortunately, while the request allocating and draining sides were properly handled, we forgot to actually plug the request dispatch path. Even after bypassing mode is confirmed, if the attached driver tries to fetch a request and the dispatch queue is empty, __elv_next_request() would invoke the current elevator's elevator_dispatch_fn() callback. As all in-flight requests were drained, the elevator wouldn't contain any request but once bypass is confirmed we don't even know whether the elevator is even there. It might be in the process of being switched and half torn down. Frank Mayhar reports that this actually happened while switching elevators, leading to an oops. Let's fix it by making __elv_next_request() avoid invoking the elevator_dispatch_fn() callback if the queue is bypassing. It already avoids invoking the callback if the queue is dying. As a dying queue is guaranteed to be bypassing, we can simply replace blk_queue_dying() check with blk_queue_bypass(). Reported-by: Frank Mayhar <fmayhar@google.com> References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com Tested-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-16Merge tag 'v3.10.24' into HEADAjay Nandakumar
This is the 3.10.24 stable release Change-Id: Ibd2734f93d44385ab86867272a1359158635133b
2013-12-11Update of blkg_stat and blkg_rwstat may happen in bh context. While ↵Hong Zhiguo
u64_stats_fetch_retry is only preempt_disable on 32bit UP system. This is not enough to avoid preemption by bh and may read strange 64 bit value. commit 2c575026fae6e63771bd2a4c1d407214a8096a89 upstream. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08elevator: acquire q->sysfs_lock in elevator_change()Tomoki Sekiyama
commit 7c8a3679e3d8e9d92d58f282161760a0e247df97 upstream. Add locking of q->sysfs_lock into elevator_change() (an exported function) to ensure it is held to protect q->elevator from elevator_init(), even if elevator_change() is called from non-sysfs paths. sysfs path (elv_iosched_store) uses __elevator_change(), non-locking version, as the lock is already taken by elv_iosched_store(). Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08elevator: Fix a race in elevator switching and md device initializationTomoki Sekiyama
commit eb1c160b22655fd4ec44be732d6594fd1b1e44f4 upstream. The soft lockup below happens at the boot time of the system using dm multipath and the udev rules to switch scheduler. [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483] [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50 ... [ 356.127001] Call Trace: [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70 [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230 [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60 [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0 [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50 [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0 [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50 [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0 [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140 [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0 [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0 [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b This is caused by a race between md device initialization by multipathd and shell script to switch the scheduler using sysfs. - multipathd: SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init q->elevator = elevator_alloc(q, e); // not yet initialized - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler': elevator_switch (in the call trace above) struct elevator_queue *old = q->elevator; q->elevator = elevator_alloc(q, new_e); elevator_exit(old); // lockup! (*) - multipathd: (cont.) err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely while timer->base == NULL. In this case, as timer will never initialized, it results in lockup. This patch introduces acquisition of q->sysfs_lock around elevator_init() into blk_init_allocated_queue(), to provide mutual exclusion between initialization of the q->scheduler and switching of the scheduler. This should fix this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=902012 Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04blk-core: Fix memory corruption if blkcg_init_queue failsMikulas Patocka
commit fff4996b7db7955414ac74386efa5e07fd766b50 upstream. If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy to clean up structures allocated by the backing dev. ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0() ODEBUG: free active (active state 0) object type: percpu_counter hint: (null) Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix CPU: 0 PID: 2739 Comm: lvchange Tainted: G W 3.10.15-devel #14 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20 ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8 ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67 Call Trace: [<ffffffff813c8fd4>] dump_stack+0x19/0x1b [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0 [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50 [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250 [<ffffffff81229a15>] debug_print_object+0x85/0xa0 [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250 [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0 [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0 [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10 [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod] [<ffffffff81097662>] ? get_lock_stats+0x22/0x70 [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod] [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520 [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0 [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f ---[ end trace 4b5ff0d55673d986 ]--- ------------[ cut here ]------------ This fix should be backported to stable kernels starting with 2.6.37. Note that in the kernels prior to 3.5 the affected code is different, but the bug is still there - bdi_init is called and bdi_destroy isn't. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29block: properly stack underlying max_segment_size to DM deviceMike Snitzer
commit d82ae52e68892338068e7559a0c0657193341ce4 upstream. Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE (65536) even if the underlying device(s) have a larger value -- this is due to blk_stack_limits() using min_not_zero() when stacking the max_segment_size limit. 1073741824 before patch: 65536 after patch: 1073741824 Reported-by: Lukasz Flis <l.flis@cyfronet.pl> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29block: fix race between request completion and timeout handlingJeff Moyer
commit 4912aa6c11e6a5d910264deedbec2075c6f1bb73 upstream. crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461 RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP: 0018:ffff881057eefd60 EFLAGS: 00010012 RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8 RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780 RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338 R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000 FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540) Stack: 0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000 <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393 <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90 Call Trace: [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150 [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850 [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20 [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160 [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660 [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660 [<ffffffff810908c6>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff81090830>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20 Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP <ffff881057eefd60> The RIP is this line: BUG_ON(blk_queued_rq(rq)); After digging through the code, I think there may be a race between the request completion and the timer handler running. A timer is started for each request put on the device's queue (see blk_start_request->blk_add_timer). If the request does not complete before the timer expires, the timer handler (blk_rq_timed_out_timer) will mark the request complete atomically: static inline int blk_mark_rq_complete(struct request *rq) { return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags); } and then call blk_rq_timed_out. The latter function will call scsi_times_out, which will return one of BLK_EH_HANDLED, BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is returned, blk_clear_rq_complete is called, and blk_add_timer is again called to simply wait longer for the request to complete. Now, if the request happens to complete while this is going on, what happens? Given that we know the completion handler will bail if it finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion handler running after that bit is cleared. So, from the above paragraph, after the call to blk_clear_rq_complete. If the completion sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there (I haven't seen this in the cores). Next, if we get the completion before the call to list_add_tail, then the timer will eventually fire for an old req, which may either be freed or reallocated (there is evidence that this might be the case). Finally, if the completion comes in *after* the addition to the timeout list, I think it's harmless. The request will be removed from the timeout list, req_atom_complete will be set, and all will be well. This will only actually explain the coredumps *IF* the request structure was freed, reallocated *and* queued before the error handler thread had a chance to process it. That is possible, but it may make sense to keep digging for another race. I think that if this is what was happening, we would see other instances of this problem showing up as null pointer or garbage pointer dereferences, for example when the request structure was not re-used. It looks like we actually do run into that situation in other reports. This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); from blk_add_timer to the only caller that could trip over it (blk_start_request). It then inverts the calls to blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address the race. I've boot tested this patch, but nothing more. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-05fs: partitions: efi: Fix bound checkAntti P Miettinen
Use ARRAY_SIZE instead of sizeof to get proper max for label length. Bug 1401569 Change-Id: I98cfcca33bb8da18a791750ba8fa85d76a9afc05 Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com> Reviewed-on: http://git-master/r/326759 GVS: Gerrit_Virtual_Submit Reviewed-by: Hiroshi Doyu <hdoyu@nvidia.com> Tested-by: Hiroshi Doyu <hdoyu@nvidia.com>
2013-10-31Merge tag 'v3.10.17' into dev-kernel-3.10Ajay Nandakumar
This is the 3.10.17 stable release Conflicts: drivers/usb/host/xhci.c Change-Id: I6bd3b15ff92a0b94568b9d02e9bb1036becfca20
2013-10-01cfq: explicitly use 64bit divide operation for 64bit argumentsAnatol Pomozov
commit f3cff25f05f2ac29b2ee355e611b0657482f6f1d upstream. 'samples' is 64bit operant, but do_div() second parameter is 32. do_div silently truncates high 32 bits and calculated result is invalid. In case if low 32bit of 'samples' are zeros then do_div() produces kernel crash. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Jonghwan Choi <jhbird.choi@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-14fs: partitions: efi: Add force_gpt_sector parameterColin Cross
force_gpt_sector=<sector> causes the GPT partition search to look at the specified sector for a valid GPT header if the GPT is not found at the beginning or the end of the block device. Change-Id: I9b5f85ce24719c0538d42ec5a94344c7f6556b2b Signed-off-by: Colin Cross <ccross@android.com> Rebase-Id: Rbcbb8f2f0882f3750e748b5cc83038ca0f5940db
2013-09-14Merge commit 'a88f9e27498afaea615ad3e93af4f26df1f84987' into ↵Dan Willemsen
after-upstream-android Conflicts: arch/arm/common/Kconfig arch/arm/mm/Makefile arch/arm/mm/cache-l2x0.c arch/arm/mm/mmu.c drivers/input/Kconfig drivers/input/Makefile drivers/power/Kconfig kernel/futex.c
2013-08-20elevator: Fix a race in elevator switchingJianpeng Ma
commit d50235b7bc3ee0a0427984d763ea7534149531b4 upstream. There's a race between elevator switching and normal io operation. Because the allocation of struct elevator_queue and struct elevator_data don't in a atomic operation.So there are have chance to use NULL ->elevator_data. For example: Thread A: Thread B blk_queu_bio elevator_switch spin_lock_irq(q->queue_block) elevator_alloc elv_merge elevator_init_fn Because call elevator_alloc, it can't hold queue_lock and the ->elevator_data is NULL.So at the same time, threadA call elv_merge and nedd some info of elevator_data.So the crash happened. Move the elevator_alloc into func elevator_init_fn, it make the operations in a atomic operation. Using the follow method can easy reproduce this bug 1:dd if=/dev/sdb of=/dev/null 2:while true;do echo noop > scheduler;echo deadline > scheduler;done The test method also use this method. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13block: do not pass disk names as format stringsKees Cook
commit ffc8b30866879ed9ba62bd0a86fecdbd51cd3d19 upstream. Disk names may contain arbitrary strings, so they must not be interpreted as format strings. It seems that only md allows arbitrary strings to be used for disk names, but this could allow for a local memory corruption from uid 0 into ring 0. CVE-2013-2851 Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-01block: genhd: Add disk/partition specific uevent callbacks for partition infoSan Mehat
For disk devices, a new uevent parameter 'NPARTS' specifies the number of partitions detected by the kernel. Partition devices get 'PARTN' which specifies the partitions index in the table, and 'PARTNAME', which specifies PARTNAME specifices the partition name of a partition device Signed-off-by: Dima Zavin <dima@android.com>
2013-05-17blkpm: avoid sleep when holding queue lockAaron Lu
In blk_post_runtime_resume, an autosuspend request will be initiated for the device. Since we are holding the queue lock, we can't sleep and thus we should use the async version to initiate an autosuspend, i.e. pm_request_suspend instead of pm_runtime_suspend, which might sleep. Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-08Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...
2013-05-07aio: don't include aio.h in sched.hKent Overstreet
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-02Merge tag 'virtio-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio & lguest updates from Rusty Russell: "Lots of virtio work which wasn't quite ready for last merge window. Plus I dived into lguest again, reworking the pagetable code so we can move the switcher page: our fixmaps sometimes take more than 2MB now..." Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename. Hopefully correctly resolved. * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits) caif_virtio: Remove bouncing email addresses lguest: improve code readability in lg_cpu_start. virtio-net: fill only rx queues which are being used lguest: map Switcher below fixmap. lguest: cache last cpu we ran on. lguest: map Switcher text whenever we allocate a new pagetable. lguest: don't share Switcher PTE pages between guests. lguest: expost switcher_pages array (as lg_switcher_pages). lguest: extract shadow PTE walking / allocating. lguest: make check_gpte et. al return bool. lguest: assume Switcher text is a single page. lguest: rename switcher_page to switcher_pages. lguest: remove RESERVE_MEM constant. lguest: check vaddr not pgd for Switcher protection. lguest: prepare to make SWITCHER_ADDR a variable. virtio: console: replace EMFILE with EBUSY for already-open port virtio-scsi: reset virtqueue affinity when doing cpu hotplug virtio-scsi: introduce multiqueue support virtio-scsi: push vq lock/unlock into virtscsi_vq_done virtio-scsi: pass struct virtio_scsi to virtqueue completion function ...
2013-04-30partitions/efi.c: replace useless kzalloc's by kmalloc'sPhilippe De Muyter
In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated zones are either totally overwritten by the following read_lba call, or freed. As kmalloc is cheaper than kzalloc, use kmalloc. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Matt Domsch <Matt_Domsch@dell.com> Cc: Panagiotis Issaris <takis@issaris.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-29Merge branch 'for-3.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...
2013-04-29Merge tag 'driver-core-3.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg Kroah-Hartman: "Here's the merge request for the driver core tree for 3.10-rc1 It's pretty small, just a number of driver core and sysfs updates and fixes, all of which have been in linux-next for a while now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better fix for the same sysfs file mode problem. * tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: PM / Runtime: Idle devices asynchronously after probe|release driver core: handle user namespaces properly with the uid/gid devtmpfs change driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS devtmpfs: add base.h include driver core: add uid and gid to devtmpfs sysfs: check if one entry has been removed before freeing sysfs: fix crash_notes_size build warning sysfs: fix use after free in case of concurrent read/write and readdir rtmutex-tester: fix mode of sysfs files Documentation: Add ABI entry for crash_notes and crash_notes_size sysfs: Add crash_notes_size to export percpu note size driver core: platform_device.h: fix checkpatch errors and warnings driver core: platform.c: fix checkpatch errors and warnings driver core: warn that platform_driver_probe can not use deferred probing sysfs: use atomic_inc_unless_negative in sysfs_get_active base: core: WARN() about bogus permissions on device attributes device: separate all subsys mutexes
2013-04-18Revert "block: add missing block_bio_complete() tracepoint"Linus Torvalds
This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-14Merge 3.9-rc7 into driver-core-nextGreg Kroah-Hartman
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-11driver core: handle user namespaces properly with the uid/gid devtmpfs changeGreg Kroah-Hartman
Now that devtmpfs is caring about uid/gid, we need to use the correct internal types so users who have USER_NS enabled will have things work properly for them. Thanks to Eric for pointing this out, and the patch review. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-09blkcg: fix "scheduling while atomic" in blk_queue_bypass_startJun'ichi Nomura
Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()"), the following warning appears when multipath is used with CONFIG_PREEMPT=y. This patch moves blk_queue_bypass_start() before radix_tree_preload() to avoid the sleeping call while preemption is disabled. BUG: scheduling while atomic: multipath/2460/0x00000002 1 lock held by multipath/2460: #0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod] Modules linked in: ... Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1 Call Trace: [<ffffffff810723ae>] __schedule_bug+0x6a/0x78 [<ffffffff81428ba2>] __schedule+0xb4/0x5e0 [<ffffffff814291e6>] schedule+0x64/0x66 [<ffffffff8142773a>] schedule_timeout+0x39/0xf8 [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29 [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb [<ffffffff814289e3>] wait_for_common+0x9d/0xee [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206 [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a [<ffffffff8106fb18>] ? complete+0x21/0x53 [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20 [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62 [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270 [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108 [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod] [<ffffffff811d8c41>] elevator_init+0xe1/0x115 [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59 [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod] [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod] [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod] [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod] [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod] [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441 [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99 [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82 [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-08driver core: add uid and gid to devtmpfsKay Sievers
Some drivers want to tell userspace what uid and gid should be used for their device nodes, so allow that information to percolate through the driver core to userspace in order to make this happen. This means that some systems (i.e. Android and friends) will not need to even run a udev-like daemon for their device node manager and can just rely in devtmpfs fully, reducing their footprint even more. Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-08Revert "loop: cleanup partitions when detaching loop device"Jens Axboe
This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-03block: avoid using uninitialized value in from queue_var_storeArnd Bergmann
As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions that use a value generated by queue_var_store independent of whether that value was set or not. block/blk-sysfs.c: In function 'queue_store_nonrot': block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] Unlike most other such warnings, this one is not a false positive, writing any non-number string into the sysfs files indeed has an undefined result, rather than returning an error. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-02Merge branch 'writeback-workqueue' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available") 2f6db2a707 ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-24Merge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into ↵Jens Axboe
for-3.10/core This contains Kents prep work for the immutable bio_vecs.
2013-03-23block: Add bio_end_sector()Kent Overstreet
Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2013-03-23block: Refactor blk_update_request()Kent Overstreet
Converts it to use bio_advance(), simplifying it quite a bit in the process. Note that req_bio_endio() now always calls bio_advance() - which means it always loops over the biovec, not just on partial completions. Don't expect it to affect performance, but worth noting. Tested it by forcing partial updates, and dumping before and after on various bio/bvec fields when doing a partial update. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>
2013-03-22block: implement runtime pm strategyLin Ming
When a request is added: If device is suspended or is suspending and the request is not a PM request, resume the device. When the last request finishes: Call pm_runtime_mark_last_busy(). When pick a request: If device is resuming/suspending, then only PM request is allowed to go. The idea and API is designed by Alan Stern and described here: http://marc.info/?l=linux-scsi&m=133727953625963&w=2 Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22block: add runtime pm helpersLin Ming
Add runtime pm helper functions: void blk_pm_runtime_init(struct request_queue *q, struct device *dev) - Initialization function for drivers to call. int blk_pre_runtime_suspend(struct request_queue *q) - If any requests are in the queue, mark last busy and return -EBUSY. Otherwise set q->rpm_status to RPM_SUSPENDING and return 0. void blk_post_runtime_suspend(struct request_queue *q, int err) - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED. Otherwise set it to RPM_ACTIVE and mark last busy. void blk_pre_runtime_resume(struct request_queue *q) - Set q->rpm_status to RPM_RESUMING. void blk_post_runtime_resume(struct request_queue *q, int err) - If the resume succeeded then set q->rpm_status to RPM_ACTIVE and call __blk_run_queue, then mark last busy and autosuspend. Otherwise set q->rpm_status to RPM_SUSPENDED. The idea and API is designed by Alan Stern and described here: http://marc.info/?l=linux-scsi&m=133727953625963&w=2 Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22Block: blk-flush: Fixed indent code styleAlice Ferrazzi
Fixed code indent should use tabs where possible. Signed-off-by: Alice Ferrazzi <alice.ferrazzi@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22loop: cleanup partitions when detaching loop devicePhillip Susi
Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-20scatterlist: introduce sg_unmark_endPaolo Bonzini
This is useful in places that recycle the same scatterlist multiple times, and do not want to incur the cost of sg_init_table every time in hot paths. Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-03-04cgroup: fix cgroup_path() vs rename() raceLi Zefan
rename() will change dentry->d_name. The result of this race can be worse than seeing partially rewritten name, but we might access a stale pointer because rename() will re-allocate memory to hold a longer name. As accessing dentry->name must be protected by dentry->d_lock or parent inode's i_mutex, while on the other hand cgroup-path() can be called with some irq-safe spinlocks held, we can't generate cgroup path using dentry->d_name. Alternatively we make a copy of dentry->d_name and save it in cgrp->name when a cgroup is created, and update cgrp->name at rename(). v5: use flexible array instead of zero-size array. v4: - allocate root_cgroup_name and all root_cgroup->name points to it. - add cgroup_name() wrapper. v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path. v2: make cgrp->name RCU safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-28Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions: optimize memory allocation in check_partition()Ming Lei
Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so it is easy to trigger page allocation failure by check_partition, especially in hotplug block device situation(such as, USB mass storage, MMC card, ...), and Felipe Balbi has observed the failure. This patch does below optimizations on the allocation of struct parsed_partitions to try to address the issue: - make parsed_partitions.parts as pointer so that the pointed memory can fit in 32KB buffer, then approximate 32KB memory can be saved - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is still a bit big for kmalloc - given that many devices have the partition count limit, so only allocate disk_max_parts() partitions instead of 256 partitions always Signed-off-by: Ming Lei <ming.lei@canonical.com> Reported-by: Felipe Balbi <balbi@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions/mac.c: obey the state->limit constraintMing Lei
It isn't necessary to read the information of partitions whose number is equal and more than state->limit since only maximum state->limit partitions will be added inside rescan_partitions(). That is also what other kind of partitions are doing. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions/efi.c: ensure that the GPT header is at least the size of ↵Peter Jones
the structure. UEFI 2.3.1D will include a change to the spec language mandating that a GPT header must be greater than *or equal to* the size of the defined structure. While verifying that this would work on Linux, I discovered that we're not actually checking the minimum bound at all. The result of this is that when we verify the checksum, it's possible that on a malformed header (with header_size of 0), we won't actually verify any data. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>