summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-08-24block: better op and flags encodingChristoph Hellwig
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773) Conflicts: block/blk-mq.c include/linux/blk_types.h include/linux/blkdev.h
2018-08-15proc: Fix proc_sys_prune_dcache to hold a sb referenceEric W. Biederman
commit 2fd1d2c4ceb2248a727696962cf3370dc9f5a0a4 upstream. Andrei Vagin writes: FYI: This bug has been reproduced on 4.11.7 > BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc] > ------------[ cut here ]------------ > WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80 > CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1 > Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015 > Workqueue: events proc_cleanup_work > Call Trace: > dump_stack+0x63/0x86 > __warn+0xcb/0xf0 > warn_slowpath_null+0x1d/0x20 > umount_check+0x6e/0x80 > d_walk+0xc6/0x270 > ? dentry_free+0x80/0x80 > do_one_tree+0x26/0x40 > shrink_dcache_for_umount+0x2d/0x90 > generic_shutdown_super+0x1f/0xf0 > kill_anon_super+0x12/0x20 > proc_kill_sb+0x40/0x50 > deactivate_locked_super+0x43/0x70 > deactivate_super+0x5a/0x60 > cleanup_mnt+0x3f/0x90 > mntput_no_expire+0x13b/0x190 > kern_unmount+0x3e/0x50 > pid_ns_release_proc+0x15/0x20 > proc_cleanup_work+0x15/0x20 > process_one_work+0x197/0x450 > worker_thread+0x4e/0x4a0 > kthread+0x109/0x140 > ? process_one_work+0x450/0x450 > ? kthread_park+0x90/0x90 > ret_from_fork+0x2c/0x40 > ---[ end trace e1c109611e5d0b41 ]--- > VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day... > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: _raw_spin_lock+0xc/0x30 > PGD 0 Fix this by taking a reference to the super block in proc_sys_prune_dcache. The superblock reference is the core of the fix however the sysctl_inodes list is converted to a hlist so that hlist_del_init_rcu may be used. This allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while not causing problems for proc_sys_evict_inode when if it later choses to remove the inode from the sysctl_inodes list. Removing inodes from the sysctl_inodes list allows proc_sys_prune_dcache to have a progress guarantee, while still being able to drop all locks. The fact that head->unregistering is set in start_unregistering ensures that no more inodes will be added to the the sysctl_inodes list. Previously the code did a dance where it delayed calling iput until the next entry in the list was being considered to ensure the inode remained on the sysctl_inodes list until the next entry was walked to. The structure of the loop in this patch does not need that so is much easier to understand and maintain. Cc: stable@vger.kernel.org Reported-by: Andrei Vagin <avagin@gmail.com> Tested-by: Andrei Vagin <avagin@openvz.org> Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.") Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15proc/sysctl: Don't grab i_lock under sysctl_lock.Eric W. Biederman
commit ace0c791e6c3cf5ef37cad2df69f0d90ccc40ffb upstream. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This patch has locking problem. I've got lockdep splat under LTP. > > [ 6633.115456] ====================================================== > [ 6633.115502] [ INFO: possible circular locking dependency detected ] > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L > [ 6633.115584] ------------------------------------------------------- > [ 6633.115627] ksm02/284980 is trying to acquire lock: > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80 > [ 6633.115834] but task is already holding lock: > [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110 > [ 6633.116026] which lock already depends on the new lock. > [ 6633.116026] > [ 6633.116080] > [ 6633.116080] the existing dependency chain (in reverse order) is: > [ 6633.116117] > -> #2 (sysctl_lock){+.+...}: > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}: > -> #0 (&sb->s_type->i_lock_key#4){+.+...}: > > d_lock nests inside i_lock > sysctl_lock nests inside d_lock in d_compare > > This patch adds i_lock nesting inside sysctl_lock. Al Viro <viro@ZenIV.linux.org.uk> replied: > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(), > drop sysctl_lock() before it and regain after. Make sure that no inodes > are added to the list ones ->unregistering has been set and use RCU list > primitives for modifying the inode list, with sysctl_lock still used to > serialize its modifications. > > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing > igrab() is safe there. Since we don't drop inode reference until after we'd > passed beyond it in the list, list_for_each_entry_rcu() should be fine. I agree with Al Viro's analsysis of the situtation. Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15proc/sysctl: prune stale dentries during unregisteringKonstantin Khlebnikov
commit d6cffbbe9a7e51eb705182965a189457c17ba8a3 upstream. Currently unregistering sysctl table does not prune its dentries. Stale dentries could slowdown sysctl operations significantly. For example, command: # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done creates a millions of stale denties around sysctls of loopback interface: # sysctl fs.dentry-state fs.dentry-state = 25812579 24724135 45 0 0 0 All of them have matching names thus lookup have to scan though whole hash chain and call d_compare (proc_sys_compare) which checks them under system-wide spinlock (sysctl_lock). # time sysctl -a > /dev/null real 1m12.806s user 0m0.016s sys 1m12.400s Currently only memory reclaimer could remove this garbage. But without significant memory pressure this never happens. This patch collects sysctl inodes into list on sysctl table header and prunes all their dentries once that table unregisters. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > On 10.02.2017 10:47, Al Viro wrote: >> how about >> the matching stats *after* that patch? > > dcache size doesn't grow endlessly, so stats are fine > > # sysctl fs.dentry-state > fs.dentry-state = 92712 58376 45 0 0 0 > > # time sysctl -a &>/dev/null > > real 0m0.013s > user 0m0.004s > sys 0m0.008s Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15fix __legitimize_mnt()/mntput() raceAl Viro
commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream. __legitimize_mnt() has two problems - one is that in case of success the check of mount_lock is not ordered wrt preceding increment of refcount, making it possible to have successful __legitimize_mnt() on one CPU just before the otherwise final mntpu() on another, with __legitimize_mnt() not seeing mntput() taking the lock and mntput() not seeing the increment done by __legitimize_mnt(). Solved by a pair of barriers. Another is that failure of __legitimize_mnt() on the second read_seqretry() leaves us with reference that'll need to be dropped by caller; however, if that races with final mntput() we can end up with caller dropping rcu_read_lock() and doing mntput() to release that reference - with the first mntput() having freed the damn thing just as rcu_read_lock() had been dropped. Solution: in "do mntput() yourself" failure case grab mount_lock, check if MNT_DOOMED has been set by racing final mntput() that has missed our increment and if it has - undo the increment and treat that as "failure, caller doesn't need to drop anything" case. It's not easy to hit - the final mntput() has to come right after the first read_seqretry() in __legitimize_mnt() *and* manage to miss the increment done by __legitimize_mnt() before the second read_seqretry() in there. The things that are almost impossible to hit on bare hardware are not impossible on SMP KVM, though... Reported-by: Oleg Nesterov <oleg@redhat.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15fix mntput/mntput raceAl Viro
commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream. mntput_no_expire() does the calculation of total refcount under mount_lock; unfortunately, the decrement (as well as all increments) are done outside of it, leading to false positives in the "are we dropping the last reference" test. Consider the following situation: * mnt is a lazy-umounted mount, kept alive by two opened files. One of those files gets closed. Total refcount of mnt is 2. On CPU 42 mntput(mnt) (called from __fput()) drops one reference, decrementing component * After it has looked at component #0, the process on CPU 0 does mntget(), incrementing component #0, gets preempted and gets to run again - on CPU 69. There it does mntput(), which drops the reference (component #69) and proceeds to spin on mount_lock. * On CPU 42 our first mntput() finishes counting. It observes the decrement of component #69, but not the increment of component #0. As the result, the total it gets is not 1 as it should've been - it's 0. At which point we decide that vfsmount needs to be killed and proceed to free it and shut the filesystem down. However, there's still another opened file on that filesystem, with reference to (now freed) vfsmount, etc. and we are screwed. It's not a wide race, but it can be reproduced with artificial slowdown of the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups. Fix consists of moving the refcount decrement under mount_lock; the tricky part is that we want (and can) keep the fast case (i.e. mount that still has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu() before that mntput(). IOW, if mntput() observes (under rcu_read_lock()) a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to be dropped. Reported-by: Jann Horn <jannh@google.com> Tested-by: Jann Horn <jannh@google.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15make sure that __dentry_kill() always invalidates d_seq, unhashed or notAl Viro
commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream. RCU pathwalk relies upon the assumption that anything that changes ->d_inode of a dentry will invalidate its ->d_seq. That's almost true - the one exception is that the final dput() of already unhashed dentry does *not* touch ->d_seq at all. Unhashing does, though, so for anything we'd found by RCU dcache lookup we are fine. Unfortunately, we can *start* with an unhashed dentry or jump into it. We could try and be careful in the (few) places where that could happen. Or we could just make the final dput() invalidate the damn thing, unhashed or not. The latter is much simpler and easier to backport, so let's do it that way. Reported-by: "Dae R. Jeong" <threeearcat@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15root dentries need RCU-delayed freeingAl Viro
commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream. Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15ext4: fix check to prevent initializing reserved inodesTheodore Ts'o
commit 5012284700775a4e6e3fbe7eac4c543c4874b559 upstream. Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is valid" will complain if block group zero does not have the EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct, since a freshly created file system has this flag cleared. It gets almost immediately after the file system is mounted read-write --- but the following somewhat unlikely sequence will end up triggering a false positive report of a corrupted file system: mkfs.ext4 /dev/vdc mount -o ro /dev/vdc /vdc mount -o remount,rw /dev/vdc Instead, when initializing the inode table for block group zero, test to make sure that itable_unused count is not too large, since that is the case that will result in some or all of the reserved inodes getting cleared. This fixes the failures reported by Eric Whiteney when running generic/230 and generic/231 in the the nojournal test case. Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid") Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09jfs: Fix inconsistency between memory allocation and ea_buf->max_sizeShankara Pailoor
commit 92d34134193e5b129dc24f8d79cb9196626e8d7a upstream. The code is assuming the buffer is max_size length, but we weren't allocating enough space for it. Signed-off-by: Shankara Pailoor <shankarapailoor@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09Btrfs: fix file data corruption after cloning a range and fsyncFilipe Manana
commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream. When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar <power fail> $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: stable@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09ext4: fix false negatives *and* false positives in ext4_check_descriptors()Theodore Ts'o
commit 44de022c4382541cebdd6de4465d1f4f465ff1dd upstream. Ext4_check_descriptors() was getting called before s_gdb_count was initialized. So for file systems w/o the meta_bg feature, allocation bitmaps could overlap the block group descriptors and ext4 wouldn't notice. For file systems with the meta_bg feature enabled, there was a fencepost error which would cause the ext4_check_descriptors() to incorrectly believe that the block allocation bitmap overlaps with the block group descriptor blocks, and it would reject the mount. Fix both of these problems. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Gilbert <bgilbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06squashfs: more metadata hardeningsLinus Torvalds
commit 71755ee5350b63fb1f283de8561cdb61b47f4d1d upstream. The squashfs fragment reading code doesn't actually verify that the fragment is inside the fragment table. The end result _is_ verified to be inside the image when actually reading the fragment data, but before that is done, we may end up taking a page fault because the fragment table itself might not even exist. Another report from Anatoly and his endless squashfs image fuzzing. Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com> Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>, Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06squashfs: more metadata hardeningLinus Torvalds
commit d512584780d3e6a7cacb2f482834849453d444a1 upstream. Anatoly reports another squashfs fuzzing issue, where the decompression parameters themselves are in a compressed block. This causes squashfs_read_data() to be called in order to read the decompression options before the decompression stream having been set up, making squashfs go sideways. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Acked-by: Phillip Lougher <phillip.lougher@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03ext4: check for allocation block validity with block group lockedTheodore Ts'o
commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef upstream. With commit 044e6e3d74a3: "ext4: don't update checksum of new initialized bitmaps" the buffer valid bit will get set without actually setting up the checksum for the allocation bitmap, since the checksum will get calculated once we actually allocate an inode or block. If we are doing this, then we need to (re-)check the verified bit after we take the block group lock. Otherwise, we could race with another process reading and verifying the bitmap, which would then complain about the checksum being invalid. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03ext4: fix inline data updates with checksums enabledTheodore Ts'o
commit 362eca70b53389bddf3143fe20f53dcce2cfdf61 upstream. The inline data code was updating the raw inode directly; this is problematic since if metadata checksums are enabled, ext4_mark_inode_dirty() must be called to update the inode's checksum. In addition, the jbd2 layer requires that get_write_access() be called before the metadata buffer is modified. Fix both of these problems. https://bugzilla.kernel.org/show_bug.cgi?id=200443 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03squashfs: be more careful about metadata corruptionLinus Torvalds
commit 01cfb7937a9af2abb1136c7e89fbf3fd92952956 upstream. Anatoly Trosinenko reports that a corrupted squashfs image can cause a kernel oops. It turns out that squashfs can end up being confused about negative fragment lengths. The regular squashfs_read_data() does check for negative lengths, but squashfs_read_metadata() did not, and the fragment size code just blindly trusted the on-disk value. Fix both the fragment parsing and the metadata reading code. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03fscrypt: use unbound workqueue for decryptionEric Biggers
[ Upstream commit 36dd26e0c8d42699eeba87431246c07c28075bae ] Improve fscrypt read performance by switching the decryption workqueue from bound to unbound. With the bound workqueue, when multiple bios completed on the same CPU, they were decrypted on that same CPU. But with the unbound queue, they are now decrypted in parallel on any CPU. Although fscrypt read performance can be tough to measure due to the many sources of variation, this change is most beneficial when decryption is slow, e.g. on CPUs without AES instructions. For example, I timed tarring up encrypted directories on f2fs. On x86 with AES-NI instructions disabled, the unbound workqueue improved performance by about 25-35%, using 1 to NUM_CPUs jobs with 4 or 8 CPUs available. But with AES-NI enabled, performance was unchanged to within ~2%. I also did the same test on a quad-core ARM CPU using xts-speck128-neon encryption. There performance was usually about 10% better with the unbound workqueue, bringing it closer to the unencrypted speed. The unbound workqueue may be worse in some cases due to worse locality, but I think it's still the better default. dm-crypt uses an unbound workqueue by default too, so this change makes fscrypt match. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03btrfs: qgroup: Finish rescan when hit the last leaf of extent treeQu Wenruo
[ Upstream commit ff3d27a048d926b3920ccdb75d98788c567cae0d ] Under the following case, qgroup rescan can double account cowed tree blocks: In this case, extent tree only has one tree block. - | transid=5 last committed=4 | btrfs_qgroup_rescan_worker() | |- btrfs_start_transaction() | | transid = 5 | |- qgroup_rescan_leaf() | |- btrfs_search_slot_for_read() on extent tree | Get the only extent tree block from commit root (transid = 4). | Scan it, set qgroup_rescan_progress to the last | EXTENT/META_ITEM + 1 | now qgroup_rescan_progress = A + 1. | | fs tree get CoWed, new tree block is at A + 16K | transid 5 get committed - | transid=6 last committed=5 | btrfs_qgroup_rescan_worker() | btrfs_qgroup_rescan_worker() | |- btrfs_start_transaction() | | transid = 5 | |- qgroup_rescan_leaf() | |- btrfs_search_slot_for_read() on extent tree | Get the only extent tree block from commit root (transid = 5). | scan it using qgroup_rescan_progress (A + 1). | found new tree block beyong A, and it's fs tree block, | account it to increase qgroup numbers. - In above case, tree block A, and tree block A + 16K get accounted twice, while qgroup rescan should stop when it already reach the last leaf, other than continue using its qgroup_rescan_progress. Such case could happen by just looping btrfs/017 and with some possibility it can hit such double qgroup accounting problem. Fix it by checking the path to determine if we should finish qgroup rescan, other than relying on next loop to exit. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeupsDavid Sterba
[ Upstream commit 3d3a2e610ea5e7c6d4f9481ecce5d8e2d8317843 ] Currently the code assumes that there's an implied barrier by the sequence of code preceding the wakeup, namely the mutex unlock. As Nikolay pointed out: I think this is wrong (not your code) but the original assumption that the RELEASE semantics provided by mutex_unlock is sufficient. According to memory-barriers.txt: Section 'LOCK ACQUISITION FUNCTIONS' states: (2) RELEASE operation implication: Memory operations issued before the RELEASE will be completed before the RELEASE operation has completed. Memory operations issued after the RELEASE *may* be completed before the RELEASE operation has completed. (I've bolded the may portion) The example given there: As an example, consider the following: *A = a; *B = b; ACQUIRE *C = c; *D = d; RELEASE *E = e; *F = f; The following sequence of events is acceptable: ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE So if we assume that *C is modifying the flag which the waitqueue is checking, and *E is the actual wakeup, then those accesses can be re-ordered... IMHO this code should be considered broken... Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03f2fs: fix race in between GC and atomic openChao Yu
[ Upstream commit 27319ba4044c0c67d62ae39e53c0118c89f0a029 ] Thread GC thread - f2fs_ioc_start_atomic_write - get_dirty_pages - filemap_write_and_wait_range - f2fs_gc - do_garbage_collect - gc_data_segment - move_data_page - f2fs_is_atomic_file - set_page_dirty - set_inode_flag(, FI_ATOMIC_FILE) Dirty data page can still be generated by GC in race condition as above call stack. This patch adds fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write to avoid such race. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03f2fs: Fix deadlock in shutdown ioctlSahitya Tummala
[ Upstream commit 60b2b4ee2bc01dd052f99fa9d65da2232102ef8e ] f2fs_ioc_shutdown() ioctl gets stuck in the below path when issued with F2FS_GOING_DOWN_FULLSYNC option. __switch_to+0x90/0xc4 percpu_down_write+0x8c/0xc0 freeze_super+0xec/0x1e4 freeze_bdev+0xc4/0xcc f2fs_ioctl+0xc0c/0x1ce0 f2fs_compat_ioctl+0x98/0x1f0 Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03f2fs: fix to wait page writeback during revoking atomic writeChao Yu
[ Upstream commit e5e5732d8120654159254c16834bc8663d8be124 ] After revoking atomic write, related LBA can be reused by others, so we need to wait page writeback before reusing the LBA, in order to avoid interference between old atomic written in-flight IO and new IO. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03f2fs: fix to don't trigger writeback during recoveryChao Yu
[ Upstream commit 64c74a7ab505ea40d1b3e5d02735ecab08ae1b14 ] - f2fs_fill_super - recover_fsync_data - recover_data - del_fsync_inode - iput - iput_final - write_inode_now - f2fs_write_inode - f2fs_balance_fs - f2fs_balance_fs_bg - sync_dirty_inodes With data_flush mount option, during recovery, in order to avoid entering above writeback flow, let's detect recovery status and do skip in f2fs_balance_fs_bg. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03f2fs: fix error path of move_data_pageChao Yu
[ Upstream commit 14a28559f43ac7c0b98dd1b0e73ec9ec8ab4fc45 ] This patch fixes error path of move_data_page: - clear cold data flag if it fails to write page. - redirty page for non-ENOMEM case. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03disable loading f2fs module on PAGE_SIZE > 4KBAnatoly Pugachev
[ Upstream commit 4071e67cffcc5c2a007116a02437471351f550eb ] The following patch disables loading of f2fs module on architectures which have PAGE_SIZE > 4096 , since it is impossible to mount f2fs on such architectures , log messages are: mount: /mnt: wrong fs type, bad option, bad superblock on /dev/vdiskb1, missing codepage or helper program, or other error. /dev/vdiskb1: F2FS filesystem, UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name "" May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS filesystem in 1th superblock May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS filesystem in 2th superblock May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB which was introduced by git commit 5c9b469295fb6b10d98923eab5e79c4edb80ed20 tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425 with patch applied: modprobe: ERROR: could not insert 'f2fs': Invalid argument May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096 Signed-off-by: Anatoly Pugachev <matorola@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03pnfs: Don't release the sequence slot until we've processed layoutget on openTrond Myklebust
[ Upstream commit ae55e59da0e401893b3c52b575fc18a00623d0a1 ] If the server recalls the layout that was just handed out, we risk hitting a race as described in RFC5661 Section 2.10.6.3 unless we ensure that we release the sequence slot after processing the LAYOUTGET operation that was sent as part of the OPEN compound. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfoScott Mayhew
[ Upstream commit 3171822fdcdd6e6d536047c425af6dc7a92dc585 ] When running a fuzz tester against a KASAN-enabled kernel, the following splat periodically occurs. The problem occurs when the test sends a GETDEVICEINFO request with a malformed xdr array (size but no data) for gdia_notify_types and the array size is > 0x3fffffff, which results in an overflow in the value of nbytes which is passed to read_buf(). If the array size is 0x40000000, 0x80000000, or 0xc0000000, then after the overflow occurs, the value of nbytes 0, and when that happens the pointer returned by read_buf() points to the end of the xdr data (i.e. argp->end) when really it should be returning NULL. Fix this by returning NFS4ERR_BAD_XDR if the array size is > 1000 (this value is arbitrary, but it's the same threshold used by nfsd4_decode_bitmap()... in could really be any value >= 1 since it's expected to get at most a single bitmap in gdia_notify_types). [ 119.256854] ================================================================== [ 119.257611] BUG: KASAN: use-after-free in nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd] [ 119.258422] Read of size 4 at addr ffff880113ada000 by task nfsd/538 [ 119.259146] CPU: 0 PID: 538 Comm: nfsd Not tainted 4.17.0+ #1 [ 119.259662] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 [ 119.261202] Call Trace: [ 119.262265] dump_stack+0x71/0xab [ 119.263371] print_address_description+0x6a/0x270 [ 119.264609] kasan_report+0x258/0x380 [ 119.265854] ? nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd] [ 119.267291] nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd] [ 119.268549] ? nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd] [ 119.269873] ? nfsd4_decode_sequence+0x490/0x490 [nfsd] [ 119.271095] nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd] [ 119.272393] ? nfsd4_release_compoundargs+0x1b0/0x1b0 [nfsd] [ 119.273658] nfsd_dispatch+0x183/0x850 [nfsd] [ 119.274918] svc_process+0x161c/0x31a0 [sunrpc] [ 119.276172] ? svc_printk+0x190/0x190 [sunrpc] [ 119.277386] ? svc_xprt_release+0x451/0x680 [sunrpc] [ 119.278622] nfsd+0x2b9/0x430 [nfsd] [ 119.279771] ? nfsd_destroy+0x1c0/0x1c0 [nfsd] [ 119.281157] kthread+0x2db/0x390 [ 119.282347] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 119.283756] ret_from_fork+0x35/0x40 [ 119.286041] Allocated by task 436: [ 119.287525] kasan_kmalloc+0xa0/0xd0 [ 119.288685] kmem_cache_alloc+0xe9/0x1f0 [ 119.289900] get_empty_filp+0x7b/0x410 [ 119.291037] path_openat+0xca/0x4220 [ 119.292242] do_filp_open+0x182/0x280 [ 119.293411] do_sys_open+0x216/0x360 [ 119.294555] do_syscall_64+0xa0/0x2f0 [ 119.295721] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 119.298068] Freed by task 436: [ 119.299271] __kasan_slab_free+0x130/0x180 [ 119.300557] kmem_cache_free+0x78/0x210 [ 119.301823] rcu_process_callbacks+0x35b/0xbd0 [ 119.303162] __do_softirq+0x192/0x5ea [ 119.305443] The buggy address belongs to the object at ffff880113ada000 which belongs to the cache filp of size 256 [ 119.308556] The buggy address is located 0 bytes inside of 256-byte region [ffff880113ada000, ffff880113ada100) [ 119.311376] The buggy address belongs to the page: [ 119.312728] page:ffffea00044eb680 count:1 mapcount:0 mapping:0000000000000000 index:0xffff880113ada780 [ 119.314428] flags: 0x17ffe000000100(slab) [ 119.315740] raw: 0017ffe000000100 0000000000000000 ffff880113ada780 00000001000c0001 [ 119.317379] raw: ffffea0004553c60 ffffea00045c11e0 ffff88011b167e00 0000000000000000 [ 119.319050] page dumped because: kasan: bad access detected [ 119.321652] Memory state around the buggy address: [ 119.322993] ffff880113ad9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 119.324515] ffff880113ad9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 119.326087] >ffff880113ada000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 119.327547] ^ [ 119.328730] ffff880113ada080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 119.330218] ffff880113ada100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 119.331740] ================================================================== Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28exec: avoid gcc-8 warning for get_task_commArnd Bergmann
commit 3756f6401c302617c5e091081ca4d26ab604bec5 upstream. gcc-8 warns about using strncpy() with the source size as the limit: fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess] This is indeed slightly suspicious, as it protects us from source arguments without NUL-termination, but does not guarantee that the destination is terminated. This keeps the strncpy() to ensure we have properly padded target buffer, but ensures that we use the correct length, by passing the actual length of the destination buffer as well as adding a build-time check to ensure it is exactly TASK_COMM_LEN. There are only 23 callsites which I all reviewed to ensure this is currently the case. We could get away with doing only the check or passing the right length, but it doesn't hurt to do both. Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Serge Hallyn <serge@hallyn.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Aleksa Sarai <asarai@suse.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-25fat: fix memory allocation failure handling of match_strdup()OGAWA Hirofumi
commit 35033ab988c396ad7bce3b6d24060c16a9066db8 upstream. In parse_options(), if match_strdup() failed, parse_options() leaves opts->iocharset in unexpected state (i.e. still pointing the freed string). And this can be the cause of double free. To fix, this initialize opts->iocharset always when freeing. Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22reiserfs: fix buffer overflow with long warning messagesEric Biggers
commit fe10e398e860955bac4d28ec031b701d358465e4 upstream. ReiserFS prepares log messages into a 1024-byte buffer with no bounds checks. Long messages, such as the "unknown mount option" warning when userspace passes a crafted mount options string, overflow this buffer. This causes KASAN to report a global-out-of-bounds write. Fix it by truncating messages to the buffer size. Link: http://lkml.kernel.org/r/20180707203621.30922-1-ebiggers3@gmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+b890b3335a4d8c608963@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22ocfs2: ip_alloc_sem should be taken in ocfs2_get_block()alex chen
commit 3e4c56d41eef5595035872a2ec5a483f42e8917f upstream. ip_alloc_sem should be taken in ocfs2_get_block() when reading file in DIRECT mode to prevent concurrent access to extent tree with ocfs2_dio_end_io_write(), which may cause BUGON in the following situation: read file 'A' end_io of writing file 'A' vfs_read __vfs_read ocfs2_file_read_iter generic_file_read_iter ocfs2_direct_IO __blockdev_direct_IO do_blockdev_direct_IO do_direct_IO get_more_blocks ocfs2_get_block ocfs2_extent_map_get_blocks ocfs2_get_clusters ocfs2_get_clusters_nocache() ocfs2_search_extent_list return the index of record which contains the v_cluster, that is v_cluster > rec[i]->e_cpos. ocfs2_dio_end_io ocfs2_dio_end_io_write down_write(&oi->ip_alloc_sem); ocfs2_mark_extent_written ocfs2_change_extent_flag ocfs2_split_extent ... --> modify the rec[i]->e_cpos, resulting in v_cluster < rec[i]->e_cpos. BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)) [alex.chen@huawei.com: v3] Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com Fixes: c15471f79506 ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Gang He <ghe@suse.com> Acked-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22ocfs2: subsystem.su_mutex is required while accessing the item->ci_parentalex chen
commit 853bc26a7ea39e354b9f8889ae7ad1492ffa28d2 upstream. The subsystem.su_mutex is required while accessing the item->ci_parent, otherwise, NULL pointer dereference to the item->ci_parent will be triggered in the following situation: add node delete node sys_write vfs_write configfs_write_file o2nm_node_store o2nm_node_local_write do_rmdir vfs_rmdir configfs_rmdir mutex_lock(&subsys->su_mutex); unlink_obj item->ci_group = NULL; item->ci_parent = NULL; to_o2nm_cluster_from_node node->nd_item.ci_parent->ci_parent BUG since of NULL pointer dereference to nd_item.ci_parent Moreover, the o2nm_cluster also should be protected by the subsystem.su_mutex. [alex.chen@huawei.com: v2] Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-17fs, elf: make sure to page align bss in load_elf_libraryOscar Salvador
commit 24962af7e1041b7e50c1bc71d8d10dc678c556b5 upstream. The current code does not make sure to page align bss before calling vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to the requested lenght not being correctly aligned. Let us make sure to align it properly. Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured for libc5. Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-17Fix up non-directory creation in SGID directoriesLinus Torvalds
commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7 upstream. sgid directories have special semantics, making newly created files in the directory belong to the group of the directory, and newly created subdirectories will also become sgid. This is historically used for group-shared directories. But group directories writable by non-group members should not imply that such non-group members can magically join the group, so make sure to clear the sgid bit on non-directories for non-members (but remember that sgid without group execute means "mandatory locking", just to confuse things even more). Reported-by: Jann Horn <jannh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: check superblock mapped prior to committingJon Derrick
commit a17712c8e4be4fa5404d20e9cd3b2b21eae7bc56 upstream. This patch attempts to close a hole leading to a BUG seen with hot removals during writes [1]. A block device (NVME namespace in this test case) is formatted to EXT4 without partitions. It's mounted and write I/O is run to a file, then the device is hot removed from the slot. The superblock attempts to be written to the drive which is no longer present. The typical chain of events leading to the BUG: ext4_commit_super() __sync_dirty_buffer() submit_bh() submit_bh_wbc() BUG_ON(!buffer_mapped(bh)); This fix checks for the superblock's buffer head being mapped prior to syncing. [1] https://www.spinics.net/lists/linux-ext4/msg56527.html Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: add more mount time checks of the superblockTheodore Ts'o
commit bfe0a5f47ada40d7984de67e59a7d3390b9b9ecc upstream. The kernel's ext4 mount-time checks were more permissive than e2fsprogs's libext2fs checks when opening a file system. The superblock is considered too insane for debugfs or e2fsck to operate on it, the kernel has no business trying to mount it. This will make file system fuzzing tools work harder, but the failure cases that they find will be more useful and be easier to evaluate. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: add more inode number paranoia checksTheodore Ts'o
commit c37e9e013469521d9adb932d17a1795c139b36db upstream. If there is a directory entry pointing to a system inode (such as a journal inode), complain and declare the file system to be corrupted. Also, if the superblock's first inode number field is too small, refuse to mount the file system. This addresses CVE-2018-10882. https://bugzilla.kernel.org/show_bug.cgi?id=200069 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: clear i_data in ext4_inode_info when removing inline dataTheodore Ts'o
commit 6e8ab72a812396996035a37e5ca4b3b99b5d214b upstream. When converting from an inode from storing the data in-line to a data block, ext4_destroy_inline_data_nolock() was only clearing the on-disk copy of the i_blocks[] array. It was not clearing copy of the i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually used by ext4_map_blocks(). This didn't matter much if we are using extents, since the extents header would be invalid and thus the extents could would re-initialize the extents tree. But if we are using indirect blocks, the previous contents of the i_blocks array will be treated as block numbers, with potentially catastrophic results to the file system integrity and/or user data. This gets worse if the file system is using a 1k block size and s_first_data is zero, but even without this, the file system can get quite badly corrupted. This addresses CVE-2018-10881. https://bugzilla.kernel.org/show_bug.cgi?id=200015 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: include the illegal physical block in the bad map ext4_error msgTheodore Ts'o
commit bdbd6ce01a70f02e9373a584d0ae9538dcf0a121 upstream. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: verify the depth of extent tree in ext4_find_extent()Theodore Ts'o
commit bc890a60247171294acc0bd67d211fa4b88d40ba upstream. If there is a corupted file system where the claimed depth of the extent tree is -1, this can cause a massive buffer overrun leading to sadness. This addresses CVE-2018-10877. https://bugzilla.kernel.org/show_bug.cgi?id=199417 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: only look at the bg_flags field if it is validTheodore Ts'o
commit 8844618d8aa7a9973e7b527d038a2a589665002c upstream. The bg_flags field in the block group descripts is only valid if the uninit_bg or metadata_csum feature is enabled. We were not consistently looking at this field; fix this. Also block group #0 must never have uninitialized allocation bitmaps, or need to be zeroed, since that's where the root inode, and other special inodes are set up. Check for these conditions and mark the file system as corrupted if they are detected. This addresses CVE-2018-10876. https://bugzilla.kernel.org/show_bug.cgi?id=199403 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: always check block group bounds in ext4_init_block_bitmap()Theodore Ts'o
commit 819b23f1c501b17b9694325471789e6b5cc2d0d2 upstream. Regardless of whether the flex_bg feature is set, we should always check to make sure the bits we are setting in the block bitmap are within the block group bounds. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: make sure bitmaps and the inode table don't overlap with bg descriptorsTheodore Ts'o
commit 77260807d1170a8cf35dbb06e07461a655f67eee upstream. It's really bad when the allocation bitmaps and the inode table overlap with the block group descriptors, since it causes random corruption of the bg descriptors. So we really want to head those off at the pass. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11jbd2: don't mark block as modified if the handle is out of creditsTheodore Ts'o
commit e09463f220ca9a1a1ecfda84fcda658f99a1f12a upstream. Do not set the b_modified flag in block's journal head should not until after we're sure that jbd2_journal_dirty_metadat() will not abort with an error due to there not being enough space reserved in the jbd2 handle. Otherwise, future attempts to modify the buffer may lead a large number of spurious errors and warnings. This addresses CVE-2018-10883. https://bugzilla.kernel.org/show_bug.cgi?id=200071 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11cifs: Fix infinite loop when using hard mount optionPaulo Alcantara
commit 7ffbe65578b44fafdef577a360eb0583929f7c6e upstream. For every request we send, whether it is SMB1 or SMB2+, we attempt to reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying out the request. So, while server->tcpStatus != CifsNeedReconnect, we wait for the reconnection to succeed on wait_event_interruptible_timeout(). If it returns, that means that either the condition was evaluated to true, or timeout elapsed, or it was interrupted by a signal. Since we're not handling the case where the process woke up due to a received signal (-ERESTARTSYS), the next call to wait_event_interruptible_timeout() will _always_ fail and we end up looping forever inside either cifs_reconnect_tcon() or smb2_reconnect(). Here's an example of how to trigger that: $ mount.cifs //foo/share /mnt/test -o username=foo,password=foo,vers=1.0,hard (break connection to server before executing bellow cmd) $ stat -f /mnt/test & sleep 140 [1] 2511 $ ps -aux -q 2511 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2511 0.0 0.0 12892 1008 pts/0 S 12:24 0:00 stat -f /mnt/test $ kill -9 2511 (wait for a while; process is stuck in the kernel) $ ps -aux -q 2511 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2511 83.2 0.0 12892 1008 pts/0 R 12:24 30:01 stat -f /mnt/test By using 'hard' mount point means that cifs.ko will keep retrying indefinitely, however we must allow the process to be killed otherwise it would hang the system. Signed-off-by: Paulo Alcantara <palcantara@suse.de> Cc: stable@vger.kernel.org Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03udf: Detect incorrect directory sizeJan Kara
commit fa65653e575fbd958bdf5fb9c4a71a324e39510d upstream. Detect when a directory entry is (possibly partially) beyond directory size and return EIO in that case since it means the filesystem is corrupted. Otherwise directory operations can further corrupt the directory and possibly also oops the kernel. CC: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> CC: stable@vger.kernel.org Reported-and-tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03NFSv4: Revert commit 5f83d86cf531d ("NFSv4.x: Fix wraparound issues..")Trond Myklebust
commit fc40724fc6731d90cc7fb6d62d66135f85a33dd2 upstream. The correct behaviour for NFSv4 sequence IDs is to wrap around to the value 0 after 0xffffffff. See https://tools.ietf.org/html/rfc5661#section-2.10.6.1 Fixes: 5f83d86cf531d ("NFSv4.x: Fix wraparound issues when validing...") Cc: stable@vger.kernel.org # 4.6+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03NFSv4: Fix possible 1-byte stack overflow in nfs_idmap_read_and_verify_messageDave Wysochanski
commit d68894800ec5712d7ddf042356f11e36f87d7f78 upstream. In nfs_idmap_read_and_verify_message there is an incorrect sprintf '%d' that converts the __u32 'im_id' from struct idmap_msg to 'id_str', which is a stack char array variable of length NFS_UINT_MAXLEN == 11. If a uid or gid value is > 2147483647 = 0x7fffffff, the conversion overflows into a negative value, for example: crash> p (unsigned) (0x80000000) $1 = 2147483648 crash> p (signed) (0x80000000) $2 = -2147483648 The '-' sign is written to the buffer and this causes a 1 byte overflow when the NULL byte is written, which corrupts kernel stack memory. If CONFIG_CC_STACKPROTECTOR_STRONG is set we see a stack-protector panic: [11558053.616565] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa05b8a8c [11558053.639063] CPU: 6 PID: 9423 Comm: rpc.idmapd Tainted: G W ------------ T 3.10.0-514.el7.x86_64 #1 [11558053.641990] Hardware name: Red Hat OpenStack Compute, BIOS 1.10.2-3.el7_4.1 04/01/2014 [11558053.644462] ffffffff818c7bc0 00000000b1f3aec1 ffff880de0f9bd48 ffffffff81685eac [11558053.646430] ffff880de0f9bdc8 ffffffff8167f2b3 ffffffff00000010 ffff880de0f9bdd8 [11558053.648313] ffff880de0f9bd78 00000000b1f3aec1 ffffffff811dcb03 ffffffffa05b8a8c [11558053.650107] Call Trace: [11558053.651347] [<ffffffff81685eac>] dump_stack+0x19/0x1b [11558053.653013] [<ffffffff8167f2b3>] panic+0xe3/0x1f2 [11558053.666240] [<ffffffff811dcb03>] ? kfree+0x103/0x140 [11558053.682589] [<ffffffffa05b8a8c>] ? idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4] [11558053.689710] [<ffffffff810855db>] __stack_chk_fail+0x1b/0x30 [11558053.691619] [<ffffffffa05b8a8c>] idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4] [11558053.693867] [<ffffffffa00209d6>] rpc_pipe_write+0x56/0x70 [sunrpc] [11558053.695763] [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0 [11558053.702236] [<ffffffff810acccc>] ? task_work_run+0xac/0xe0 [11558053.704215] [<ffffffff811fec4f>] SyS_write+0x7f/0xe0 [11558053.709674] [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b Fix this by calling the internally defined nfs_map_numeric_to_string() function which properly uses '%u' to convert this __u32. For consistency, also replace the one other place where snprintf is called. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Stephen Johnston <sjohnsto@redhat.com> Fixes: cf4ab538f1516 ("NFSv4: Fix the string length returned by the idmapper") Cc: stable@vger.kernel.org # v3.4+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03nfsd: restrict rd_maxcount to svc_max_payload in nfsd_encode_readdirScott Mayhew
commit 9c2ece6ef67e9d376f32823086169b489c422ed0 upstream. nfsd4_readdir_rsize restricts rd_maxcount to svc_max_payload when estimating the size of the readdir reply, but nfsd_encode_readdir restricts it to INT_MAX when encoding the reply. This can result in log messages like "kernel: RPC request reserved 32896 but used 1049444". Restrict rd_dircount similarly (no reason it should be larger than svc_max_payload). Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>