diff options
Diffstat (limited to 'Documentation')
28 files changed, 797 insertions, 222 deletions
diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt index 3c4e07123e59..d568bc235bc0 100644 --- a/Documentation/DMA-API-HOWTO.txt +++ b/Documentation/DMA-API-HOWTO.txt @@ -738,17 +738,17 @@ to "Closing". CONFIG_NEED_SG_DMA_LENGTH if the architecture supports IOMMUs (including software IOMMU). -2) ARCH_KMALLOC_MINALIGN +2) ARCH_DMA_MINALIGN Architectures must ensure that kmalloc'ed buffer is DMA-safe. Drivers and subsystems depend on it. If an architecture isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in the CPU cache is identical to data in main memory), - ARCH_KMALLOC_MINALIGN must be set so that the memory allocator + ARCH_DMA_MINALIGN must be set so that the memory allocator makes sure that kmalloc'ed buffer doesn't share a cache line with the others. See arch/arm/include/asm/cache.h as an example. - Note that ARCH_KMALLOC_MINALIGN is about DMA memory alignment + Note that ARCH_DMA_MINALIGN is about DMA memory alignment constraints. You don't need to worry about the architecture data alignment constraints (e.g. the alignment constraints about 64-bit objects). diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl index ecd35e9d4410..feca0758391e 100644 --- a/Documentation/DocBook/device-drivers.tmpl +++ b/Documentation/DocBook/device-drivers.tmpl @@ -46,7 +46,6 @@ <sect1><title>Atomic and pointer manipulation</title> !Iarch/x86/include/asm/atomic.h -!Iarch/x86/include/asm/unaligned.h </sect1> <sect1><title>Delaying, scheduling, and timer routines</title> diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index a20c6f6fffc3..6899f471fb15 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -57,7 +57,6 @@ </para> <sect1><title>String Conversions</title> -!Ilib/vsprintf.c !Elib/vsprintf.c </sect1> <sect1><title>String Manipulation</title> diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl index 084f6ad7b7a0..a0d479d1e1dd 100644 --- a/Documentation/DocBook/kernel-locking.tmpl +++ b/Documentation/DocBook/kernel-locking.tmpl @@ -1922,9 +1922,12 @@ machines due to caching. <function>mutex_lock()</function> </para> <para> - There is a <function>mutex_trylock()</function> which can be - used inside interrupt context, as it will not sleep. + There is a <function>mutex_trylock()</function> which does not + sleep. Still, it must not be used inside interrupt context since + its implementation is not safe for that. <function>mutex_unlock()</function> will also never sleep. + It cannot be used in interrupt context either since a mutex + must be released by the same task that acquired it. </para> </listitem> </itemizedlist> @@ -1958,6 +1961,12 @@ machines due to caching. </sect1> </chapter> + <chapter id="apiref"> + <title>Mutex API reference</title> +!Iinclude/linux/mutex.h +!Ekernel/mutex.c + </chapter> + <chapter id="references"> <title>Further reading</title> diff --git a/Documentation/DocBook/tracepoint.tmpl b/Documentation/DocBook/tracepoint.tmpl index e8473eae2a20..b57a9ede3224 100644 --- a/Documentation/DocBook/tracepoint.tmpl +++ b/Documentation/DocBook/tracepoint.tmpl @@ -104,4 +104,9 @@ <title>Block IO</title> !Iinclude/trace/events/block.h </chapter> + + <chapter id="workqueue"> + <title>Workqueue</title> +!Iinclude/trace/events/workqueue.h + </chapter> </book> diff --git a/Documentation/acpi/method-customizing.txt b/Documentation/acpi/method-customizing.txt index e628cd23ca80..3e1d25aee3fb 100644 --- a/Documentation/acpi/method-customizing.txt +++ b/Documentation/acpi/method-customizing.txt @@ -19,6 +19,8 @@ Note: Only ACPI METHOD can be overridden, any other object types like "Device", "OperationRegion", are not recognized. Note: The same ACPI control method can be overridden for many times, and it's always the latest one that used by Linux/kernel. +Note: To get the ACPI debug object output (Store (AAAA, Debug)), + please run "echo 1 > /sys/module/acpi/parameters/aml_debug_output". 1. override an existing method a) get the ACPI table via ACPI sysfs I/F. e.g. to get the DSDT, diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt new file mode 100644 index 000000000000..e578feed6d81 --- /dev/null +++ b/Documentation/block/cfq-iosched.txt @@ -0,0 +1,45 @@ +CFQ ioscheduler tunables +======================== + +slice_idle +---------- +This specifies how long CFQ should idle for next request on certain cfq queues +(for sequential workloads) and service trees (for random workloads) before +queue is expired and CFQ selects next queue to dispatch from. + +By default slice_idle is a non-zero value. That means by default we idle on +queues/service trees. This can be very helpful on highly seeky media like +single spindle SATA/SAS disks where we can cut down on overall number of +seeks and see improved throughput. + +Setting slice_idle to 0 will remove all the idling on queues/service tree +level and one should see an overall improved throughput on faster storage +devices like multiple SATA/SAS disks in hardware RAID configuration. The down +side is that isolation provided from WRITES also goes down and notion of +IO priority becomes weaker. + +So depending on storage and workload, it might be useful to set slice_idle=0. +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks +keeping slice_idle enabled should be useful. For any configurations where +there are multiple spindles behind single LUN (Host based hardware RAID +controller or for storage arrays), setting slice_idle=0 might end up in better +throughput and acceptable latencies. + +CFQ IOPS Mode for group scheduling +=================================== +Basic CFQ design is to provide priority based time slices. Higher priority +process gets bigger time slice and lower priority process gets smaller time +slice. Measuring time becomes harder if storage is fast and supports NCQ and +it would be better to dispatch multiple requests from multiple cfq queues in +request queue at a time. In such scenario, it is not possible to measure time +consumed by single queue accurately. + +What is possible though is to measure number of requests dispatched from a +single queue and also allow dispatch from multiple cfq queue at the same time. +This effectively becomes the fairness in terms of IOPS (IO operations per +second). + +If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches +to IOPS mode and starts providing fairness in terms of number of requests +dispatched. Note that this mode switching takes effect only for group +scheduling. For non-cgroup users nothing should change. diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt index 48e0b21b0059..6919d62591d9 100644 --- a/Documentation/cgroups/blkio-controller.txt +++ b/Documentation/cgroups/blkio-controller.txt @@ -217,6 +217,7 @@ Details of cgroup files CFQ sysfs tunable ================= /sys/block/<disk>/queue/iosched/group_isolation +----------------------------------------------- If group_isolation=1, it provides stronger isolation between groups at the expense of throughput. By default group_isolation is 0. In general that @@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient and one wants stronger isolation between groups, then set group_isolation=1 but this will come at cost of reduced throughput. +/sys/block/<disk>/queue/iosched/slice_idle +------------------------------------------ +On a faster hardware CFQ can be slow, especially with sequential workload. +This happens because CFQ idles on a single queue and single queue might not +drive deeper request queue depths to keep the storage busy. In such scenarios +one can try setting slice_idle=0 and that would switch CFQ to IOPS +(IO operations per second) mode on NCQ supporting hardware. + +That means CFQ will not idle between cfq queues of a cfq group and hence be +able to driver higher queue depth and achieve better throughput. That also +means that cfq provides fairness among groups in terms of IOPS and not in +terms of disk time. + +/sys/block/<disk>/queue/iosched/group_idle +------------------------------------------ +If one disables idling on individual cfq queues and cfq service trees by +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle +on the group in an attempt to provide fairness among groups. + +By default group_idle is same as slice_idle and does not do anything if +slice_idle is enabled. + +One can experience an overall throughput drop if you have created multiple +groups and put applications in that group which are not driving enough +IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle +on individual groups and throughput should improve. + What works ========== - Currently only sync IO queues are support. All the buffered writes are diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt index d96a6dba5748..9633da01ff46 100644 --- a/Documentation/gpio.txt +++ b/Documentation/gpio.txt @@ -109,17 +109,19 @@ use numbers 2000-2063 to identify GPIOs in a bank of I2C GPIO expanders. If you want to initialize a structure with an invalid GPIO number, use some negative number (perhaps "-EINVAL"); that will never be valid. To -test if a number could reference a GPIO, you may use this predicate: +test if such number from such a structure could reference a GPIO, you +may use this predicate: int gpio_is_valid(int number); A number that's not valid will be rejected by calls which may request or free GPIOs (see below). Other numbers may also be rejected; for -example, a number might be valid but unused on a given board. - -Whether a platform supports multiple GPIO controllers is currently a -platform-specific implementation issue. +example, a number might be valid but temporarily unused on a given board. +Whether a platform supports multiple GPIO controllers is a platform-specific +implementation issue, as are whether that support can leave "holes" in the space +of GPIO numbers, and whether new controllers can be added at runtime. Such issues +can affect things including whether adjacent GPIO numbers are both valid. Using GPIOs ----------- @@ -480,12 +482,16 @@ To support this framework, a platform's Kconfig will "select" either ARCH_REQUIRE_GPIOLIB or ARCH_WANT_OPTIONAL_GPIOLIB and arrange that its <asm/gpio.h> includes <asm-generic/gpio.h> and defines three functions: gpio_get_value(), gpio_set_value(), and gpio_cansleep(). -They may also want to provide a custom value for ARCH_NR_GPIOS. -ARCH_REQUIRE_GPIOLIB means that the gpio-lib code will always get compiled +It may also provide a custom value for ARCH_NR_GPIOS, so that it better +reflects the number of GPIOs in actual use on that platform, without +wasting static table space. (It should count both built-in/SoC GPIOs and +also ones on GPIO expanders. + +ARCH_REQUIRE_GPIOLIB means that the gpiolib code will always get compiled into the kernel on that architecture. -ARCH_WANT_OPTIONAL_GPIOLIB means the gpio-lib code defaults to off and the user +ARCH_WANT_OPTIONAL_GPIOLIB means the gpiolib code defaults to off and the user can enable it and build it into the kernel optionally. If neither of these options are selected, the platform does not support diff --git a/Documentation/hwmon/emc2103 b/Documentation/hwmon/emc2103 new file mode 100644 index 000000000000..a12b2c127140 --- /dev/null +++ b/Documentation/hwmon/emc2103 @@ -0,0 +1,33 @@ +Kernel driver emc2103 +====================== + +Supported chips: + * SMSC EMC2103 + Addresses scanned: I2C 0x2e + Prefix: 'emc2103' + Datasheet: Not public + +Authors: + Steve Glendinning <steve.glendinning@smsc.com> + +Description +----------- + +The Standard Microsystems Corporation (SMSC) EMC2103 chips +contain up to 4 temperature sensors and a single fan controller. + +Fan rotation speeds are reported in RPM (rotations per minute). An alarm is +triggered if the rotation speed has dropped below a programmable limit. Fan +readings can be divided by a programmable divider (1, 2, 4 or 8) to give +the readings more range or accuracy. Not all RPM values can accurately be +represented, so some rounding is done. With a divider of 1, the lowest +representable value is 480 RPM. + +This driver supports RPM based control, to use this a fan target +should be written to fan1_target and pwm1_enable should be set to 3. + +The 2103-2 and 2103-4 variants have a third temperature sensor, which can +be connected to two anti-parallel diodes. These values can be read +as temp3 and temp4. If only one diode is attached to this channel, temp4 +will show as "fault". The module parameter "apd=0" can be used to suppress +this 4th channel when anti-parallel diodes are not fitted. diff --git a/Documentation/hwmon/f71882fg b/Documentation/hwmon/f71882fg index 1a07fd674cd0..a7952c2bd959 100644 --- a/Documentation/hwmon/f71882fg +++ b/Documentation/hwmon/f71882fg @@ -2,10 +2,6 @@ Kernel driver f71882fg ====================== Supported chips: - * Fintek F71808E - Prefix: 'f71808fg' - Addresses scanned: none, address read from Super I/O config space - Datasheet: Not public * Fintek F71858FG Prefix: 'f71858fg' Addresses scanned: none, address read from Super I/O config space diff --git a/Documentation/hwmon/ltc4245 b/Documentation/hwmon/ltc4245 index 86b5880d8502..b478b0864965 100644 --- a/Documentation/hwmon/ltc4245 +++ b/Documentation/hwmon/ltc4245 @@ -72,9 +72,31 @@ in6_min_alarm 5v output undervoltage alarm in7_min_alarm 3v output undervoltage alarm in8_min_alarm Vee (-12v) output undervoltage alarm -in9_input GPIO voltage data +in9_input GPIO voltage data (see note 1) +in10_input GPIO voltage data (see note 1) +in11_input GPIO voltage data (see note 1) power1_input 12v power usage (mW) power2_input 5v power usage (mW) power3_input 3v power usage (mW) power4_input Vee (-12v) power usage (mW) + + +Note 1 +------ + +If you have NOT configured the driver to sample all GPIO pins as analog +voltages, then the in10_input and in11_input sysfs attributes will not be +created. The driver will sample the GPIO pin that is currently connected to the +ADC as an analog voltage, and report the value in in9_input. + +If you have configured the driver to sample all GPIO pins as analog voltages, +then they will be sampled in round-robin fashion. If userspace reads too +slowly, -EAGAIN will be returned when you read the sysfs attribute containing +the sensor reading. + +The LTC4245 chip can be configured to sample all GPIO pins with two methods: +1) platform data -- see include/linux/i2c/ltc4245.h +2) OF device tree -- add the "ltc4245,use-extra-gpios" property to each chip + +The default mode of operation is to sample a single GPIO pin. diff --git a/Documentation/hwmon/pc87427 b/Documentation/hwmon/pc87427 index db5cc1227a83..8fdd08c9e48b 100644 --- a/Documentation/hwmon/pc87427 +++ b/Documentation/hwmon/pc87427 @@ -18,10 +18,11 @@ Description The National Semiconductor Super I/O chip includes complete hardware monitoring capabilities. It can monitor up to 18 voltages, 8 fans and -6 temperature sensors. Only the fans are supported at the moment. +6 temperature sensors. Only the fans and temperatures are supported at +the moment, voltages aren't. -This chip also has fan controlling features, which are not yet supported -by this driver either. +This chip also has fan controlling features (up to 4 PWM outputs), +which are partly supported by this driver. The driver assumes that no more than one chip is present, which seems reasonable. @@ -36,3 +37,23 @@ signal. Speeds down to 83 RPM can be measured. An alarm is triggered if the rotation speed drops below a programmable limit. Another alarm is triggered if the speed is too low to be measured (including stalled or missing fan). + + +Fan Speed Control +----------------- + +Fan speed can be controlled by PWM outputs. There are 4 possible modes: +always off, always on, manual and automatic. The latter isn't supported +by the driver: you can only return to that mode if it was the original +setting, and the configuration interface is missing. + + +Temperature Monitoring +---------------------- + +The PC87427 relies on external sensors (following the SensorPath +standard), so the resolution and range depend on the type of sensor +connected. The integer part can be 8-bit or 9-bit, and can be signed or +not. I couldn't find a way to figure out the external sensor data +temperature format, so user-space adjustment (typically by a factor 2) +may be required. diff --git a/Documentation/hwmon/sysfs-interface b/Documentation/hwmon/sysfs-interface index d4e2917c6f18..48ceabedf55d 100644 --- a/Documentation/hwmon/sysfs-interface +++ b/Documentation/hwmon/sysfs-interface @@ -91,12 +91,11 @@ name The chip name. I2C devices get this attribute created automatically. RO -update_rate The rate at which the chip will update readings. +update_interval The interval at which the chip will update readings. Unit: millisecond RW - Some devices have a variable update rate. This attribute - can be used to change the update rate to the desired - frequency. + Some devices have a variable update rate or interval. + This attribute can be used to change it to the desired value. ************ @@ -107,10 +106,24 @@ in[0-*]_min Voltage min value. Unit: millivolt RW +in[0-*]_lcrit Voltage critical min value. + Unit: millivolt + RW + If voltage drops to or below this limit, the system may + take drastic action such as power down or reset. At the very + least, it should report a fault. + in[0-*]_max Voltage max value. Unit: millivolt RW +in[0-*]_crit Voltage critical max value. + Unit: millivolt + RW + If voltage reaches or exceeds this limit, the system may + take drastic action such as power down or reset. At the very + least, it should report a fault. + in[0-*]_input Voltage input value. Unit: millivolt RO @@ -284,7 +297,7 @@ temp[1-*]_input Temperature input value. Unit: millidegree Celsius RO -temp[1-*]_crit Temperature critical value, typically greater than +temp[1-*]_crit Temperature critical max value, typically greater than corresponding temp_max values. Unit: millidegree Celsius RW @@ -296,6 +309,11 @@ temp[1-*]_crit_hyst from the critical value. RW +temp[1-*]_lcrit Temperature critical min value, typically lower than + corresponding temp_min values. + Unit: millidegree Celsius + RW + temp[1-*]_offset Temperature offset which is added to the temperature reading by the chip. @@ -344,9 +362,6 @@ Also see the Alarms section for status flags associated with temperatures. * Currents * ************ -Note that no known chip provides current measurements as of writing, -so this part is theoretical, so to say. - curr[1-*]_max Current max value Unit: milliampere RW @@ -471,6 +486,7 @@ limit-related alarms, not both. The driver should just reflect the hardware implementation. in[0-*]_alarm +curr[1-*]_alarm fan[1-*]_alarm temp[1-*]_alarm Channel alarm @@ -482,6 +498,8 @@ OR in[0-*]_min_alarm in[0-*]_max_alarm +curr[1-*]_min_alarm +curr[1-*]_max_alarm fan[1-*]_min_alarm fan[1-*]_max_alarm temp[1-*]_min_alarm @@ -497,7 +515,6 @@ to notify open diodes, unconnected fans etc. where the hardware supports it. When this boolean has value 1, the measurement for that channel should not be trusted. -in[0-*]_fault fan[1-*]_fault temp[1-*]_fault Input fault condition @@ -513,6 +530,7 @@ beep_enable Master beep enable RW in[0-*]_beep +curr[1-*]_beep fan[1-*]_beep temp[1-*]_beep Channel beep diff --git a/Documentation/hwmon/w83627ehf b/Documentation/hwmon/w83627ehf index b7e42ec4b26b..13d556112fc0 100644 --- a/Documentation/hwmon/w83627ehf +++ b/Documentation/hwmon/w83627ehf @@ -20,6 +20,10 @@ Supported chips: Prefix: 'w83667hg' Addresses scanned: ISA address retrieved from Super I/O registers Datasheet: not available + * Winbond W83667HG-B + Prefix: 'w83667hg' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: Available from Nuvoton upon request Authors: Jean Delvare <khali@linux-fr.org> @@ -32,8 +36,8 @@ Description ----------- This driver implements support for the Winbond W83627EHF, W83627EHG, -W83627DHG, W83627DHG-P and W83667HG super I/O chips. We will refer to them -collectively as Winbond chips. +W83627DHG, W83627DHG-P, W83667HG and W83667HG-B super I/O chips. +We will refer to them collectively as Winbond chips. The chips implement three temperature sensors, five fan rotation speed sensors, ten analog voltage sensors (only nine for the 627DHG), one @@ -68,14 +72,15 @@ follows: temp1 -> pwm1 temp2 -> pwm2 temp3 -> pwm3 -prog -> pwm4 (not on 667HG; the programmable setting is not supported by - the driver) +prog -> pwm4 (not on 667HG and 667HG-B; the programmable setting is not + supported by the driver) /sys files ---------- name - this is a standard hwmon device entry. For the W83627EHF and W83627EHG, - it is set to "w83627ehf" and for the W83627DHG it is set to "w83627dhg" + it is set to "w83627ehf", for the W83627DHG it is set to "w83627dhg", + and for the W83667HG it is set to "w83667hg". pwm[1-4] - this file stores PWM duty cycle or DC value (fan speed) in range: 0 (stop) to 255 (full) diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index c375313cb128..c787ae512120 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -45,7 +45,6 @@ This document describes the Linux kernel Makefiles. --- 7.1 header-y --- 7.2 objhdr-y --- 7.3 destination-y - --- 7.4 unifdef-y (deprecated) === 8 Kbuild Variables === 9 Makefile language @@ -1245,11 +1244,6 @@ See subsequent chapter for the syntax of the Kbuild file. will be located in the directory "include/linux" when exported. - --- 7.4 unifdef-y (deprecated) - - unifdef-y is deprecated. A direct replacement is header-y. - - === 8 Kbuild Variables The top Makefile exports the following variables: diff --git a/Documentation/kernel-doc-nano-HOWTO.txt b/Documentation/kernel-doc-nano-HOWTO.txt index 27a52b35d55b..3d8a97747f77 100644 --- a/Documentation/kernel-doc-nano-HOWTO.txt +++ b/Documentation/kernel-doc-nano-HOWTO.txt @@ -345,5 +345,10 @@ documentation, in <filename>, for the functions listed. section titled <section title> from <filename>. Spaces are allowed in <section title>; do not quote the <section title>. +!C<filename> is replaced by nothing, but makes the tools check that +all DOC: sections and documented functions, symbols, etc. are used. +This makes sense to use when you use !F/!P only and want to verify +that all documentation is included. + Tim. */ <twaugh@redhat.com> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 873b68090098..8dd7248508a9 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -88,8 +88,8 @@ parameter is applicable: RAM RAM disk support is enabled. S390 S390 architecture is enabled. SCSI Appropriate SCSI support is enabled. - A lot of drivers has their options described inside of - Documentation/scsi/. + A lot of drivers have their options described inside + the Documentation/scsi/ sub-directory. SECURITY Different security models are enabled. SELINUX SELinux support is enabled. APPARMOR AppArmor support is enabled. @@ -284,27 +284,12 @@ and is between 256 and 4096 characters. It is defined in the file add_efi_memmap [EFI; X86] Include EFI memory map in kernel's map of available physical RAM. - advansys= [HW,SCSI] - See header of drivers/scsi/advansys.c. - agp= [AGP] { off | try_unsupported } off: disable AGP support try_unsupported: try to drive unsupported chipsets (may crash computer or cause data corruption) - aha152x= [HW,SCSI] - See Documentation/scsi/aha152x.txt. - - aha1542= [HW,SCSI] - Format: <portbase>[,<buson>,<busoff>[,<dmaspeed>]] - - aic7xxx= [HW,SCSI] - See Documentation/scsi/aic7xxx.txt. - - aic79xx= [HW,SCSI] - See Documentation/scsi/aic79xx.txt. - ALSA [HW,ALSA] See Documentation/sound/alsa/alsa-parameters.txt @@ -368,8 +353,6 @@ and is between 256 and 4096 characters. It is defined in the file atarimouse= [HW,MOUSE] Atari Mouse - atascsi= [HW,SCSI] Atari SCSI - atkbd.extra= [HW] Enable extra LEDs and keys on IBM RapidAccess, EzKey and similar keyboards @@ -419,10 +402,6 @@ and is between 256 and 4096 characters. It is defined in the file bttv.pll= See Documentation/video4linux/bttv/Insmod-options bttv.tuner= and Documentation/video4linux/bttv/CARDLIST - BusLogic= [HW,SCSI] - See drivers/scsi/BusLogic.c, comment before function - BusLogic_ParseDriverOptions(). - c101= [NET] Moxa C101 synchronous serial card cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection. @@ -671,8 +650,6 @@ and is between 256 and 4096 characters. It is defined in the file dscc4.setup= [NET] - dtc3181e= [HW,SCSI] - dynamic_printk Enables pr_debug()/dev_dbg() calls if CONFIG_DYNAMIC_PRINTK_DEBUG has been enabled. These can also be switched on/off via @@ -713,8 +690,6 @@ and is between 256 and 4096 characters. It is defined in the file This is desgined to be used in conjunction with the boot argument: earlyprintk=vga - eata= [HW,SCSI] - edd= [EDD] Format: {"off" | "on" | "skip[mbr]"} @@ -770,12 +745,6 @@ and is between 256 and 4096 characters. It is defined in the file Format: <interval>,<probability>,<space>,<times> See also /Documentation/fault-injection/. - fd_mcs= [HW,SCSI] - See header of drivers/scsi/fd_mcs.c. - - fdomain= [HW,SCSI] - See header of drivers/scsi/fdomain.c. - floppy= [HW] See Documentation/blockdev/floppy.txt. @@ -835,14 +804,9 @@ and is between 256 and 4096 characters. It is defined in the file When zero, profiling data is discarded and associated debugfs files are removed at module unload time. - gdth= [HW,SCSI] - See header of drivers/scsi/gdth.c. - gpt [EFI] Forces disk with valid GPT signature but invalid Protective MBR to be treated as GPT. - gvp11= [HW,SCSI] - hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64bit NUMA, off otherwise. @@ -931,9 +895,6 @@ and is between 256 and 4096 characters. It is defined in the file i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN capability is set. - ibmmcascsi= [HW,MCA,SCSI] IBM MicroChannel SCSI adapter - See Documentation/mca.txt. - icn= [HW,ISDN] Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]] @@ -983,9 +944,6 @@ and is between 256 and 4096 characters. It is defined in the file programs exec'd, files mmap'd for exec, and all files opened for read by uid=0. - in2000= [HW,SCSI] - See header of drivers/scsi/in2000.c. - init= [KNL] Format: <full_path> Run specified binary instead of /sbin/init as init @@ -1023,6 +981,12 @@ and is between 256 and 4096 characters. It is defined in the file result in a hardware IOTLB flush operation as opposed to batching them for performance. + intremap= [X86-64, Intel-IOMMU] + Format: { on (default) | off | nosid } + on enable Interrupt Remapping (default) + off disable Interrupt Remapping + nosid disable Source ID checking + inttest= [IA64] iomem= Disable strict checking of access to MMIO memory @@ -1063,9 +1027,6 @@ and is between 256 and 4096 characters. It is defined in the file See comment before ip2_setup() in drivers/char/ip2/ip2base.c. - ips= [HW,SCSI] Adaptec / IBM ServeRAID controller - See header of drivers/scsi/ips.c. - irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken @@ -1341,9 +1302,6 @@ and is between 256 and 4096 characters. It is defined in the file ltpc= [NET] Format: <io>,<irq>,<dma> - mac5380= [HW,SCSI] Format: - <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> - machvec= [IA64] Force the use of a particular machine-vector (machvec) in a generic kernel. Example: machvec=hpzx1_swiotlb @@ -1365,13 +1323,6 @@ and is between 256 and 4096 characters. It is defined in the file be mounted Format: <1-256> - max_luns= [SCSI] Maximum number of LUNs to probe. - Should be between 1 and 2^32-1. - - max_report_luns= - [SCSI] Maximum number of LUNs received. - Should be between 1 and 16384. - mcatest= [IA-64] mce [X86-32] Machine Check Exception @@ -1568,19 +1519,6 @@ and is between 256 and 4096 characters. It is defined in the file n2= [NET] SDL Inc. RISCom/N2 synchronous serial card - NCR_D700= [HW,SCSI] - See header of drivers/scsi/NCR_D700.c. - - ncr5380= [HW,SCSI] - - ncr53c400= [HW,SCSI] - - ncr53c400a= [HW,SCSI] - - ncr53c406a= [HW,SCSI] - - ncr53c8xx= [HW,SCSI] - netdev= [NET] Network devices parameters Format: <irq>,<io>,<mem_start>,<mem_end>,<name> Note that mem_start is often overloaded to mean @@ -1749,6 +1687,7 @@ and is between 256 and 4096 characters. It is defined in the file nointremap [X86-64, Intel-IOMMU] Do not enable interrupt remapping. + [Deprecated - use intremap=off] nointroute [IA-64] @@ -1859,10 +1798,6 @@ and is between 256 and 4096 characters. It is defined in the file OSS [HW,OSS] See Documentation/sound/oss/oss-parameters.txt - osst= [HW,SCSI] SCSI Tape Driver - Format: <buffer_size>,<write_threshold> - See also Documentation/scsi/st.txt. - panic= [KNL] Kernel behaviour on panic Format: <timeout> @@ -1895,9 +1830,6 @@ and is between 256 and 4096 characters. It is defined in the file Currently this function knows 686a and 8231 chips. Format: [spp|ps2|epp|ecp|ecpepp] - pas16= [HW,SCSI] - See header of drivers/scsi/pas16.c. - pause_on_oops= Halt all CPUs after the first oops has been printed for the specified number of seconds. This is to be used if @@ -2042,15 +1974,18 @@ and is between 256 and 4096 characters. It is defined in the file force Enable ASPM even on devices that claim not to support it. WARNING: Forcing ASPM on may cause system lockups. + pcie_ports= [PCIE] PCIe ports handling: + auto Ask the BIOS whether or not to use native PCIe services + associated with PCIe ports (PME, hot-plug, AER). Use + them only if that is allowed by the BIOS. + native Use native PCIe services associated with PCIe ports + unconditionally. + compat Treat PCIe ports as PCI-to-PCI bridges, disable the PCIe + ports driver. + pcie_pme= [PCIE,PM] Native PCIe PME signaling options: - Format: {auto|force}[,nomsi] - auto Use native PCIe PME signaling if the BIOS allows the - kernel to control PCIe config registers of root ports. - force Use native PCIe PME signaling even if the BIOS refuses - to allow the kernel to control the relevant PCIe config - registers. nomsi Do not use MSI for native PCIe PME signaling (this makes - all PCIe root ports use INTx for everything). + all PCIe root ports use INTx for all services). pcmv= [HW,PCMCIA] BadgePAD 4 @@ -2264,30 +2199,6 @@ and is between 256 and 4096 characters. It is defined in the file sched_debug [KNL] Enables verbose scheduler debug messages. - scsi_debug_*= [SCSI] - See drivers/scsi/scsi_debug.c. - - scsi_default_dev_flags= - [SCSI] SCSI default device flags - Format: <integer> - - scsi_dev_flags= [SCSI] Black/white list entry for vendor and model - Format: <vendor>:<model>:<flags> - (flags are integer value) - - scsi_logging_level= [SCSI] a bit mask of logging levels - See drivers/scsi/scsi_logging.h for bits. Also - settable via sysctl at dev.scsi.logging_level - (/proc/sys/dev/scsi/logging_level). - There is also a nice 'scsi_logging_level' script in the - S390-tools package, available for download at - http://www-128.ibm.com/developerworks/linux/linux390/s390-tools-1.5.4.html - - scsi_mod.scan= [SCSI] sync (default) scans SCSI busses as they are - discovered. async scans them in kernel threads, - allowing boot to proceed. none ignores them, expecting - user space to do the scan. - security= [SECURITY] Choose a security module to enable at boot. If this boot parameter is not specified, only the first security module asking for security registration will be @@ -2321,9 +2232,6 @@ and is between 256 and 4096 characters. It is defined in the file The parameter means the number of CPUs to show, for example 1 means boot CPU only. - sim710= [SCSI,HW] - See header of drivers/scsi/sim710.c. - simeth= [IA-64] simscsi= @@ -2395,9 +2303,6 @@ and is between 256 and 4096 characters. It is defined in the file spia_pedr= spia_peddr= - st= [HW,SCSI] SCSI tape parameters (buffers, etc.) - See Documentation/scsi/st.txt. - stacktrace [FTRACE] Enabled the stack tracer on boot up. @@ -2455,18 +2360,12 @@ and is between 256 and 4096 characters. It is defined in the file switches= [HW,M68k] - sym53c416= [HW,SCSI] - See header of drivers/scsi/sym53c416.c. - sysrq_always_enabled [KNL] Ignore sysrq setting - this boot parameter will neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. - t128= [HW,SCSI] - See header of drivers/scsi/t128.c. - tdfx= [HW,DRM] test_suspend= [SUSPEND] @@ -2503,10 +2402,6 @@ and is between 256 and 4096 characters. It is defined in the file <deci-seconds>: poll all this frequency 0: no polling (default) - tmscsim= [HW,SCSI] - See comment before function dc390_setup() in - drivers/scsi/tmscsim.c. - topology= [S390] Format: {off | on} Specify if the kernel should make use of the cpu @@ -2547,9 +2442,6 @@ and is between 256 and 4096 characters. It is defined in the file <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> See also Documentation/input/joystick-parport.txt - u14-34f= [HW,SCSI] UltraStor 14F/34F SCSI host adapter - See header of drivers/scsi/u14-34f.c. - uhash_entries= [KNL,NET] Set number of hash buckets for UDP/UDP-Lite connections @@ -2715,12 +2607,6 @@ and is between 256 and 4096 characters. It is defined in the file overridden by individual drivers. 0 will hide cursors, 1 will display them. - wd33c93= [HW,SCSI] - See header of drivers/scsi/wd33c93.c. - - wd7000= [HW,SCSI] - See header of drivers/scsi/wd7000.c. - watchdog timers [HW,WDT] For information on watchdog timers, see Documentation/watchdog/watchdog-parameters.txt or other driver-specific files in the @@ -2746,8 +2632,10 @@ and is between 256 and 4096 characters. It is defined in the file aux-ide-disks -- unplug non-primary-master IDE devices nics -- unplug network devices all -- unplug all emulated devices (NICs and IDE disks) - ignore -- continue loading the Xen platform PCI driver even - if the version check failed + unnecessary -- unplugging emulated devices is + unnecessary even if the host did not respond to + the unplug protocol + never -- do not unplug even if version check succeeds xirc2ps_cs= [NET,PCMCIA] Format: diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index f6f80257addb..1565eefd6fd5 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1024,6 +1024,10 @@ ThinkPad-specific interface. The driver will disable its native backlight brightness control interface if it detects that the standard ACPI interface is available in the ThinkPad. +If you want to use the thinkpad-acpi backlight brightness control +instead of the generic ACPI video backlight brightness control for some +reason, you should use the acpi_backlight=vendor kernel parameter. + The brightness_enable module parameter can be used to control whether the LCD brightness control feature will be enabled when available. brightness_enable=0 forces it to be disabled. brightness_enable=1 diff --git a/Documentation/lguest/Makefile b/Documentation/lguest/Makefile index 28c8cdfcafd8..bebac6b4f332 100644 --- a/Documentation/lguest/Makefile +++ b/Documentation/lguest/Makefile @@ -1,5 +1,6 @@ # This creates the demonstration utility "lguest" which runs a Linux guest. -CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include -I../../arch/x86/include -U_FORTIFY_SOURCE +# Missing headers? Add "-I../../include -I../../arch/x86/include" +CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -U_FORTIFY_SOURCE all: lguest diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index e9ce3c554514..8a6a8c6d4980 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -39,14 +39,14 @@ #include <limits.h> #include <stddef.h> #include <signal.h> -#include "linux/lguest_launcher.h" -#include "linux/virtio_config.h" -#include "linux/virtio_net.h" -#include "linux/virtio_blk.h" -#include "linux/virtio_console.h" -#include "linux/virtio_rng.h" -#include "linux/virtio_ring.h" -#include "asm/bootparam.h" +#include <linux/virtio_config.h> +#include <linux/virtio_net.h> +#include <linux/virtio_blk.h> +#include <linux/virtio_console.h> +#include <linux/virtio_rng.h> +#include <linux/virtio_ring.h> +#include <asm/bootparam.h> +#include "../../include/linux/lguest_launcher.h" /*L:110 * We can ignore the 42 include files we need for this program, but I do want * to draw attention to the use of kernel-style types. @@ -1447,14 +1447,15 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name) static void configure_device(int fd, const char *tapif, u32 ipaddr) { struct ifreq ifr; - struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr; + struct sockaddr_in sin; memset(&ifr, 0, sizeof(ifr)); strcpy(ifr.ifr_name, tapif); /* Don't read these incantations. Just cut & paste them like I did! */ - sin->sin_family = AF_INET; - sin->sin_addr.s_addr = htonl(ipaddr); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = htonl(ipaddr); + memcpy(&ifr.ifr_addr, &sin, sizeof(sin)); if (ioctl(fd, SIOCSIFADDR, &ifr) != 0) err(1, "Setting %s interface address", tapif); ifr.ifr_flags = IFF_UP; diff --git a/Documentation/mutex-design.txt b/Documentation/mutex-design.txt index c91ccc0720fa..38c10fd7f411 100644 --- a/Documentation/mutex-design.txt +++ b/Documentation/mutex-design.txt @@ -9,7 +9,7 @@ firstly, there's nothing wrong with semaphores. But if the simpler mutex semantics are sufficient for your code, then there are a couple of advantages of mutexes: - - 'struct mutex' is smaller on most architectures: .e.g on x86, + - 'struct mutex' is smaller on most architectures: E.g. on x86, 'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes. A smaller structure size means less RAM footprint, and better CPU-cache utilization. @@ -136,3 +136,4 @@ the APIs of 'struct mutex' have been streamlined: void mutex_lock_nested(struct mutex *lock, unsigned int subclass); int mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass); + int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.txt index 9363e056188a..8ed17587a74b 100644 --- a/Documentation/power/regulator/overview.txt +++ b/Documentation/power/regulator/overview.txt @@ -13,7 +13,7 @@ regulators (where voltage output is controllable) and current sinks (where current limit is controllable). (C) 2008 Wolfson Microelectronics PLC. -Author: Liam Girdwood <lg@opensource.wolfsonmicro.com> +Author: Liam Girdwood <lrg@slimlogic.co.uk> Nomenclature diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt index 568fa08e82e5..302db5da49b3 100644 --- a/Documentation/powerpc/booting-without-of.txt +++ b/Documentation/powerpc/booting-without-of.txt @@ -49,40 +49,13 @@ Table of Contents f) MDIO on GPIOs g) SPI busses - VII - Marvell Discovery mv64[345]6x System Controller chips - 1) The /system-controller node - 2) Child nodes of /system-controller - a) Marvell Discovery MDIO bus - b) Marvell Discovery ethernet controller - c) Marvell Discovery PHY nodes - d) Marvell Discovery SDMA nodes - e) Marvell Discovery BRG nodes - f) Marvell Discovery CUNIT nodes - g) Marvell Discovery MPSCROUTING nodes - h) Marvell Discovery MPSCINTR nodes - i) Marvell Discovery MPSC nodes - j) Marvell Discovery Watch Dog Timer nodes - k) Marvell Discovery I2C nodes - l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes - m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes - n) Marvell Discovery GPP (General Purpose Pins) nodes - o) Marvell Discovery PCI host bridge node - p) Marvell Discovery CPU Error nodes - q) Marvell Discovery SRAM Controller nodes - r) Marvell Discovery PCI Error Handler nodes - s) Marvell Discovery Memory Controller nodes - - VIII - Specifying interrupt information for devices + VII - Specifying interrupt information for devices 1) interrupts property 2) interrupt-parent property 3) OpenPIC Interrupt Controllers 4) ISA Interrupt Controllers - IX - Specifying GPIO information for devices - 1) gpios property - 2) gpio-controller nodes - - X - Specifying device power management information (sleep property) + VIII - Specifying device power management information (sleep property) Appendix A - Sample SOC node for MPC8540 diff --git a/Documentation/powerpc/hvcs.txt b/Documentation/powerpc/hvcs.txt index f93462c5db25..6d8be3468d7d 100644 --- a/Documentation/powerpc/hvcs.txt +++ b/Documentation/powerpc/hvcs.txt @@ -560,7 +560,7 @@ The proper channel for reporting bugs is either through the Linux OS distribution company that provided your OS or by posting issues to the PowerPC development mailing list at: -linuxppc-dev@ozlabs.org +linuxppc-dev@lists.ozlabs.org This request is to provide a documented and searchable public exchange of the problems and solutions surrounding this driver for the benefit of diff --git a/Documentation/scsi/scsi-parameters.txt b/Documentation/scsi/scsi-parameters.txt new file mode 100644 index 000000000000..21e5798526ee --- /dev/null +++ b/Documentation/scsi/scsi-parameters.txt @@ -0,0 +1,139 @@ + SCSI Kernel Parameters + ~~~~~~~~~~~~~~~~~~~~~~ + +See Documentation/kernel-parameters.txt for general information on +specifying module parameters. + +This document may not be entirely up to date and comprehensive. The command +"modinfo -p ${modulename}" shows a current list of all parameters of a loadable +module. Loadable modules, after being loaded into the running kernel, also +reveal their parameters in /sys/module/${modulename}/parameters/. Some of these +parameters may be changed at runtime by the command +"echo -n ${value} > /sys/module/${modulename}/parameters/${parm}". + + + advansys= [HW,SCSI] + See header of drivers/scsi/advansys.c. + + aha152x= [HW,SCSI] + See Documentation/scsi/aha152x.txt. + + aha1542= [HW,SCSI] + Format: <portbase>[,<buson>,<busoff>[,<dmaspeed>]] + + aic7xxx= [HW,SCSI] + See Documentation/scsi/aic7xxx.txt. + + aic79xx= [HW,SCSI] + See Documentation/scsi/aic79xx.txt. + + atascsi= [HW,SCSI] Atari SCSI + + BusLogic= [HW,SCSI] + See drivers/scsi/BusLogic.c, comment before function + BusLogic_ParseDriverOptions(). + + dtc3181e= [HW,SCSI] + + eata= [HW,SCSI] + + fd_mcs= [HW,SCSI] + See header of drivers/scsi/fd_mcs.c. + + fdomain= [HW,SCSI] + See header of drivers/scsi/fdomain.c. + + gdth= [HW,SCSI] + See header of drivers/scsi/gdth.c. + + gvp11= [HW,SCSI] + + ibmmcascsi= [HW,MCA,SCSI] IBM MicroChannel SCSI adapter + See Documentation/mca.txt. + + in2000= [HW,SCSI] + See header of drivers/scsi/in2000.c. + + ips= [HW,SCSI] Adaptec / IBM ServeRAID controller + See header of drivers/scsi/ips.c. + + mac5380= [HW,SCSI] Format: + <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> + + max_luns= [SCSI] Maximum number of LUNs to probe. + Should be between 1 and 2^32-1. + + max_report_luns= + [SCSI] Maximum number of LUNs received. + Should be between 1 and 16384. + + NCR_D700= [HW,SCSI] + See header of drivers/scsi/NCR_D700.c. + + ncr5380= [HW,SCSI] + + ncr53c400= [HW,SCSI] + + ncr53c400a= [HW,SCSI] + + ncr53c406a= [HW,SCSI] + + ncr53c8xx= [HW,SCSI] + + nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. + + osst= [HW,SCSI] SCSI Tape Driver + Format: <buffer_size>,<write_threshold> + See also Documentation/scsi/st.txt. + + pas16= [HW,SCSI] + See header of drivers/scsi/pas16.c. + + scsi_debug_*= [SCSI] + See drivers/scsi/scsi_debug.c. + + scsi_default_dev_flags= + [SCSI] SCSI default device flags + Format: <integer> + + scsi_dev_flags= [SCSI] Black/white list entry for vendor and model + Format: <vendor>:<model>:<flags> + (flags are integer value) + + scsi_logging_level= [SCSI] a bit mask of logging levels + See drivers/scsi/scsi_logging.h for bits. Also + settable via sysctl at dev.scsi.logging_level + (/proc/sys/dev/scsi/logging_level). + There is also a nice 'scsi_logging_level' script in the + S390-tools package, available for download at + http://www-128.ibm.com/developerworks/linux/linux390/s390-tools-1.5.4.html + + scsi_mod.scan= [SCSI] sync (default) scans SCSI busses as they are + discovered. async scans them in kernel threads, + allowing boot to proceed. none ignores them, expecting + user space to do the scan. + + sim710= [SCSI,HW] + See header of drivers/scsi/sim710.c. + + st= [HW,SCSI] SCSI tape parameters (buffers, etc.) + See Documentation/scsi/st.txt. + + sym53c416= [HW,SCSI] + See header of drivers/scsi/sym53c416.c. + + t128= [HW,SCSI] + See header of drivers/scsi/t128.c. + + tmscsim= [HW,SCSI] + See comment before function dc390_setup() in + drivers/scsi/tmscsim.c. + + u14-34f= [HW,SCSI] UltraStor 14F/34F SCSI host adapter + See header of drivers/scsi/u14-34f.c. + + wd33c93= [HW,SCSI] + See header of drivers/scsi/wd33c93.c. + + wd7000= [HW,SCSI] + See header of drivers/scsi/wd7000.c. diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index ce46fa1e643e..37c6aad5e590 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -296,6 +296,7 @@ Conexant 5051 Conexant 5066 ============= laptop Basic Laptop config (default) + hp-laptop HP laptops, e g G60 dell-laptop Dell laptops dell-vostro Dell Vostro olpc-xo-1_5 OLPC XO 1.5 diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt new file mode 100644 index 000000000000..e4498a2872c3 --- /dev/null +++ b/Documentation/workqueue.txt @@ -0,0 +1,380 @@ + +Concurrency Managed Workqueue (cmwq) + +September, 2010 Tejun Heo <tj@kernel.org> + Florian Mickler <florian@mickler.org> + +CONTENTS + +1. Introduction +2. Why cmwq? +3. The Design +4. Application Programming Interface (API) +5. Example Execution Scenarios +6. Guidelines + + +1. Introduction + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +2. Why cmwq? + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +3. The Design + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called thread-pools. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages thread-pool and processes the queued work items. + +The backend is called gcwq. There is one gcwq for each possible CPU +and one gcwq to serve work items queued on unbound workqueues. + +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like CPU locality, reentrancy, concurrency limits and more. To +get a detailed overview refer to the API description of +alloc_workqueue() below. + +When a work item is queued to a workqueue, the target gcwq is +determined according to the queue parameters and workqueue attributes +and appended on the shared worklist of the gcwq. For example, unless +specifically overridden, a work item of a bound workqueue will be +queued on the worklist of exactly that gcwq that is associated to the +CPU the issuer is running on. + +For any worker pool implementation, managing the concurrency level +(how many execution contexts are active) is an important issue. cmwq +tries to keep the concurrency at a minimal but sufficient level. +Minimal to save resources and sufficient in that the system is used at +its full capacity. + +Each gcwq bound to an actual CPU implements concurrency management by +hooking into the scheduler. The gcwq is notified whenever an active +worker wakes up or sleeps and keeps track of the number of the +currently runnable workers. Generally, work items are not expected to +hog a CPU and consume many cycles. That means maintaining just enough +concurrency to prevent work processing from stalling should be +optimal. As long as there are one or more runnable workers on the +CPU, the gcwq doesn't start execution of a new work, but, when the +last running worker goes to sleep, it immediately schedules a new +worker so that the CPU doesn't sit idle while there are pending work +items. This allows using a minimal number of workers without losing +execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For an unbound wq, the above concurrency management doesn't apply and +the gcwq for the pseudo unbound CPU tries to start executing all work +items as soon as possible. The responsibility of regulating +concurrency level is on the users. There is also a flag to mark a +bound wq to ignore the concurrency management. Please refer to the +API section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the thread-pool deadlocks waiting +for execution contexts to free up. + + +4. Application Programming Interface (API) + +alloc_workqueue() allocates a wq. The original create_*workqueue() +functions are deprecated and scheduled for removal. alloc_workqueue() +takes three arguments - @name, @flags and @max_active. @name is the +name of the wq and also used as the name of the rescuer thread if +there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. @flags +and @max_active control how work items are assigned execution +resources, scheduled and executed. + +@flags: + + WQ_NON_REENTRANT + + By default, a wq guarantees non-reentrance only on the same + CPU. A work item may not be executed concurrently on the same + CPU by multiple workers but is allowed to be executed + concurrently on multiple CPUs. This flag makes sure + non-reentrance is enforced across all CPUs. Work items queued + to a non-reentrant wq are guaranteed to be executed by at most + one worker system-wide at any given time. + + WQ_UNBOUND + + Work items queued to an unbound wq are served by a special + gcwq which hosts workers which are not bound to any specific + CPU. This makes the wq behave as a simple execution context + provider without concurrency management. The unbound gcwq + tries to start execution of work items as soon as possible. + Unbound wq sacrifices locality but is useful for the following + cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + + WQ_FREEZEABLE + + A freezeable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + + WQ_RESCUER + + All wq which might be used in the memory reclaim paths _MUST_ + have this flag set. This reserves one worker exclusively for + the execution of this wq under memory pressure. + + WQ_HIGHPRI + + Work items of a highpri wq are queued at the head of the + worklist of the target gcwq and start execution regardless of + the current concurrency level. In other words, highpri work + items will always start execution as soon as execution + resource is available. + + Ordering among highpri work items is preserved - a highpri + work item queued after another highpri work item will start + execution after the earlier highpri work item starts. + + Although highpri work items are not held back by other + runnable work items, they still contribute to the concurrency + level. Highpri work items in runnable state will prevent + non-highpri work items from starting execution. + + This flag is meaningless for unbound wq. + + WQ_CPU_INTENSIVE + + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, runnable CPU intensive + work items will not prevent other work items from starting + execution. This is useful for bound work items which are + expected to hog CPU cycles so that their execution is + regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + + WQ_HIGHPRI | WQ_CPU_INTENSIVE + + This combination makes the wq avoid interaction with + concurrency management completely and behave as a simple + per-CPU execution context provider. Work items queued on a + highpri CPU-intensive wq start execution as soon as resources + are available and don't affect execution of other work items. + +@max_active: + +@max_active determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, +with @max_active of 16, at most 16 work items of the wq can be +executing at the same time per CPU. + +Currently, for a bound wq, the maximum limit for @max_active is 512 +and the default value used when 0 is specified is 256. For an unbound +wq, the limit is higher of 512 and 4 * num_possible_cpus(). These +values are chosen sufficiently high such that they are not the +limiting factor while providing protection in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this +behavior. Work items on such wq are always queued to the unbound gcwq +and only one work item can be active at any given time thus achieving +the same ordering property as ST wq. + + +5. Example Execution Scenarios + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with @max_active >= 3, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If @max_active == 2, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +WQ_HIGHPRI set, + + TIME IN MSECS EVENT + 0 w1 and w2 start and burn CPU + 5 w1 sleeps + 10 w2 sleeps + 10 w0 starts and burns CPU + 15 w0 sleeps + 15 w1 wakes up and finishes + 20 w2 wakes up and finishes + 25 w0 wakes up and burns CPU + 30 w0 finishes + +If q1 has WQ_CPU_INTENSIVE set, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +6. Guidelines + +* Do not forget to use WQ_RESCUER if a wq may process work items which + are used during memory reclaim. Each wq with WQ_RESCUER set has one + rescuer thread reserved for it. If there is dependency among + multiple work items used during memory reclaim, they should be + queued to separate wq each with WQ_RESCUER. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), + flush and work item attributes. Work items which are not involved + in memory reclaim and don't need to be flushed as a part of a group + of work items, and don't require any special attribute, can use one + of the system wq. There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. |