HADOOP: SETUP MAVEN PROJECT FOR MAPREDUCE IN 5MN
http://hadoopi.wordpress.com/2013/05/25/setup-maven-project-for-hadoop-in-5mn/
"People You May Know" Friendship Recommendation with Hadoop
http://importantfish.com/people-you-may-know-friendship-recommendation-with-hadoop/
There are two types of nodes that control the job execution process: a jobtracker and
a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress
reports to the jobtracker, which keeps a record of the overall progress of each job. If a
task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization because it doesn’t use valuable cluster
bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas
for a map task’s input split are running other map tasks, so the job scheduler will look
for a free map slot on a node in the same rack as one of the blocks. Very occasionally
even this is not possible, so an off-rack node is used, which results in an inter-rack
network transfer.
This is why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so
some of the split would have to be transferred across the network to the node running the task.
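The split size is computed by FileInputFormat from three values: the minimum split size, the maximum split size, and the block size. A sketch of the calculation; with the default settings the block size wins, which is why a split normally equals a block:

// split size = max(minimumSize, min(maximumSize, blockSize))
// by default minimumSize < blockSize < maximumSize, so the split size is the block size
long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}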
Map tasks write their output to the local disk, not to HDFS.
The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes.
Combiner Functions
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function. Because the combiner function
is an optimization, Hadoop does not provide a guarantee of how many times it
will call it for a particular map output record, if at all. In other words, calling the
combiner function zero, one, or many times should produce the same output from the
reducer.
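For the max-temperature example the reducer can double as the combiner, since taking a maximum is commutative and associative. A minimal sketch of wiring it up (the class names MaxTemperatureMapper/MaxTemperatureReducer are the book's example classes, assumed here):

Job job = new Job(conf, "Max temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);  // combiner runs on the map output
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);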
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java. Hadoop Streaming uses Unix standard streams
as the interface between Hadoop and your program, so you can use any language that
can read standard input and write to standard output to write your MapReduce
program.
#!/usr/bin/env ruby
# max_temperature_map.rb: emit (year, temperature) for valid readings
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
#!/usr/bin/env ruby
# max_temperature_reduce.rb: for each year (key), print the maximum temperature
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
In contrast to
the Java API, where you are provided an iterator over each key group, in Streaming you
have to find key group boundaries in your program.
cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
sort | ch02/src/main/ruby/max_temperature_reduce.rb
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
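When running against a cluster rather than locally, the scripts can be shipped with the -file option and a combiner specified with -combiner; a sketch, reusing the reducer as the combiner:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -combiner ch02/src/main/ruby/max_temperature_reduce.rb \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb \
  -file ch02/src/main/ruby/max_temperature_map.rb \
  -file ch02/src/main/ruby/max_temperature_reduce.rb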
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming,
which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with
the process running the C++ map or reduce function. JNI is not used.
Avro Data Types and Schemas
Avro defines a small number of primitive data types, which can be used to build
application-specific data structures by writing schemas.
{
  "type": "array",
  "items": "long"
}
{
  "type": "map",
  "values": "string"
}
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");
// serialize the record to a byte stream
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();
// read it back with the same (writer's) schema
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
Using avro-maven-plugin
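A minimal sketch of the plugin configuration that generates the StringPair class from StringPair.avsc; the schema goal and plugin coordinates are the plugin's own, while the directory layout here is an assumption for a standard Maven project:

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- assumed layout: .avsc files under src/main/resources,
             generated sources under target/generated-sources/avro -->
        <sourceDirectory>${project.basedir}/src/main/resources</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/avro</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>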
// build an instance of the Avro-generated StringPair class
// (setters assume a recent Avro code generator; older generated classes expose public fields)
StringPair datum = new StringPair();
datum.setLeft("L");
datum.setRight("R");
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<StringPair> writer =
    new SpecificDatumWriter<StringPair>(StringPair.class);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
DatumReader<StringPair> reader =
    new SpecificDatumReader<StringPair>(StringPair.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
StringPair result = reader.read(null, decoder);
Avro Datafiles
Avro's object container file format is for storing sequences of Avro objects. It is very
similar in design to Hadoop's sequence files, but Avro datafiles are designed to be portable across languages.
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();

DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  // process record
}
Schema Resolution
{"name": "description", "type": "string", "default": "}
Another common use of a different reader’s schema is to drop fields in a record, an
operation called projection. This is useful when you have records with a large number
of fields and you want to read only some of them.
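For example, a reader's schema for StringPair that projects away the left field and keeps only right (following the record schema above):

{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}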
Another useful technique for evolving Avro schemas is the use of name aliases. Aliases
allow you to use different names in the schema used to read the Avro data than in the
schema originally used to write the data.
{"name": "first", "type": "string", "aliases": ["left"]},
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would be if the tasks ran sequentially. This makes a job sensitive to slow-running tasks, so Hadoop tries to detect when a task is running slower than expected and launches another, equivalent task as a backup; this is termed speculative execution. The output of whichever copy finishes first is used, and the remaining duplicate is killed.
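Speculative execution is enabled by default and can be turned off per job; a sketch using the old-API JobConf setters (MyJob is a placeholder driver class):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder
conf.setMapSpeculativeExecution(false);    // mapred.map.tasks.speculative.execution
conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution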
Parallel Copying with distcp
distcp is a tool for copying large amounts of data to and from Hadoop filesystems in parallel.
hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
By default, distcp will skip files that already exist in the destination, but they can be
overwritten by supplying the -overwrite option. You can also update only the files that
have changed using the -update option.
hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
To copy between clusters running different versions of HDFS, use the webhdfs (or hftp) filesystem, which goes over HTTP via the namenode's web port, set by the dfs.http.address property (default 50070).
hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
Hadoop balancer tool
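The balancer runs as a background daemon that moves blocks from overutilized datanodes to underutilized ones until each datanode is within a threshold (10% by default) of the cluster's average utilization. It is started with:

% start-balancer.sh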
Hadoop Archives: HAR files
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files.
hadoop archive -archiveName files.har /my/files /my
hadoop fs -ls /my/files.har
hadoop fs -lsr har:///my/files.har
hadoop fs -lsr har:///my/files.har/my/files/dir
% hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir
To delete a HAR file, you need to use the recursive form of delete because from the
underlying filesystem’s point of view, the HAR file is a directory:
% hadoop fs -rmr /my/files.har
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers or for modifications at
arbitrary offsets in the file.
There are tools to perform filesystem maintenance, such as df and fsck, that operate on
the filesystem block level.
hadoop fsck / -files -blocks
Namenodes and Datanodes
The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files
and directories in the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log. The namenode also knows
the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, because this information is reconstructed from
datanodes when the system starts.
HDFS Federation
HDFS Federation, introduced in the 2.x release series, allows a cluster to scale by adding
namenodes, each of which manages a portion of the filesystem namespace.
HDFS High-Availability
There is a pair of namenodes in an active-standby
configuration. In the event of the failure of the active namenode, the standby
takes over its duties to continue servicing client requests without a significant interruption.
• The namenodes must use highly available shared storage to share the edit log.
• Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode's memory, and not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
Failover and fencing
Failover controllers are pluggable, but the first
implementation uses ZooKeeper to ensure that only one namenode is active.
The fs.default.name property, set to hdfs://localhost/, selects HDFS as the default filesystem; the default HDFS (namenode RPC) port is 8020.
The dfs.replication property sets the number of replicas for each HDFS block (the default is 3; in pseudo-distributed mode it is usually set to 1, since there is only one datanode).
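A sketch of how these two properties are typically set for pseudo-distributed mode (one property each in core-site.xml and hdfs-site.xml; a single replica because there is only one datanode):

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost/</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>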
Basic Filesystem Operations
hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
hadoop fs -copyToLocal quangle.txt quangle.copy.txt
md5 input/docs/quangle.txt quangle.copy.txt
directories are treated as metadata and stored by the namenode, not the datanodes.
Hadoop Filesystems
Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem.
HFTP | hftp | hdfs.HftpFileSystem | A filesystem providing read-only access to HDFS over HTTP. Often used with distcp to copy data between HDFS clusters running different versions.
HSFTP | hsftp | hdfs.HsftpFileSystem | A filesystem providing read-only access to HDFS over HTTPS.
WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | A filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is intended as a replacement for HFTP and HSFTP.
HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage.
FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A filesystem backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
Distributed RAID | hdfs | hdfs.DistributedRaidFileSystem | A "RAID" version of HDFS designed for archival storage. For each file in HDFS, a (smaller) parity file is created, which allows the HDFS replication to be reduced from three to two, reducing disk usage by 25% to 30% while keeping the probability of data loss the same. Distributed RAID requires that you run a RaidNode daemon on the cluster.
View | viewfs | viewfs.ViewFileSystem | A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes.
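All of these filesystems are reached through the same FileSystem abstraction, so code written against it works with any URI scheme in the table. A minimal sketch of cat-ing a file through the API (essentially the book's FileSystemCat example):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                        // e.g. hdfs://localhost/user/tom/quangle.txt
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));               // works for file:, hdfs:, webhdfs:, har:, ...
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}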
The hadoop fs command has a -text option to display sequence files in textual form.
It looks at a file’s magic number so that it can attempt to detect the type of the file and
appropriately convert it to text.
hadoop fs -text numbers.seq | head
Sorting and merging SequenceFiles
The most powerful way to sort (and also merge) one or more sequence files is to use MapReduce.
The SequenceFile format
A sequence file consists of a header followed by one or more records
The first three bytes of a sequence file are the bytes SEQ, which acts as a magic number,
followed by a single byte representing the version number. The header contains other
fields, including the names of the key and value classes, compression details, user-defined
metadata, and the sync marker. Recall that the sync marker is used to allow
a reader to synchronize to a record boundary from any position in the file. Each file has
a randomly generated sync marker, whose value is stored in the header. Sync markers
appear between records in the sequence file. They are designed to incur less than a 1%
storage overhead, so they don’t necessarily appear between every pair of records.
Block compression compresses multiple records at once; it is therefore more compact
than and should generally be preferred over record compression because it has the
opportunity to take advantage of similarities between records.
Records
are added to a block until it reaches a minimum size in bytes, defined by the
io.seqfile.compress.blocksize property; the default is 1 million bytes. A sync marker
is written before the start of every block. The format of a block is a field indicating the
number of records in the block, followed by four compressed fields: the key lengths,
the keys, the value lengths, and the values.
The temporary outputs of maps are stored using SequenceFile.
The SequenceFile class provides Writer, Reader, and Sorter classes for writing, reading, and sorting, respectively.
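A minimal sketch of writing a sequence file with SequenceFile.Writer, following the book's IntWritable/Text pattern; uri (the output path) and DATA (an array of sample strings) are assumed:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
IntWritable key = new IntWritable();
Text value = new Text();
SequenceFile.Writer writer = null;
try {
  writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
  for (int i = 0; i < 100; i++) {
    key.set(100 - i);                  // keys descend, values cycle through DATA
    value.set(DATA[i % DATA.length]);
    writer.append(key, value);
  }
} finally {
  IOUtils.closeStream(writer);
}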
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can
be thought of as a persistent form of java.util.Map.
// keys (IntWritable) must be appended in increasing order; values are Text
MapFile.Writer writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
for (int i = 0; i < 1024; i++) {
  key.set(i + 1);
  value.set(DATA[i % DATA.length]);
  writer.append(key, value);
}
writer.close();
If we look at the MapFile, we see it’s actually a directory containing two files called
data and index.
The data file contains all of the entries.
The index file contains a fraction of the keys and contains a mapping from the key to
that key’s offset in the data file.
by default only every 128th key is included in the index,
although you can change this value either by setting the io.map.index.interval
property or by calling the setIndexInterval() method on the MapFile.Writer instance.
A reason to increase the index interval would be to decrease the amount of memory
that the MapFile needs to store the index. Conversely, you might decrease the interval
to improve the time for random selection (since fewer records need to be skipped on
average) at the expense of memory usage.
reader.get(new IntWritable(496), value);
For this operation, the MapFile.Reader reads the index file into memory (this is cached
so that subsequent random access calls will use the same in-memory index). The reader
then performs a binary search on the in-memory index to find the key in the index that
is less than or equal to the search key.
Next, the reader seeks to this offset
in the data file and reads entries until the key is greater than or equal to the search key
(496). In this case, a match is found and the value is read from the data file.
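A sketch of the corresponding lookup, assuming the MapFile written above at uri:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
Text value = new Text();
Writable entry = reader.get(new IntWritable(496), value); // returns null if the key is absent
if (entry != null) {
  System.out.println(value);   // the value stored for key 496
}
reader.close();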
MapFile variants
• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys must be added in sorted order.
• ArrayFile is a MapFile where the key is an integer representing the index of the element in the array and the value is a Writable value.
• BloomMapFile is a MapFile that offers a fast version of the get() method, especially for sparsely populated files. The implementation uses a dynamic bloom filter for testing whether a given key is in the map. The test is very fast because it is in-memory, but it has a nonzero probability of false positives, in which case the regular get() method is called.