Wednesday, March 4, 2015

Linux Bash



http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-7.html
for i in $( ls ); do
  echo item: $i
done

for i in `seq 1 10`; do
  echo $i
done

COUNTER=0
while [ $COUNTER -lt 10 ]; do
  echo The counter is $COUNTER
  let COUNTER=COUNTER+1
done

COUNTER=20
until [ $COUNTER -lt 10 ]; do
  echo COUNTER $COUNTER
  let COUNTER-=1
done

http://tldp.org/LDP/abs/html/loops1.html
for arg in [list] ; do
http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_09_01.html
for i in `cat list`; do cp "$i" "$i".bak ; done
for i in "$LIST"; do

http://www.cyberciti.biz/faq/bash-for-loop/
for i in {1..5}
for i in {0..10..2}
The seq command (outdated)
for i in $(seq 1 2 20)
Three-expression bash for loops syntax
for (( c=1; c<=5; c++ ))
do
   echo "Welcome $c times"
done
How do I use for as an infinite loop?
for (( ; ; )); do
  echo "infinite loop; hit CTRL+C to stop"
done

http://code.tutsplus.com/tutorials/how-to-customize-the-command-prompt--net-20586
PS1='->'
source ~/.bashrc
PROMPT_COMMAND='echo "comes before the prompt"'
print_before_the_prompt () {
  echo "$USER: $PWD"
}

PROMPT_COMMAND=print_before_the_prompt
http://compositecode.com/2014/10/09/bashit-just-a-custom-bash-prompt-setup-for-git/

https://medium.com/@mandymadethis/pimp-out-your-command-line-b317cf42e953
alias sub='open -a "Sublime Text"'

http://stackoverflow.com/questions/20296664/how-to-uninstall-bash-it


http://mywiki.wooledge.org/BashFAQ/001
while IFS= read -r line; do
   printf '%s\n' "$line"
done < "$file"
In the loop above, IFS= prevents trimming of leading and trailing whitespace from each line. Remove it if you do want that trimming.
while read -r line; do
  printf '%s\n' "$line"
done <<< "$var"
while read -r line; do
  printf '%s\n' "$line"
done <<EOF
$var
EOF
while read -r line; do
  [[ $line = \#* ]] && continue
  printf '%s\n' "$line"
done < "$file"
# Input file has 3 columns separated by white space.
while read -r first_name last_name phone; do
  # Only print the last name (second column)
  printf '%s\n' "$last_name"
done < "$file"
# Extract the username and its shell from /etc/passwd:
while IFS=: read -r user pass uid gid gecos home shell; do
  printf '%s: %s\n' "$user" "$shell"
done < /etc/passwd
For tab-delimited files, use IFS=$'\t'.
read -r first last junk <<< 'Bob Smith 123 Main Street Elk Grove Iowa 123-555-6789'
some command | while read -r line; do
  printf '%s\n' "$line"
done
find . -type f -print0 | while IFS= read -r -d '' file; do
    mv "$file" "${file// /_}"
done

Note the usage of -print0 in the find command, which uses NUL bytes as filename delimiters; and -d '' in the read command to instruct it to read all text into the file variable until it finds a NUL byte. By default, find and read delimit their input with newlines; however, since filenames can potentially contain newlines themselves, this default behaviour will split up those filenames at the newlines and cause the loop body to fail. Additionally it is necessary to set IFS to an empty string, because otherwise read would still strip leading and trailing whitespace.
http://mywiki.wooledge.org/IFS
IFS = input field separator
In the read command, if multiple variable-name arguments are specified, IFS is used to split the line of input so that each variable gets a single field of the input. (The last variable gets all the remaining fields, if there are more fields than variables.)
When performing WordSplitting on an unquoted expansion, IFS is used to split the value of the expansion into multiple words.

Better Logging



http://perf4j.codehaus.org/devguide.html
http://pramod-musings.blogspot.com/2012/06/using-perf4j-to-time-methods.html
http://www.infoq.com/articles/perf4j
http://heshans.blogspot.com/2014/01/aspect-oriented-programming-with-java.html

http://www.nurkiewicz.com/2010/05/clean-code-clean-logs-use-appropriate.html
http://vasir.net/blog/development/how-logging-made-me-a-better-developer
Visibility into code helps manage complexity.
Communication

http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html
SLF4J is the best logging API available, mostly because of its great pattern substitution support:
log.debug("Found {} records matching filter: '{}'", records, filter);
SLF4J is just a façade. As an implementation I would recommend the Logback framework.
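For contrast, a small sketch of the string-concatenation style that the placeholder syntax replaces; log, records and filter are the names assumed from the example line above:

// Without placeholders the message string is built even when DEBUG is disabled,
// so the call usually has to be guarded:
if (log.isDebugEnabled()) {
  log.debug("Found " + records + " records matching filter: '" + filter + "'");
}

// With SLF4J placeholders the arguments are only formatted when the statement is actually logged:
log.debug("Found {} records matching filter: '{}'", records, filter);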
Perf4J

3) Do you know what you are logging?
read your logs often to spot incorrectly formatted messages.
avoid NPE
logging collections
log.debug("Returning users: {}", users);
It is a much better idea to log, for example, only ids of domain objects (or even only size of the collection).
in Java we can emulate it using the Commons Beanutils library:
    return CollectionUtils.collect(collection, new BeanToPropertyValueTransformer(propertyName));
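A hedged sketch of how those two lines fit together, assuming a domain class with a getId() property and an SLF4J logger; all names are illustrative:

import java.util.Collection;
import org.apache.commons.beanutils.BeanToPropertyValueTransformer;
import org.apache.commons.collections.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class UserLogging {
  private static final Logger log = LoggerFactory.getLogger(UserLogging.class);

  // Log only the size and the ids of the collection instead of every object's full toString().
  static void logReturnedUsers(Collection<?> users) {
    Collection<?> ids = CollectionUtils.collect(users, new BeanToPropertyValueTransformer("id"));
    log.debug("Returning {} users with ids: {}", users.size(), ids);
  }
}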
Beware of the improper implementation or usage of toString(). First, create toString() for each class that appears anywhere in logging statements, preferably using ToStringBuilder (but not its reflective counterpart). Second, watch out for arrays and non-typical collections: arrays and some unusual collections may not have a toString() implementation that calls toString() on each item. Use the Arrays.deepToString JDK utility method.
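And a tiny sketch of the Arrays.deepToString point; the log variable and the values are illustrative:

import java.util.Arrays;

String[][] rows = { {"a", "b"}, {"c", "d"} };
// Arrays.toString would print the nested arrays as type@hash; deepToString recurses into them:
log.debug("rows: {}", Arrays.deepToString(rows));   // [[a, b], [c, d]]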

Avoid side effects

5) Be concise and descriptive
log.debug("Message with id '{}' processed", message.getJMSMessageID());
Don't log passwords or any other personal information.

6) Tune your pattern
Logging the date when your logs roll every hour is pointless, as the date is already included in the log file name. On the other hand, without logging the thread name you would be unable to track any process using the logs when two threads work concurrently: the logs will overlap.
current time (without date, milliseconds precision), logging level, name of the thread, simple logger name (not fully qualified) and the message.
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
        <pattern>%d{HH:mm:ss.SSS} %-5level [%thread][%logger{0}] %m%n</pattern>
    </encoder>
</appender>
You should never include file name, class name and line number, although it’s very tempting.
Besides, logging class name, method name and/or line number has a serious performance impact.

7) Log method arguments and return values
You can even use a simple AOP aspect to log a wide range of methods in your code. This reduces code duplication, but be careful, since it may lead to an enormous amount of huge logs.
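A hedged sketch of what such an aspect could look like, in AspectJ/Spring AOP annotation style; the pointcut expression and the package name are illustrative, not from the article:

import java.util.Arrays;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Aspect
public class MethodLoggingAspect {
  private static final Logger log = LoggerFactory.getLogger(MethodLoggingAspect.class);

  // Logs arguments and return value of every method in the (illustrative) service package.
  @Around("execution(* com.example.service..*(..))")
  public Object logArgsAndResult(ProceedingJoinPoint pjp) throws Throwable {
    log.debug("Entering {} with args {}", pjp.getSignature(), Arrays.toString(pjp.getArgs()));
    Object result = pjp.proceed();
    log.debug("Leaving {} with result {}", pjp.getSignature(), result);
    return result;
  }
}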

8) Watch out for external systems

9) Log exceptions properly
Avoid logging exceptions; let your framework or container (whatever it is) do it for you.
Log, or wrap and throw back (which is preferable), never both, otherwise your logs will be confusing.
log.error("Error reading configuration file", e);        //L

10) Logs easy to read, easy to parse
Avoid formatting numbers and use patterns that can be easily recognized by regular expressions, etc. If that is not possible, print the data in two formats:
log.debug("Request TTL set to: {} ({})", new Date(ttl), ttl);
final String duration = DurationFormatUtils.formatDurationWords(durationMillis, true, true);
log.info("Importing took: {}ms ({})", durationMillis, duration);

Log4j MDC (Mapped Diagnostic Context)
http://veerasundar.com/blog/2009/10/log4j-mdc-mapped-diagnostic-context-what-and-why/
http://veerasundar.com/blog/2009/11/log4j-mdc-mapped-diagnostic-context-example-code/
http://logging.apache.org/log4j/2.x/manual/thread-context.html
ThreadContext.push(UUID.randomUUID().toString()); // add the fishtag
ThreadContext.pop();
ThreadContext.put("id", UUID.randomUUID().toString()); // add the fishtag
ThreadContext.clear();
A Filter to put the user name in MDC for every request call
import org.apache.log4j.MDC;
MDC.put("userName", "veera");
MDC.remove("userName");
log4j.appender.consoleAppender.layout.ConversionPattern = %-4r [%t] %5p %c %x - %m - %X{userName}%n
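A minimal sketch of the filter mentioned above, assuming the Servlet API and Log4j 1.x MDC; how the user name is obtained is illustrative:

import java.io.IOException;
import javax.servlet.*;
import org.apache.log4j.MDC;

public class UserNameMdcFilter implements Filter {

  @Override
  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    String userName = request.getParameter("userName");   // assumption: user name arrives as a request parameter
    if (userName == null) {
      userName = "anonymous";
    }
    try {
      MDC.put("userName", userName);   // now %X{userName} in the pattern prints it for every log line
      chain.doFilter(request, response);
    } finally {
      MDC.remove("userName");          // always clean up: threads are pooled and reused
    }
  }

  @Override
  public void init(FilterConfig filterConfig) {}

  @Override
  public void destroy() {}
}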
https://lizdouglass.wordpress.com/tag/log4j-ndc/
NDC is an object that Log4j manages per thread as a stack of contextual information.
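And the equivalent push/pop discipline with NDC; requestId is an assumed String:

import org.apache.log4j.NDC;

NDC.push(requestId);   // push contextual info for this thread; %x in the layout prints the NDC stack
try {
  // ... handle the request ...
} finally {
  NDC.pop();           // always pop so the pooled thread starts clean
  NDC.remove();        // optional: free the NDC storage for this thread
}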

Thursday, January 29, 2015

Lucene Jira Bugs


https://issues.apache.org/jira/browse/LUCENE-2606
Compares two strings with a collator, also looking to see if the strings are impacted by JDK bugs. May not avoid all JDK bugs in tests.
See https://bugs.openjdk.java.net/browse/JDK-8071862
java.text.Collator

To learn more
https://issues.apache.org/jira/browse/LUCENE-6192
Long overflow in LuceneXXSkipWriter can corrupt skip data
// Be careful when casting from long to int, etc.
skipBuffer.writeVInt((int) (curDocPointer - lastSkipDocPointer[level]));
was changed to:
skipBuffer.writeVLong(curDocPointer - lastSkipDocPointer[level]);


Wednesday, January 28, 2015

Learn Lucene-Solr Bugs

https://issues.apache.org/jira/browse/SOLR-6954
Considering changing SolrClient#shutdown to SolrClient#close.
SolrClient implements the Closeable interface
==> so we can use try-with-resources
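A minimal sketch of what that enables; the core name and URL are illustrative, and getById is the SolrJ real-time-get helper from SOLR-6449 below:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class RealTimeGetExample {
  public static void main(String[] args) throws Exception {
    // Because SolrClient implements Closeable, try-with-resources closes it automatically.
    try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
      SolrDocument doc = client.getById("1");   // real-time get of the document with id "1"
      System.out.println(doc);
    }
  }
}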

Solr 7024: improve java detection and error message
 if [ -z "$JAVA" ]; then
    echo >&2 "The currently defined JAVA_HOME ($JAVA_HOME) refers"
    echo >&2 "to a location where Java could not be found.  Aborting."
    echo >&2 "Either fix the JAVA_HOME variable or remove it from the"
    echo >&2 "environment so that the system PATH will be searched."
    exit 1
  fi

 $JAVA -version >/dev/null 2>&1 || {
  echo >&2 "Java not found, or an error was encountered when running java."
  echo >&2 "A working Java 8 is required to run Solr!"
  echo >&2 "Please install Java 8 or fix JAVA_HOME before running this script."
  echo >&2 "Command that we tried: '${JAVA} -version'"
  echo >&2 "Active Path:"
  echo >&2 "${PATH}"
  exit 1
}

https://issues.apache.org/jira/browse/SOLR-6449
Add first class support for Real Time Get in Solrj
   public SolrDocumentList getById(Collection<String> ids, SolrParams params) throws SolrServerException {
    if (ids == null || ids.isEmpty()) {
      throw new IllegalArgumentException("Must provide an identifier of a document to retrieve.");
    }

    ModifiableSolrParams reqParams = new ModifiableSolrParams(params);
    if (StringUtils.isEmpty(reqParams.get(CommonParams.QT))) {
      reqParams.set(CommonParams.QT, "/get");
    }
    reqParams.set("ids", (String[]) ids.toArray());

    return query(reqParams).getResults();
  }

https://issues.apache.org/jira/browse/SOLR-7013
Unclear error message with solr script when lacking jar executable
 hasJar=$(which jar 2>/dev/null)
hasUnzip=$(which unzip 2>/dev/null)

if [ ${hasJar} ]; then
  unzipCommand="$hasJar xf"
else
  if [ ${hasUnzip} ]; then
    unzipCommand="$hasUnzip"
  else
    echo -e "This script requires extracting a WAR file with either the jar or unzip utility, please install these utilities or contact your administrator for assistance."
    exit 1
  fi
fi

https://issues.apache.org/jira/browse/SOLR-6521
 final Object lock = locks.get(Math.abs(Hash.murmurhash3_x86_32(collection, 0, collection.length(), 0) % locks.size()));
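A rough sketch of the lock-striping idea in the line above: hash the collection name into a small fixed pool of lock objects, so work on the same collection is serialized while different collections can proceed in parallel. Plain hashCode() is used here instead of the murmurhash3 call from the original; the class is illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CollectionLocks {
  private final List<Object> locks;

  CollectionLocks(int size) {
    List<Object> l = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      l.add(new Object());
    }
    locks = Collections.unmodifiableList(l);
  }

  // Always returns the same lock for the same collection name.
  Object lockFor(String collection) {
    int bucket = (collection.hashCode() & 0x7fffffff) % locks.size();
    return locks.get(bucket);
  }
}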

Add Long/FixedBitSet and replace usage of OpenBitSet
https://issues.apache.org/jira/browse/LUCENE-5440
http://lucene.markmail.org/thread/35gw3amo53dsqsqj
==> with a lot of data and heavy computation, squeeze out every unneeded operation
So clearly FBS is faster than OBS (perhaps unless you use fastSet/fastGet) since it doesn't need to do bounds checking.
Also, FBS lets you grow it by offering a convenient copy constructor which allows expanding/shrinking the set.

SOLR-7050: realtime get should internally load only fields specified in fl [Performance]
https://issues.apache.org/jira/browse/SOLR-7050
== Only load the needed fields when calling searcher.doc
StoredDocument luceneDocument = searcher.doc(docid);
changed to:
       StoredDocument luceneDocument = searcher.doc(docid, rsp.getReturnFields().getLuceneFieldNames());

SOLR-6845: Suggester tests start new cores instead of reloading
https://issues.apache.org/jira/browse/SOLR-6845
LOG.info("reload(" + name + ")");
init
else if (getStoreFile().exists()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("attempt reload of the stored lookup from file " + getStoreFile());
}

SOLR-6909: Extract atomic update handling logic into AtomicUpdateDocumentMerger
Allow pluggable atomic update merging logic
util method: int docid = searcher.getFirstMatch(new Term(idField.getName(), idBytes));

SOLR-6931: We should do a limited retry when using HttpClient.
// always call setUseRetry, whether it is in config or not
 HttpClientUtil.setUseRetry(httpClient, config.getBool(HttpClientUtil.PROP_USE_RETRY, true));

SOLR-6932: All HttpClient ConnectionManagers and SolrJ clients should always be shutdown in tests and regular code.
change HttpClient to CloseableHttpClient
all of these types of things should be made closeable, including SolrJ clients for 5.0 (closeable rather than shutdown).
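For context, a small sketch of why the closeable types matter; the URL is illustrative:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class CloseableClientExample {
  public static void main(String[] args) throws Exception {
    // CloseableHttpClient (unlike the bare HttpClient interface) works with try-with-resources,
    // so connections and the connection manager are released deterministically.
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response = client.execute(new HttpGet("http://localhost:8983/solr/"))) {
      System.out.println(response.getStatusLine());
    }
  }
}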

SOLR-6324: Set finite default timeouts for select and update.
Currently HttpShardHandlerFactory and UpdateShardHandler default to infinite timeouts for socket connection and read. This can lead to undesirable behaviour: for example, if a machine crashes, searches in progress will wait forever for a result to come back and end up using threads which will only get terminated at shutdown.
clientParams.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, connectionTimeout);
this.defaultClient = HttpClientUtil.createClient(clientParams);
set socketTimeout and connTimeout for shardHandlerFactory in solr.xml
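A sketch of setting both timeouts on the SolrJ side; the millisecond values are illustrative, and PROP_SO_TIMEOUT is HttpClientUtil's socket-timeout parameter:

import org.apache.http.client.HttpClient;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.common.params.ModifiableSolrParams;

// Finite connect and socket (read) timeouts instead of the old infinite defaults.
ModifiableSolrParams clientParams = new ModifiableSolrParams();
clientParams.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 60000);   // connTimeout, ms
clientParams.set(HttpClientUtil.PROP_SO_TIMEOUT, 600000);          // socketTimeout, ms
HttpClient defaultClient = HttpClientUtil.createClient(clientParams);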

SOLR-6643: Fix error reporting & logging of low level JVM Errors that occur when loading/reloading a SolrCore
Great example about how to reproduce the problem and add test cases.
CoreContainerCoreInitFailuresTest.testJavaLangErrorFromHandlerOnStartup

SOLR-4839: Upgrade to Jetty 9
set persistTempDirectory to true
Jetty 9 has builtin support for disabling protocols (POODLE)
excludeProtocols: SSLv3


SOLR-6950: Ensure TransactionLogs are closed with test ObjectReleaseTracker.
assert ObjectReleaseTracker.track(this);
assert ObjectReleaseTracker.release(this);
// integration test in assert mode
// use ObjectReleaseTracker to make sure resource is closed and released
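A sketch of where that pair of asserts could be placed; the class name is illustrative:

import java.io.Closeable;
import org.apache.solr.common.util.ObjectReleaseTracker;

// Track on creation, release on close; with assertions enabled (-ea), tests can then
// report objects that were created but never closed.
public class TrackedTransactionLog implements Closeable {
  public TrackedTransactionLog() {
    assert ObjectReleaseTracker.track(this);
  }

  @Override
  public void close() {
    assert ObjectReleaseTracker.release(this);
  }
}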

Windows Bat
SOLR-6928: solr.cmd stop works only in english
change
  For /f "tokens=5" %%j in ('netstat -aon ^| find /i "listening" ^| find ":%SOLR_PORT%"') do (
to
  For /f "tokens=5" %%j in ('netstat -aon ^| find "TCP " ^| find ":%SOLR_PORT%"') do (
One related edit is that the find command should look for ":8983 " (with a space after the port number) to avoid matching other ports; e.g., the following stop command would select two lines of netstat output, since :1234 also matches :12345:
solr start -p 1234
solr start -p 12345
solr stop -p 1234

SOLR-7016: Fix bin\solr.cmd to work in a directory with spaces in the name.
Add "": "%SOLR_TIP%\bin"
 START /B "Solr-%SOLR_PORT%" /D "%SOLR_SERVER_DIR%" "%JAVA%" -server -Xss256k %SOLR_JAVA_MEM% %START_OPTS% -Dlog4j.configuration="%LOG4J_CONFIG%" -DSTOP.PORT=!STOP_PORT! -DSTOP.KEY=%STOP_KEY% ^

SOLR-7024: bin/solr: Improve java detection and error messages
SOLR-7013: use unzip if jar is not available (merged from r1653943)
https://issues.apache.org/jira/browse/SOLR-6787

Minors
https://issues.apache.org/jira/browse/SOLR-7059
Using paramset with multi-valued keys leads to a 500
Actually the map in MapSolrParams was changed from Map<String,String> to Map<String,Object>

https://issues.apache.org/jira/browse/SOLR-7046
NullPointerException when group.function uses query() function
Map context = ValueSource.newContext(searcher);
The variable context is always null because its scope is local to this function, but it gets passed on to another function later.

Not fixed yet
https://issues.apache.org/jira/browse/SOLR-6640



Don't understand
regression in /update/extract ? ref guide examples of fmap & xpath don't seem to be working
https://issues.apache.org/jira/browse/SOLR-6856


Tuesday, January 20, 2015

Solr Custom Score



Using CustomScoreQuery For Custom Solr/Lucene Scoring
When you implement your own Lucene query, you’re taking control of two things:

Matching – what documents should be included in the search results
Scoring – what score should be assigned to a document (and therefore what order should they appear in)



  1. You’ve utilized Solr’s extensive set of query parsers & features including function queries, joins, etc. None of this solved your problem
  2. You’ve exhausted the ecosystem of plugins that extend on the capabilities in (1). That didn’t work.
  3. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem.
  4. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted.
  5. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc.
  6. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each documents score via a callback. This still wasn’t low-level enough for you, you needed even more control.
http://dev.fernandobrito.com/2012/10/building-your-own-lucene-scorer/
Query query = parser.parse("searching something");
 
CustomScoreQuery customQuery = new MyOwnScoreQuery(query);
 
ScoreDoc[] hits = searcher.search(customQuery.createWeight(searcher), null, numHits).scoreDocs;
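For reference, a hedged sketch of what the MyOwnScoreQuery used above could look like against Lucene 4.x; the 2x boost and the class body are illustrative, and the reader-context parameter type changes across Lucene versions (IndexReader in 3.x, AtomicReaderContext in 4.x, LeafReaderContext in 5.x):

import org.apache.lucene.index.AtomicReaderContext;   // Lucene 4.x
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class MyOwnScoreQuery extends CustomScoreQuery {

  public MyOwnScoreQuery(Query subQuery) {
    super(subQuery);   // matching is still delegated to the wrapped query
  }

  @Override
  protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context) {
    return new CustomScoreProvider(context) {
      @Override
      public float customScore(int doc, float subQueryScore, float valSrcScore) {
        // Illustrative only: boost every matching document to twice its original score.
        return 2.0f * subQueryScore;
      }
    };
  }
}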