103.7 Search text files using regular expressions

Weight: 2

Candidates should be able to manipulate files and text data using regular expressions. This objective includes creating simple regular expressions containing several notational elements. It also includes using regular expression tools to perform searches through a filesystem or file content.

Objectives

  • Create simple regular expressions containing several notational elements.
  • Use regular expression tools to perform searches through a filesystem or file content.
  • grep
  • egrep
  • fgrep
  • sed
  • regex

Regex

A regular expression (regex or regexp) is a pattern that describes what we want to match in a text. Here we will discuss the form of regex used with the grep command, whose name comes from the ed command g/re/p (globally search for a regular expression and print).

There are two kinds of regex in GNU grep: Basic and Extended.

Basic blocks

Adding two expressions

If you need to add (concatenate) two expressions, just write them one after the other.

Regex   Will match
a       ali, mina, hamid, jadi
na      nasim, mina, nananana batman, mona
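
As a quick check of concatenation, here is a minimal sketch that feeds a few names to grep. The name list is made up for illustration and is not a file from this lesson.

```shell
# Hypothetical sample names, one per line.
names=$(printf '%s\n' ali mina hamid jadi nasim mona)

# 'na' is the expression 'n' followed directly by the expression 'a'.
matches=$(printf '%s\n' "$names" | grep 'na')

printf '%s\n' "$matches"
# prints:
# mina
# nasim
# mona
```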

Repeating

  • The * repeats the previous item zero or more times
  • The + repeats the previous item one or more times
  • The ? matches the previous item zero or one time
  • {n,m} matches the previous item at least n times, but not more than m times

In basic regex, GNU grep needs a backslash before +, ? and the braces (\+, \?, \{n,m\}). In extended regex they are written as shown above.

Regex   Will match                      Note
a*b     ab, aaab, aaaaab, aaabthis
a*b     b, mobser                       Because there is a b here with zero a before it
a+b     ab, aab, aaabenz                Won't match sober or b because there needs to be at least one a
a?b     ab, aab, b, batman, ...         batman matches as zero a, then b
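
The repetition operators can be tried with a short sketch like the one below. It assumes grep -E (extended regex) so that +, ? and {n,m} need no backslashes, and the word list is invented for the example.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' b ab aab aaab sober)

# a+b needs at least one a directly before a b.
plus=$(printf '%s\n' "$words" | grep -E 'a+b')

# a*b also accepts zero a, so any line containing a b matches.
star=$(printf '%s\n' "$words" | grep -E 'a*b')

# a{2,3}b needs two or three a directly before a b.
range=$(printf '%s\n' "$words" | grep -E 'a{2,3}b')

printf 'a+b:\n%s\n' "$plus"          # ab, aab, aaab
printf 'a*b:\n%s\n' "$star"          # all five words
printf 'a{2,3}b:\n%s\n' "$range"     # aab, aaab
```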

Alternation (|)

The pattern a\|b matches a or b. The backslash is needed in basic regex; in extended regex you write a|b.
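
A small sketch of alternation in both syntaxes follows. It assumes GNU grep, where \| works in basic regex, and uses an invented list of names.

```shell
# Hypothetical sample names, one per line.
names=$(printf '%s\n' ali bita hossein)

# Basic regex: the alternation bar must be escaped (GNU grep).
basic=$(printf '%s\n' "$names" | grep 'a\|b')

# Extended regex: the bar is special on its own.
extended=$(printf '%s\n' "$names" | grep -E 'a|b')

printf '%s\n' "$basic"
# prints:
# ali
# bita
```

Both commands select the same lines; hossein is dropped because it contains neither a nor b.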

Character Classes

The dot (.) means any character. So .. will match anything with at least two characters in it. You can also create your own classes with [abc], which matches a or b or c, and [a-z], which matches any character from a to z.

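Some regex flavours, such as Perl-compatible regex (grep -P in GNU grep), let you refer to digits with \d. Basic and extended regex in grep do not understand \d, so use [0-9] or [[:digit:]] there.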

Ranges

There are easy ways to refer to commonly used classes. Named classes open with [: and close with :], and they are used inside a bracket expression, for example [[:digit:]].

Range                     Meaning
[:alnum:]                 Alphanumeric characters
[:blank:]                 Space and tab characters
[:digit:]                 The digits 0 through 9 (equivalent to 0-9)
[:upper:] and [:lower:]   Upper and lower case letters, respectively
^ (negation)              As the first character after [ in a character class, negates the sense of the remaining characters
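
Here is a minimal sketch of named classes and negation in use. Note the double brackets: the named class sits inside a bracket expression. The sample lines are invented for the example.

```shell
# Hypothetical sample lines.
lines=$(printf '%s\n' 'room 101' 'Lobby' 'exit')

# Lines containing at least one digit.
digits=$(printf '%s\n' "$lines" | grep '[[:digit:]]')

# Lines containing at least one upper case letter.
upper=$(printf '%s\n' "$lines" | grep '[[:upper:]]')

# Lines containing at least one character that is NOT a lower case letter.
notlower=$(printf '%s\n' "$lines" | grep '[^[:lower:]]')

printf '%s\n' "$digits"      # room 101
printf '%s\n' "$upper"       # Lobby
printf '%s\n' "$notlower"    # room 101, Lobby
```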

A common form is .* which matches any sequence of characters of any length, including none.

Matching specific locations

  • The caret ^ means the beginning of the string (for grep, the beginning of the line)
  • The dollar $ means the end of the string (for grep, the end of the line)
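
The two anchors can be tried with a sketch like this one, again on an invented list of names.

```shell
# Hypothetical sample names, one per line.
names=$(printf '%s\n' amir mina ali bita)

# Lines that begin with a.
starts=$(printf '%s\n' "$names" | grep '^a')

# Lines that end with a.
ends=$(printf '%s\n' "$names" | grep 'a$')

# Lines that are exactly ali, nothing before or after.
exact=$(printf '%s\n' "$names" | grep '^ali$')

printf '%s\n' "$starts"   # amir, ali
printf '%s\n' "$ends"     # mina, bita
printf '%s\n' "$exact"    # ali
```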

Samples

The samples below use the extended syntax (grep -E):

  • ^a.* Matches anything that starts with a
  • ^a.*b$ Matches anything that starts with a and ends with b
  • ^a.*[0-9]+.*b$ Matches anything that starts with a, has some digits in the middle and ends with b
  • ^(l|b)oo Matches anything that starts with l or b, followed by oo
  • [f-hF-H]$ The last character should be f to h (capital or small)
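
Patterns like these are easy to check against a handful of test words. The sketch below assumes grep -E and writes the digit class as [0-9]; the word list is invented.

```shell
# Hypothetical test words, one per line.
words=$(printf '%s\n' look book shook a12b ab a1)

# Starts with l or b, followed by oo.
loo=$(printf '%s\n' "$words" | grep -E '^(l|b)oo')

# Starts with a, has at least one digit somewhere, ends with b.
mid=$(printf '%s\n' "$words" | grep -E '^a.*[0-9]+.*b$')

printf '%s\n' "$loo"   # look, book
printf '%s\n' "$mid"   # a12b
```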

grep

The grep command searches inside files and prints the lines that match.

$ grep p friends 
payam
pedram
$

These are the most important switches:

switch  meaning
-c      just show the count of matching lines
-v      invert the search (show the lines that do not match)
-n      show line numbers
-l      show only file names
-i      case insensitive

$ grep p *
friends:payam
friends:pedram
what_I_have.txt:laptop    2
what_I_have.txt:pillow    5
what_I_have.txt:apple    2
$ grep p * -n
friends:12:payam
friends:15:pedram
what_I_have.txt:2:laptop    2
what_I_have.txt:3:pillow    5
what_I_have.txt:4:apple    2
$ grep p * -l
friends
what_I_have.txt
$ grep p * -c
friends:2
what_I_have.txt:3
$
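
The -i and -v switches can be sketched on piped input like this. The names are made up and do not come from the friends file.

```shell
# Hypothetical sample names with mixed case.
names=$(printf '%s\n' Payam pedram ali)

# -i ignores case, so both P and p match.
insensitive=$(printf '%s\n' "$names" | grep -i p)

# -v inverts the match: keep lines with no p or P.
inverted=$(printf '%s\n' "$names" | grep -v -i p)

# -c prints only the number of matching lines.
count=$(printf '%s\n' "$names" | grep -c -i p)

printf '%s\n' "$insensitive"   # Payam, pedram
printf '%s\n' "$inverted"      # ali
printf '%s\n' "$count"         # 2
```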

It is very common to combine grep and find:

$ find . -type f -print0 | xargs -0 grep -c a | grep -v ali

This counts the lines containing a in every file under the current directory, then drops the output lines that contain ali.
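
To try the find and grep combination safely, a sketch like the following builds a throwaway directory first. The file names and contents are hypothetical, and -l is used instead of -c so that only the names of matching files are listed.

```shell
# Create a scratch directory with three hypothetical files.
dir=$(mktemp -d)
printf 'payam\npedram\n' > "$dir/friends"
printf 'laptop\nbook\n'  > "$dir/things.txt"
printf 'zzz\n'           > "$dir/nothing"

# find hands every regular file to grep; -print0 and -0 keep
# file names with spaces intact. sort makes the order predictable.
found=$(cd "$dir" && find . -type f -print0 | xargs -0 grep -l p | sort)

printf '%s\n' "$found"
# prints:
# ./friends
# ./things.txt

# Clean up the scratch directory.
rm -r "$dir"
```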

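Extended grep

Extended grep uses extended regular expressions. Characters such as ?, +, {, |, ( and ) are special there without backslash escaping, which makes patterns much easier to write. It can be used with the -E option or the egrep command, which is equivalent to grep -E. Recent GNU grep releases mark egrep as obsolescent in favour of grep -E.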
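
The difference in escaping shows up clearly with the + character. In this minimal sketch (the two input lines are invented), the same pattern selects different lines depending on whether -E is given.

```shell
# Two hypothetical lines: one with a literal plus sign, one without.
lines=$(printf '%s\n' 'a+b' 'aab')

# Basic regex: + is an ordinary character, so only the literal a+b matches.
basic=$(printf '%s\n' "$lines" | grep 'a+b')

# Extended regex: + means one or more a, so aab matches and a+b does not.
extended=$(printf '%s\n' "$lines" | grep -E 'a+b')

printf '%s\n' "$basic"      # a+b
printf '%s\n' "$extended"   # aab
```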

Fixed grep

If you need to search for an exact string (and not interpret it as a regex), use grep -F or fgrep. For example, fgrep this$ does not treat $ as the end of the line, so it finds this$that too.
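
The this$ case can be sketched as follows, using two invented lines. The pattern is single-quoted so the shell leaves the $ alone.

```shell
# Two hypothetical lines.
lines=$(printf '%s\n' 'find this' 'this$that')

# As a regex, $ anchors the match to the end of the line.
as_regex=$(printf '%s\n' "$lines" | grep 'this$')

# With -F the pattern is a plain string, so $ is just a character.
as_fixed=$(printf '%s\n' "$lines" | grep -F 'this$')

printf '%s\n' "$as_regex"   # find this
printf '%s\n' "$as_fixed"   # this$that
```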

sed

In previous lessons we saw simple sed usage. Here I have great news for you: sed understands regex! It is good to use the -r switch (also spelled -E) to tell sed that we are using extended regex.

$ sed -r "s/^(a|b)/STARTS WITH A OR B/" friends 
STARTS WITH A OR Bmir
mina
jadi
STARTS WITH A OR Bita
STARTS WITH A OR Bli
hassan

Main switches:

switch  meaning
-r      use extended regex (same as -E)
-n      suppress automatic output; add p after the pattern ( /something/p ) to print only the matching lines

$ sed -rn "/^(a|b)/p" friends 
amir
bita
ali
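
As one more sketch, sed can use repetition and character classes in a substitution. The example below assumes input shaped like the what_I_have.txt lines shown earlier (a name, some spaces, a number) and uses -E, the other spelling of -r.

```shell
# Hypothetical inventory lines: item name, spaces, then a count.
items=$(printf '%s\n' 'laptop    2' 'pillow    5' 'apple    2')

# Remove one or more spaces followed by one or more digits at the end of the line.
names=$(printf '%s\n' "$items" | sed -E 's/ +[0-9]+$//')

printf '%s\n' "$names"
# prints:
# laptop
# pillow
# apple
```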
