Regular Expressions in Python:

What is a Regular Expression and how is it used?

Simply put, a regular expression is a sequence of characters mainly used to find and replace patterns in a string or file. As I mentioned before, they are supported by most programming languages, such as Python, Perl, R, Java and many others. So, learning them helps in multiple ways (more on this later).

Regular expressions use two types of characters:

a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in a wildcard.

b) Literals (like a, b, 1, 2…)

In Python, the module “re” provides support for regular expressions. So you need to import the re module before you can use regular expressions in Python.

Use this code --> import re

The most common uses of regular expressions are:

  • Searching a string (search and match)
  • Finding all occurrences of a string (findall)
  • Breaking a string into substrings (split)
  • Replacing part of a string (sub)

Let’s look at the methods that the “re” module provides to perform these tasks.

 

What are the various methods of Regular Expressions?

The ‘re’ package provides multiple methods to perform queries on an input string. Here are the most commonly used methods, which I will discuss:

  1. re.match()
  2. re.search()
  3. re.findall()
  4. re.split()
  5. re.sub()
  6. re.compile()

Let’s look at them one by one.

 

re.match(pattern, string):

This method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘AV Analytics Vidhya AV’ and looking for the pattern ‘AV’ will match. However, if we look for ‘Analytics’ instead, the pattern will not match. Let’s try it in Python now.

Code

import re
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print result

Output:
<_sre.SRE_Match object at 0x0000000009BE4370>

Above, the output shows that a match has been found. To print the matching string, we use the group() method, which returns the matched text. Put “r” at the start of the pattern string to designate a Python raw string, so that backslashes in the pattern are not treated as string escapes.

result = re.match(r'AV', 'AV Analytics Vidhya AV')
print result.group(0)


Output:
AV

Let’s now look for ‘Analytics’ in the given string. The string does not start with ‘Analytics’, so match() should return no match. Let’s see what we get:

Code

result = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print result 


Output: 
None

The start() and end() methods give the start and end positions of the matching pattern in the string.

Code

result = re.match(r'AV', 'AV Analytics Vidhya AV')
print result.start()
print result.end()

Output:
0
2

Above you can see the start and end positions of the matching pattern ‘AV’ in the string. These positions are often useful when manipulating the string.

 

re.search(pattern, string):

It is similar to match(), but it doesn’t restrict us to finding matches at the beginning of the string only. Unlike the previous method, here searching for the pattern ‘Analytics’ will return a match.

Code

result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print result.group(0)
Output:
Analytics

Here you can see that the search() method is able to find a pattern at any position in the string, but it only returns the first occurrence of the search pattern.
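The difference between match() and search() can be checked side by side. The sketch below is only an illustration: the helper name first_hit is hypothetical, and print() is called as a function so the snippet also runs on Python 3.

```python
import re

text = 'AV Analytics Vidhya AV'

def first_hit(func, pattern, string):
    """Return the matched text, or None when func finds nothing."""
    result = func(pattern, string)
    # match() and search() both return None on failure, so guard before group()
    return result.group(0) if result else None

at_start = first_hit(re.match, r'Analytics', text)   # only looks at position 0
anywhere = first_hit(re.search, r'Analytics', text)  # scans the whole string

print(at_start)
print(anywhere)
```

Checking the result before calling group() avoids an AttributeError when no match is found.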

 

re.findall(pattern, string):

It returns a list of all matching patterns. It is not restricted to searching from the start or the end of the string. If we use findall to search for ‘AV’ in the given string, it will return both occurrences of AV. For simple searches I would recommend re.findall(), since it can cover the use cases of both re.search() and re.match(). Note that it returns a list of strings rather than a match object, so start() and end() are not available.

Code

result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print result

Output:
['AV', 'AV']



re.split(pattern, string, [maxsplit=0]):

This method splits the string at the occurrences of the given pattern.

Code

result=re.split(r'y','Analytics')
result

Output:
['Anal', 'tics']

Above, we have split the string “Analytics” by “y”. The split() method has another argument, “maxsplit”, with a default value of zero. In that case it performs every split that can be made, but if we give maxsplit a value, it performs at most that many splits. Let’s look at the examples below:

Code

result=re.split(r'i','Analytics Vidhya')
print result

Output:
['Analyt', 'cs V', 'dhya'] #It has performed all the splits that can be done by pattern "i".

Code

result=re.split(r'i','Analytics Vidhya',maxsplit=1)
result

Output:
['Analyt', 'cs Vidhya']

Here, you can see that we have fixed maxsplit to 1. As a result, the list has only two values, whereas the previous example has three.

 

re.sub(pattern, repl, string):

It searches for a pattern and replaces it with a new substring. If the pattern is not found, the string is returned unchanged.

Code

result=re.sub(r'India','the World','AV is largest Analytics community of India')
result
Output:
'AV is largest Analytics community of the World'
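The "returned unchanged" behaviour is easy to confirm. In this small sketch the pattern 'Brazil' is just an assumed example of a word that does not occur in the sentence.

```python
import re

sentence = 'AV is largest Analytics community of India'

replaced = re.sub(r'India', 'the World', sentence)
untouched = re.sub(r'Brazil', 'the World', sentence)  # pattern is absent

print(replaced)
print(untouched == sentence)  # no match, so the input comes back as-is
```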



re.compile(pattern):

We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. It also lets us search for the same pattern again without rewriting it.

Code

import re
pattern=re.compile('AV')
result=pattern.findall('AV Analytics Vidhya AV')
print result
result2=pattern.findall('AV is largest analytics community of India')
print result2
Output:
['AV', 'AV']
['AV']
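A compiled pattern object offers the same operations as the module-level functions, without repeating the pattern each time. The following is a minimal sketch reusing the string from the examples above; the variable names are chosen only for illustration.

```python
import re

pattern = re.compile(r'AV')
text = 'AV Analytics Vidhya AV'

starts_with_av = pattern.match(text) is not None  # anchored at the start
first_span = pattern.search(text).span()          # (start, end) of first hit
all_hits = pattern.findall(text)                  # every occurrence
swapped = pattern.sub('XX', text)                 # replace every occurrence

print(starts_with_av, first_span, all_hits, swapped)
```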


Quick Recap of various methods:

Till now, we looked at various methods of regular expressions using a constant pattern (fixed characters). But what if we do not have a constant search pattern and want to return a specific set of characters (defined by a rule) from a string? Don’t be intimidated.

This can easily be solved by defining an expression with the help of pattern operators (meta and literal characters). Let’s look at the most common pattern operators.

What are the most commonly used operators?

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

Operators  Description
.  Matches any single character except newline ‘\n’.
?  Matches 0 or 1 occurrence of the pattern to its left.
+  Matches 1 or more occurrences of the pattern to its left.
*  Matches 0 or more occurrences of the pattern to its left.
\w  Matches an alphanumeric character (or underscore), whereas \W (upper case W) matches a non-alphanumeric character.
\d  Matches a digit [0-9], and \D (upper case D) matches a non-digit.
\s  Matches a single white space character (space, newline, return, tab, form feed), and \S (upper case S) matches any non-white space character.
\b  Matches the boundary between a word and a non-word character, and \B is the opposite of \b.
[..]  Matches any single character in the square brackets, and [^..] matches any single character not in the square brackets.
\  Escapes characters with a special meaning, like \. to match a period or \+ to match a plus sign.
^ and $  ^ and $ match the start and the end of the string respectively.
{n,m}  Matches at least n and at most m occurrences of the preceding expression. Written as {,m}, it matches from 0 up to m occurrences of the preceding expression.
a|b  Matches either a or b.
( )  Groups regular expressions and returns the matched text.
\t, \n, \r  Match tab, newline, return.
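A few of the operators in the table can be tried out directly. The sample strings below are made up for illustration and are not taken from the examples that follow; the expected results are shown as comments.

```python
import re

digits = re.findall(r'\d+', 'Room 12, floor 3')            # ['12', '3']
optional_u = re.findall(r'colou?r', 'color colour')        # ['color', 'colour']
either = re.findall(r'cat|dog', 'cat dog bird')            # ['cat', 'dog']
two_to_three = re.findall(r'a{2,3}', 'a aa aaa aaaa')      # ['aa', 'aaa', 'aaa']
punctuation = re.findall(r'[^\w\s]', 'a.b@c')              # ['.', '@']
s_words = re.findall(r'\bs\w*', 'she sells seashells')     # ['she', 'sells', 'seashells']

print(digits, optional_u, either, two_to_three, punctuation, s_words)
```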

For more details on the meta characters “(”, “)”, “|” and others, you can refer to this link (https://docs.python.org/2/library/re.html).

Now, let’s understand the pattern operators by looking at the examples below.

 

Some Examples of Regular Expressions

Problem 1: Return the first word of a given string

Solution-1  Extract each character (using “\w”)

Code

import re
result=re.findall(r'.','AV is largest Analytics community of India')
print result

Output:
['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']

Above, the space is also extracted. To avoid that, use “\w” instead of “.”.

 

Code

result=re.findall(r'\w','AV is largest Analytics community of India')
print result

Output:
['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']

 

Solution-2  Extract each word (using “*” or “+”)

Code

result=re.findall(r'\w*','AV is largest Analytics community of India')
print result

Output:
['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']

 

Again, the result contains unwanted entries: empty strings appear because “*” matches zero or more occurrences of the pattern to its left, so it also succeeds where there is no word character. To remove them, we will go with “+”.

Code

result=re.findall(r'\w+','AV is largest Analytics community of India')
print result
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

Solution-3  Extract the first word (using “^”)

Code

result=re.findall(r'^\w+','AV is largest Analytics community of India')
print result

Output:
['AV']

If we use “$” instead of “^”, it will return the word at the end of the string. Let’s look at it.

Code

result=re.findall(r'\w+$','AV is largest Analytics community of India')
print result
Output:
['India']

 

Problem 2: Return the first two characters of each word

Solution-1  Extract consecutive pairs of characters from each word, excluding spaces (using “\w”)
Code
result=re.findall(r'\w\w','AV is largest Analytics community of India')
print result

Output:
['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']
Solution-2  Extract the two consecutive characters at the start of each word boundary (using “\b”)
Code
result=re.findall(r'\b\w.','AV is largest Analytics community of India')
print result

Output:
['AV', 'is', 'la', 'An', 'co', 'of', 'In']

Problem 3: Return the domain type of given email-ids

To explain it in simple manner, I will again go with a stepwise approach:

Solution-1  Extract all characters after “@”

Code

result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print result 
Output: ['@gmail', '@test', '@analyticsvidhya', '@rest']

Above, you can see that the “.com”, “.in” part is not extracted. To add it, we will go with the code below. The period is escaped as “\.” so that it matches a literal dot rather than any character.

result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print result
Output:
['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

Solution-2  Extract only the domain type using “( )”

Code

result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print result
Output:
['com', 'in', 'com', 'biz']

Problem 4: Return date from given string

Here we will use “\d” to extract digits.

Solution:

Code

result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print result
Output:
['12-05-2007', '11-11-2011', '12-01-2009']

If you want to extract only the year, parentheses “( )” will again help you.

Code


result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print result
Output:
['2007', '2011', '2009']

Problem 5: Return all words of a string that start with a vowel

Solution-1  Return each word

Code

result=re.findall(r'\w+','AV is largest Analytics community of India')
print result

Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

Solution-2  Return words starting with a vowel (using [])

Code

result=re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print result

Output:
['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']
Above you can see that it has returned “argest” and “ommunity” from the middle of words. To drop these two, we need to use “\b” for the word boundary.
Solution-3
Code
result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print result 

Output:
['AV', 'is', 'Analytics', 'of', 'India']

In a similar way, we can extract words that start with a consonant using “^” within the square brackets.

 

Code

result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')
print result

Output:
[' is', ' largest', ' Analytics', ' community', ' of', ' India']
Above you can see that it has returned words starting with a space. To drop them from the output, include a space in the square brackets [].
Code
result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')
print result

Output:
['largest', 'community']


Problem 6: Validate a phone number (the phone number must have 10 digits and start with 8 or 9)

We have a list of phone numbers in the list “li”, and here we will validate the phone numbers using regular expressions.

Solution

Code

import re
li=['9999999999','999999-999','99999x9999']
for val in li:
    # one digit 8 or 9 followed by nine digits; len() rejects longer strings
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print 'yes'
    else:
        print 'no'
Output:
yes
no
no
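The len(val) == 10 check is needed above because match() only anchors the pattern at the start of the string. As an alternative sketch, the pattern itself can be pinned to the whole string with “^” and “$”. The helper name is_valid and the two extra test numbers are assumptions added for illustration.

```python
import re

numbers = ['9999999999', '999999-999', '99999x9999', '99999999990', '7999999999']

def is_valid(number):
    # ^ and $ pin the pattern to the whole string:
    # one digit 8 or 9, then exactly nine more digits
    return re.match(r'^[89][0-9]{9}$', number) is not None

verdicts = ['yes' if is_valid(n) else 'no' for n in numbers]
print(verdicts)
```

Note that “$” also matches just before a trailing newline, so input read from a file may need to be stripped first.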

Problem 7: Split a string with multiple delimiters

Solution

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print result

Output:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

We can also use the re.sub() method to replace these multiple delimiters with a single space “ ”.

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print result

Output:
asdf fjdk afed fjek asdf foo
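When delimiters can appear several times in a row, the character class alone leaves empty strings in the result. A possible refinement, sketched here on a made-up input string, is to add “+” so that a run of delimiters counts as one separator.

```python
import re

# Assumed input with repeated delimiters (not from the example above)
line = 'asdf  fjdk; afed,fjek,,asdf, foo'

with_gaps = re.split(r'[;,\s]', line)   # one split per delimiter character
fields = re.split(r'[;,\s]+', line)     # a run of delimiters is one separator
joined = re.sub(r'[;,\s]+', ' ', line)  # same idea for replacement

print(with_gaps)
print(fields)
print(joined)
```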

Problem 8: Retrieve Information from HTML file

I want to extract information from an HTML file (see the sample data below). Here we need to extract the information available between <td> and </td>, except the first numerical index. I have assumed here that the HTML code below is stored in a string str.

Sample HTML file (str)

<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td>Michael</td> <td>Emily</td></tr>

Solution:

Code

result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print result
Output:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]

You can read an HTML file using the library urllib2 (see the code below).

Code

import urllib2
response = urllib2.urlopen('')
html = response.read()

Python Interview Questions:

1. What is Python?

Python is an interpreted, interactive, object-oriented programming language. It incorporates exceptions, modules, very high level dynamic data types, dynamic typing and classes. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. Python is also usable as an extension language for applications that need a programmable interface. Lastly, Python is portable, which means it runs on numerous Unix variants, OS/2, Mac, and PCs under MS-DOS, Windows NT and Windows.

2. State some programming language features of Python.

Salient features of Python are:
Simple & Easy: Python is a simple language and easy to learn.
Free/open source: everybody can use Python without purchasing a license.
High level language: when coding in Python, one need not worry about low-level details.
Portable: Python code is machine and platform independent.
Extensible: Python programs can use C/C++ code.
Embedded Language: Python code can be embedded within C/C++ code and can be used as a scripting language.
Standard Library: the Python standard library contains pre-written tools for programming.
Built-in Data Structures: Python includes many built-in types, like lists, numbers and dictionaries.

3. What is pickling and unpickling?

The pickle module accepts almost any Python object, converts it into a string representation and dumps it into a file by using the dump function; this process is called pickling. The process of retrieving the original Python objects from the stored string representation is called unpickling.
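A minimal round trip can be sketched with dumps() and loads(), the in-memory counterparts of dump() and load(). The sample dictionary is an assumption made up for this example.

```python
import pickle

record = {'name': 'AV', 'scores': [1, 2, 3]}

payload = pickle.dumps(record)    # pickling: object -> byte string
restored = pickle.loads(payload)  # unpickling: byte string -> new object

print(restored == record)   # same content
print(restored is record)   # but a separate object
```

To write to a file instead, pickle.dump(record, f) and pickle.load(f) take a file object opened in binary mode.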

4. How is Python interpreted?

Python is an interpreted language. A Python program runs directly from the source code: in the standard CPython implementation, the interpreter first converts the source code written by the programmer into an intermediate language (bytecode), which is then executed by the Python virtual machine.

5. How is memory managed in Python?

Python memory is managed through a Python private heap space. All Python objects and data structures are located in this private heap. The programmer does not have access to the private heap; the interpreter takes care of it.

The allocation of heap space for Python objects is done by the Python memory manager. The core API gives the programmer access to some tools for coding.

Python also has a built-in garbage collector, which recycles all the unused memory and makes it available to the heap space.

6. What are the rules for local and global variables in Python?

If a variable is defined outside a function, it is implicitly global. If a variable is assigned a new value inside a function, it is local to that function. If we want it to be global, we need to declare it explicitly with the global statement. Variables that are only referenced inside a function are implicitly global. The following code snippet further explains the difference:

#!/usr/bin/python
# Filename: variable_localglobal.py

def fun1(a):
    print 'a:', a
    a = 33  # rebinds only the local name a
    print 'local a:', a

a = 100
fun1(a)
print 'a outside fun1:', a

def fun2():
    global b  # b now refers to the module-level variable
    print 'b:', b
    b = 33
    print 'global b:', b

b = 100
fun2()
print 'b outside fun2:', b

__________

Output
$ python variable_localglobal.py
a: 100
local a: 33
a outside fun1: 100
b: 100
global b: 33
b outside fun2: 33

7. What are list comprehensions in Python used for?

List comprehensions were introduced in Python version 2.0. A list comprehension creates a new list based on an existing list.

It maps a list into another list by applying an expression to each of the elements of the existing list.

List comprehensions create lists without using map(), filter() or the lambda form.
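As a short sketch with made-up numbers, a list comprehension can both transform and filter an existing list:

```python
numbers = [1, 2, 3, 4, 5]

# apply an expression to every element
squares = [n * n for n in numbers]

# the optional if clause keeps only the elements that pass the test
even_squares = [n * n for n in numbers if n % 2 == 0]

print(squares)
print(even_squares)
```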

8. How is memory managed in Python?

  • Memory management in Python involves a private heap containing all Python objects and data structures. The interpreter takes care of the Python heap, and the programmer has no access to it.
  • The allocation of heap space for Python objects is done by the Python memory manager. The core API of Python provides some tools for the programmer to write more reliable and robust programs.
  • Python also has a built-in garbage collector which recycles all the unused memory. When an object is no longer referenced by the program, the heap space it occupies can be freed. The garbage collector determines which objects are no longer referenced by the program, frees the memory they occupy and makes it available to the heap space.
  • The gc module defines functions to enable/disable the garbage collector:
    gc.enable() - Enables automatic garbage collection.
    gc.disable() - Disables automatic garbage collection.

9. How do you make a higher order function in Python?

A higher-order function accepts one or more functions as input or returns a new function. When a function needs to be treated as data to build a higher-order function, the functools module is helpful.
The functools.partial() function is often used for higher-order functions.
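Both ideas can be sketched in a few lines. The function names apply_twice and power are hypothetical and only serve the illustration: apply_twice takes a function as input, and functools.partial() returns a new function with one argument pre-filled.

```python
import functools

def apply_twice(func, value):
    """Higher-order function: takes a function and applies it two times."""
    return func(func(value))

def power(base, exponent):
    return base ** exponent

# partial() builds a new one-argument function with exponent fixed to 2
square = functools.partial(power, exponent=2)

result = apply_twice(square, 3)  # (3 ** 2) ** 2
print(result)
```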

10. Describe how to generate random numbers in Python?

The standard module random implements a random number generator.

The module provides many functions, such as:

uniform(a, b) returns a floating point number in the range [a, b].
randint(a, b) returns a random integer in the range [a, b].
random() returns a floating point number in the range [0, 1).

The following code snippet shows the usage of all three functions of the random module:
Note: the output of this code will be different every time it is executed.

import random
i = random.randint(1, 99)    # i is a random integer between 1 and 99 (inclusive)
j = random.uniform(1, 999)   # j is a random float between 1 and 999
k = random.random()          # k is a random float between 0 and 1
print("i :", i)
print("j :", j)
print("k :", k)
__________
Output (first run) -
('i :', 64)
('j :', 701.85008797642115)
('k :', 0.18173593240301023)

Output (second run) -
('i :', 83)
('j :', 56.817584548210945)
('k :', 0.9946957743038618)

11. What are the tools that help to find bugs or perform static analysis?

PyChecker is a static analysis tool that detects bugs in Python source code and warns about the style and complexity of the code. Pylint is another tool that verifies whether a module meets the coding standard.

12. What are Python decorators?

Decorators are used to modify callable objects like functions, methods, or classes. They are a syntactic convenience that allows a Python source file to say what it is going to do with the result of a function or a class statement before, rather than after, the statement.
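As a minimal sketch, the decorator below counts how often a function is called. The names count_calls and add are hypothetical and chosen only for this example.

```python
import functools

def count_calls(func):
    """Decorator that records how many times func has been called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls  # equivalent to: add = count_calls(add)
def add(x, y):
    return x + y

total = add(2, 3)
add(1, 1)
print(total, add.calls)
```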

13. What is the difference between list and tuple?

Tuples are like lists in Python that can’t be edited. A tuple is immutable: once you create it, you cannot change it. Lists, on the other hand, are mutable: you can edit them, and they work like the array object in PHP or JavaScript. You can add items to or delete items from a list, but you can’t do that with a tuple; tuples have a fixed size.

14. Are arguments passed by value or by reference?

In Python everything is an object, and all variables hold references to those objects. These references are passed to functions by value, so a function cannot change which object the caller’s variable refers to. It can, however, change the object itself if the object is mutable.
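The difference between rebinding a name and mutating an object can be sketched as follows; the function names rebind and mutate are made up for illustration.

```python
def rebind(items):
    items = [0]      # rebinds the local name only; the caller is unaffected

def mutate(items):
    items.append(4)  # changes the object that the caller also refers to

data = [1, 2, 3]

rebind(data)
after_rebind = list(data)  # snapshot: still the original contents

mutate(data)
print(after_rebind, data)
```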

15. What are dict and list comprehensions?

They are syntax constructions that simplify the creation of a dictionary or list based on an existing iterable.
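A small sketch with an assumed word list shows both forms side by side (dict comprehensions need Python 2.7 or later):

```python
words = ['AV', 'is', 'largest']

# dict comprehension: key: value pairs inside braces
lengths = {word: len(word) for word in words}

# list comprehension: a single expression inside brackets
upper = [word.upper() for word in words]

print(lengths)
print(upper)
```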

16. What built-in types does Python provide?

Mutable built-in types are as follows:

  • List
  • Sets
  • Dictionaries

Immutable built-in types are as follows:

  • Strings
  • Tuples
  • Numbers

17. What is namespace in Python?

A namespace is a mapping from names to objects. Most namespaces are implemented as Python dictionaries. Examples of namespaces are: the set of built-in names, the global names in a module, and the local names in a function invocation. In a sense, the set of attributes of an object also forms a namespace.

18. What is lambda in Python?

The lambda operator, or lambda function, is used to create small anonymous functions, i.e. functions without a name. These are called throw-away functions because they are only needed where they are created. Lambda functions are mostly used in combination with the functions reduce(), map() and filter().
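The combination with map(), filter() and reduce() might look like this. The sketch imports reduce from functools, which works on Python 2.6+ as well as Python 3, and wraps the results in list() so they print the same way on both.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

doubled = list(map(lambda n: n * 2, numbers))        # transform each item
odds = list(filter(lambda n: n % 2 == 1, numbers))   # keep items passing the test
total = reduce(lambda acc, n: acc + n, numbers)      # fold the list into one value

print(doubled, odds, total)
```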

19. Why do lambda forms in Python not have statements?

A lambda form in Python does not have statements because it is an expression: it is used to create a new function object and return it at runtime, so its body is limited to a single expression.

20. What is pass in Python?

pass is used when a statement is required syntactically but you do not want any command or code to execute. The pass statement is a null operation: nothing happens when it executes.

21. What are iterators in Python?

An iterator object implements the iterator protocol. The iterator protocol consists of two methods: the __iter__() method must return the iterator object, and the next() method (spelled __next__() in Python 3) returns the next element from a sequence.
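A hand-written iterator can be sketched as below. The class name Countdown is hypothetical; the method is defined as __next__ and aliased to next so that the same class works on Python 3 and Python 2.

```python
class Countdown(object):
    """Iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # an iterator returns itself from __iter__()
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the sequence is exhausted
        value = self.current
        self.current -= 1
        return value

    next = __next__  # Python 2 spelling of the same method

values = list(Countdown(3))
print(values)
```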

22. What is unittest in Python?

Python’s unittest module, also known as PyUnit, is based on the XUnit framework design. A similar pattern is repeated in numerous other languages, including C, Perl, Java, and Smalltalk. The framework implemented by unittest supports test suites, fixtures and a test runner to enable automated testing of code.

23. What is slicing in Python?

Slicing is a mechanism for selecting a range of items from sequence types like lists, tuples, strings etc.
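The slice syntax sequence[start:stop:step] works the same way on lists, tuples and strings. The sample values here are assumptions for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e']

first_two = letters[:2]          # from the beginning up to, not including, index 2
middle = letters[1:4]            # indices 1, 2 and 3
every_other = letters[::2]       # step of 2
reversed_text = 'Python'[::-1]   # a negative step walks backwards
last_two = (1, 2, 3, 4)[-2:]     # negative indices count from the end

print(first_two, middle, every_other, reversed_text, last_two)
```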

24. What are generators in Python?

Generators are a simple and powerful way to create iterators. On the outside they look like functions, but there are syntactic and semantic differences between them. Instead of return statements, the body of a generator contains one or more yield statements.
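A generator can be sketched in a few lines; the function name squares_up_to is made up for this example. Each call to next() resumes the body until the next yield.

```python
def squares_up_to(limit):
    """Generator that yields 1*1, 2*2, ..., limit*limit one at a time."""
    n = 1
    while n <= limit:
        yield n * n  # execution pauses here until the next value is requested
        n += 1

gen = squares_up_to(4)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # consumes the remaining values

print(first, rest)
```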

25. What is docstring in Python?

Python documentation strings (or docstrings) provide a convenient way of associating documentation with Python modules, functions, methods, and classes. An object’s docstring is defined by including a string constant as the first statement in the object’s definition.
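For illustration, here is a hypothetical function with a docstring; the string is stored in the __doc__ attribute and is what help() displays.

```python
def area(width, height):
    """Return the area of a rectangle."""
    return width * height

doc = area.__doc__  # the docstring is available at runtime
print(doc)
```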

26. How can you copy an object in Python?

To copy an object in Python, try the copy.copy() or copy.deepcopy() functions for the general case. Not all objects can be copied, but most of them can.
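The difference between the two functions shows up with nested objects. In this sketch, using a made-up nested list, the shallow copy shares the inner lists with the original, while the deep copy does not.

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)   # new outer list, same inner lists
deep = copy.deepcopy(original)  # inner lists are copied as well

original[0].append(99)          # mutate an inner list of the original

print(shallow[0])  # sees the change
print(deep[0])     # does not
```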

27. How can you convert a number to a string?

The built-in function str() is used to convert a number into a string. The oct() function gives an octal representation, while the hex() function gives a hexadecimal representation.
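A quick sketch with an assumed value of 255 is shown below. The exact text returned by oct() differs between Python 2 and Python 3 (the prefix changed), so the example converts it back with int() instead of relying on its spelling.

```python
value = 255

as_text = str(value)    # decimal string
as_hex = hex(value)     # hexadecimal string with a 0x prefix
as_octal = oct(value)   # octal string; the prefix depends on the Python version

round_trip = int(as_octal, 8)  # parse the octal string back into a number

print(as_text, as_hex, as_octal, round_trip)
```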

28. What is the difference between xrange and range?

xrange and range are equivalent in terms of functionality: both offer a way to produce a sequence of integers to iterate over. The difference between the two is that range returns a Python list object, while xrange returns an xrange object that produces its values on demand.

29. What are modules and packages in Python?

A module is just a Python file that can be used by importing it with the ‘import’ or ‘from module import var, function’ statements. A package is basically a way to organize code. Packages let you import directories on your computer into your programs using the ‘import’ or ‘from import’ statements.

 

An article from: https://intellipaat.com/blog/python-interview-questions