Master NumPy Library for Data Analysis in Python in 10 Minutes
Learn and become a master of one of the most widely used Python tools for data analysis.
NumPy is a Python library for working with arrays. It also provides functions for linear algebra, Fourier transforms, and matrices. It is an open source project that you can use freely. NumPy stands for Numerical Python.
NumPy — Ndarray Object
The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. Every item in an ndarray takes up the same size block of memory.
Every ndarray has an associated data-type object (called dtype) that describes its elements. Any single item extracted from an ndarray (by indexing) is represented by a Python object of one of the array scalar types.
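To make these properties concrete, here is a minimal sketch on a tiny array. The values are arbitrary and chosen only for illustration; the attribute names (dtype, itemsize, nbytes) are standard ndarray attributes.

```python
import numpy as np

# A small illustrative array; the values are arbitrary.
a = np.array([1.5, 2.5, 3.5])

print(a.dtype)     # the data type shared by every element
print(a.itemsize)  # number of bytes used by each element
print(a.nbytes)    # total bytes: itemsize * number of elements

# Indexing a single item returns a NumPy array scalar,
# not a plain Python float.
item = a[0]
print(type(item))
```

Because every element has the same dtype, NumPy can compute where any element lives in memory from its index alone, which is a large part of why array operations are fast.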
The following diagram shows a relationship between ndarray, data type object (dtype) and array scalar type −
The numpy.array function creates an ndarray from any object exposing the array interface, or from any method that returns an array:
numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
The above function takes the following parameters:
- object :- Any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.
- dtype :- Optional. The desired data type of the array.
- copy :- Optional. By default (True), the object is copied.
- order :- The memory layout: C (row major), F (column major), A (any), or K (keep the layout of the input where possible; the default in current NumPy releases).
- subok :- By default (False), the returned array is forced to be a base-class array. If True, sub-classes are passed through.
- ndmin :- Specifies the minimum number of dimensions of the resulting array.
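A short sketch of how some of these parameters behave in practice. The input values and variable names are made up for illustration.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]  # illustrative nested sequence

# dtype: force the element type instead of letting NumPy infer it
b = np.array(nested, dtype=float)
print(b.dtype, b.shape)

# ndmin: pad the result with leading dimensions if needed
c = np.array([1, 2, 3], ndmin=2)
print(c.shape)

# copy (default True): the result does not share memory with the input array
d = np.array(b)
d[0, 0] = 99.0
print(b[0, 0])  # b is unchanged

# order: request column-major (Fortran-style) memory layout
f = np.array(nested, order='F')
print(f.flags['F_CONTIGUOUS'])
```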
Operations on NumPy Arrays
In this blog, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.
The data is stored in the winequality-red.csv file, which we’ll be using throughout this tutorial.
Lists Of Lists for CSV Data
Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the csv file.
In the below code, we:
- Import the csv library.
- Open the winequality-red.csv file.
- With the file open, create a new csv.reader object.
- Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character.
- Call the list type to get all the rows from the file.
- Assign the result to wines.
import csv

with open(r'winequality-red.csv') as f:
    wines = list(csv.reader(f, delimiter=';'))

# Once we’ve read in the data, we can print out the first 3 rows:
print(wines[:3])
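A list of lists holds every value as a string, so for numerical work we read the file into a NumPy array with numpy.genfromtxt, skipping the header row. We can then check the number of rows and columns in our data using the shape property of NumPy arrays: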
import numpy as np

# Read the data into a 2-dimensional array, skipping the header row
wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")

# shape property
x = wines.shape
print(x)
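If you don't have the file at hand, the same call can be tried on an in-memory stand-in. The sketch below assumes a tiny made-up table with four columns; the rows are illustrative and are not taken from the real dataset.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: a header row plus three
# made-up data rows, separated by semicolons like the real file.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"pH";"quality"\n'
    "7.0;0.5;3.3;5\n"
    "8.0;0.6;3.2;6\n"
    "6.5;0.4;3.4;7\n"
)

# skip_header=1 drops the header row, which cannot be parsed as numbers
sample_wines = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(sample_wines.shape)  # (rows, columns)
print(sample_wines.dtype)
```

The shape tuple always lists rows first and columns second for a 2-dimensional array.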
Indexing NumPy Arrays
Let’s select the element at row 3 and column 4. In the below code, we pass in the index 2 as the row index, and the index 3 as the column index. This retrieves the value from the fourth column of the third row:
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")

# Indexing NumPy Array: third row (index 2), fourth column (index 3)
y = wines[2, 3]
print(y)
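The same indexing rules are easy to check on a small array whose contents we know in advance. This is a minimal sketch; the array grid is made up for illustration and also shows the colon (slice) syntax used later in this post.

```python
import numpy as np

# 3 rows x 4 columns holding the numbers 0..11
grid = np.arange(12).reshape(3, 4)

elem = grid[2, 3]        # third row, fourth column
first_col = grid[:, 0]   # every row, first column
block = grid[0:2, 1:3]   # rows 0-1, columns 1-2

print(elem)       # 11
print(first_col)  # [0 4 8]
print(block)
```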
1-Dimensional NumPy Arrays
So far, we’ve worked with 2-dimensional arrays, such as wines. However, NumPy is a package for working with multidimensional arrays. One of the most common types of multidimensional arrays is the 1-dimensional array, or vector.
Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. If we slice wines and only retrieve the third row, we get a 1-dimensional array:
We can retrieve individual elements from third_wine using a single index. The below code will display the second item in third_wine:
Most NumPy functions, such as numpy.random.rand, can be used with arrays of any number of dimensions. Here’s how we’d use numpy.random.rand to generate a random vector:
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")

# slice wines and retrieve only the third row (index 2)
third_wine = wines[2, :]
print('1. retrieve third row', third_wine)
print("\n")

# second item in third_wine (index 1)
x = third_wine[1]
print('2. second item in third row = ', x)
print("\n")

# use numpy.random.rand to generate a random vector
y = np.random.rand(3)
print('3. random vector generated is = ', y)
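As a quick check of the ideas in this section, the sketch below uses a small made-up array to show that a row slice is 1-dimensional, and that numpy.random.rand accepts one argument per dimension.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # illustrative 3 x 4 array

third_row = grid[2, :]       # slicing out one row gives a 1-D array
print(third_row.ndim)        # 1
second_item = third_row[1]   # a single index is enough for a 1-D array
print(second_item)           # 9

vec = np.random.rand(3)      # random vector with 3 elements
mat = np.random.rand(2, 3)   # random 2 x 3 array
print(vec.shape, mat.shape)
```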
After reading our dataset and learning about lists of lists, indexing, and 1-dimensional arrays in NumPy, we can start performing operations on the data.
The first element of each row is the fixed acidity, the second is the volatile acidity, and so on. We can find the average quality of the wines. The below code will:
- Extract the last element from each row after the header row.
- Convert each extracted element to a float.
- Assign all the extracted elements to the list qualities.
- Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.
import csv

with open(r'winequality-red.csv') as f:
    wines = list(csv.reader(f, delimiter=';'))
print(wines[:3])
print("\n")

# Take the last element of each row after the header and convert it to a float
qualities = [float(item[-1]) for item in wines[1:]]
avg = sum(qualities) / len(qualities)
print('average = ', avg)
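For comparison, here is how the same average might look once the data is in a NumPy array. This is a sketch on a made-up three-row array whose last column plays the role of quality; the values are not from the real dataset.

```python
import numpy as np

# Illustrative rows: the last column stands in for the quality score.
sample = np.array([
    [7.0, 0.5, 5.0],
    [8.0, 0.6, 6.0],
    [6.5, 0.4, 7.0],
])

qualities = sample[:, -1]    # last column of every row
avg_numpy = qualities.mean() # NumPy computes the mean directly
avg_manual = sum(qualities) / len(qualities)  # same steps as the list version

print(avg_numpy, avg_manual)
```

No header handling or float conversion is needed here, because the array already holds numbers.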
NumPy Array Methods
In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:
=> 2. Sum of alcohol content in all sample red wines
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)

# Task 2: sum of alcohol content in all sample red wines.
# Assuming the standard column order of winequality-red.csv, alcohol is at
# index 10; the last column (index 11) is quality.
total_alcohol = wines[:, 10].sum()
print('sum = ', total_alcohol)
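The sum method also accepts an axis argument, which is handy when we want one total per column or per row instead of a single number. A minimal sketch on a made-up array:

```python
import numpy as np

sample = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

total = sample.sum()           # all elements
col_sums = sample.sum(axis=0)  # one sum per column
row_sums = sample.sum(axis=1)  # one sum per row
last_col = sample[:, 2].sum()  # sum of a single column, as in the task above

print(total)     # 21.0
print(col_sums)  # [5. 7. 9.]
print(row_sums)  # [ 6. 15.]
print(last_col)  # 9.0
```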
NumPy Array Comparisons
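Comparison operators such as <, >, <=, >=, and == can be applied to a whole array at once. For example, comparing the quality column with wines[:,11] > 5 gives a Boolean array that tells us which of the wines have a quality rating greater than 5. We can do something similar with the other operators. For instance, wines[:,11] == 10 shows whether any wines have a quality rating equal to 10.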
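Here is a small sketch of both comparisons on a made-up quality column, so the resulting Boolean arrays are easy to read.

```python
import numpy as np

# Illustrative quality scores, not taken from the real dataset.
quality = np.array([5.0, 6.0, 7.0, 5.0, 8.0])

above_five = quality > 5   # element-wise comparison
equal_ten = quality == 10

print(above_five)        # [False  True  True False  True]
print(above_five.sum())  # True counts as 1, so this is the number of matches: 3
print(equal_ten.any())   # False: no score equals 10
```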
=> 3. select wines having pH content > 5
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)

# Task 3: select wines having pH content > 5.
# Assuming the standard column order of winequality-red.csv, pH is at index 8.
ph_levels = wines[:, 8] > 5
print(ph_levels)
print("\n")
A Boolean array can also be used as an index. If a Boolean array named high_Quality marks the wines we want, wines[high_Quality,:] selects only the rows where high_Quality contains a True value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria. For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses, and separate conditions with an ampersand (&):
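A minimal sketch of Boolean indexing with two conditions. The two-column array and the thresholds are made up for illustration; column 0 stands in for alcohol and column 1 for quality.

```python
import numpy as np

# Illustrative rows: [alcohol, quality]
sample = np.array([
    [9.4, 5.0],
    [12.5, 8.0],
    [10.0, 7.0],
    [13.0, 6.0],
])

# Each condition needs its own parentheses, because & binds more
# tightly than the comparison operators.
mask = (sample[:, 0] > 10) & (sample[:, 1] > 7)

selected = sample[mask, :]  # rows where mask is True, all columns
print(mask)      # [False  True False False]
print(selected)  # [[12.5  8. ]]
```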
=> 4. Select only wines where sulphates >10 and alcohol >7
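import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)

# Task 4: select only wines where sulphates > 10 and alcohol > 7.
# Assuming the standard column order of winequality-red.csv,
# sulphates is at index 9 and alcohol at index 10.
sulalc = (wines[:, 9] > 10) & (wines[:, 10] > 7)
# Show the sulphates and alcohol columns of the matching rows
print(wines[sulalc, 9:11])
print("\n")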
=> 5. select wine having pH greater than mean pH
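import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)

# Task 5: select wines having pH greater than the mean pH.
# Assuming the standard column order of winequality-red.csv, pH is at index 8.
meanPH = wines[:, 8] > wines[:, 8].mean()
# Show the columns from pH onwards for the matching rows
print(wines[meanPH, 8:])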
We have seen what NumPy is, and some of its most basic uses. In the following posts we will see more complex functionalities and dig deeper into the workings of this fantastic library!
That is all. I hope you liked the post. Feel free to follow me.