Pandas:
Pandas is an open-source data analysis and manipulation library for Python. It provides data structures and functions for loading, cleaning, transforming, and summarizing tabular data efficiently. It is built on top of NumPy and is widely used in data science, machine learning, and analytics.
1. Installation and Setup
To install pandas, use:
pip install pandas
Or with Anaconda:
conda install pandas
To confirm that pandas is installed and check its version:
import pandas as pd
print(pd.__version__)
2. Pandas Data Structures
Pandas provides two primary data structures:
- Series – A one-dimensional labeled array.
- DataFrame – A two-dimensional table with labeled axes (rows and columns).
2.1 Series
A Series is similar to a column in a spreadsheet. It consists of data and an index.
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0    10
1    20
2    30
3    40
dtype: int64
You can also specify custom index labels:
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
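As a minimal sketch of why custom labels are useful, the example below reads values from the same Series both by label and by position. The variable names by_label, by_position, and subset are illustrative only.

```python
import pandas as pd

# Same values as above, labeled 'a' through 'd'.
series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = series['b']        # label-based lookup
by_position = series.iloc[1]  # position-based lookup (second element)
subset = series[['a', 'd']]   # select several labels at once

print(by_label, by_position)
print(subset)
```

Both lookups return the same element here because label 'b' sits at position 1.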
2.2 DataFrame
A DataFrame is a table-like structure in which each column is a Series and all columns share the same row index.
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
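After building a DataFrame, it often helps to check its size and column types before doing anything else. The sketch below rebuilds the same table and inspects it with the shape, dtypes, and columns attributes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # data type of each column
print(df.columns.tolist())  # column labels as a plain list
```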
3. Reading and Writing Data
3.1 Reading CSV
df = pd.read_csv('sampledata.csv')
print(df.head())
3.2 Writing CSV
df.to_csv('output.csv', index=False)
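The following sketch shows a CSV round trip without touching the file system, which can be convenient for experimenting. It assumes a small two-column table made up for illustration; io.StringIO stands in for a real file.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# read_csv accepts a file-like object, so wrap the string in StringIO.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column.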
4. Data Selection and Filtering
4.1 Selecting Columns
print(df['Name'])
print(df[['Name', 'Salary']])
4.2 Selecting Rows
print(df.iloc[0])  # row at integer position 0 (the first row)
print(df.loc[1])   # row whose index label is 1
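With the default integer index, iloc and loc can look interchangeable. The sketch below uses a hypothetical string index (the labels 'alice', 'bob', 'charlie' are invented for illustration) to make the difference between position and label visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],  # hypothetical row labels
)

first_by_position = df.iloc[0]       # first row, whatever its label
row_by_label = df.loc['bob']         # row whose index label is 'bob'
label_slice = df.loc['alice':'bob']  # label slices include the end label
position_slice = df.iloc[0:1]        # position slices exclude the end

print(label_slice)
print(position_slice)
```

Note the slicing difference: the label slice returns two rows, while the position slice returns one.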
4.3 Filtering Data
filtered_df = df[df['Age'] > 25]
print(filtered_df)
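Filters can also combine several conditions. A minimal sketch, using the same sample table: each condition goes in its own parentheses, and the element-wise operators & (and) and | (or) replace Python's and/or keywords.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Rows that satisfy both conditions.
both = df[(df['Age'] > 25) & (df['Salary'] < 70000)]

# Rows that satisfy at least one condition.
either = df[(df['Age'] < 30) | (df['Salary'] >= 70000)]

print(both)
print(either)
```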
5. Data Cleaning
5.1 Handling Missing Values
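# Both methods return a new DataFrame and leave df unchanged, so keep the result.
filled_df = df.fillna(0)   # replace missing values with 0
dropped_df = df.dropna()   # drop rows that contain any missing value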
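The sample table used so far has no gaps, so the sketch below assumes a small table with one missing Age to show what these methods do. Filling with the column mean is just one possible choice, used here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, float('nan'), 35],  # Bob's age is missing
})

missing_per_column = df.isna().sum()           # count missing values per column
filled = df.fillna({'Age': df['Age'].mean()})  # fill Age with the column mean
dropped = df.dropna()                          # drop rows with any missing value

print(missing_per_column)
print(filled)
print(dropped)
```

The original df still contains the missing value afterwards, because both methods return new objects.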
5.2 Renaming Columns
df.rename(columns={'Name': 'Employee_Name'}, inplace=True)
6. Data Aggregation
6.1 Grouping Data
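# numeric_only=True sums only numeric columns instead of concatenating text columns
grouped_df = df.groupby('Age').sum(numeric_only=True)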
print(grouped_df)
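Grouping is most informative when several rows share a key. The sketch below assumes a hypothetical Department column, which is not part of the sample table above, and computes several statistics per group with agg.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Engineering'],  # hypothetical column
    'Salary': [50000, 60000, 70000],
})

# One row per department, one column per statistic.
summary = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])

print(summary)
```

Selecting the Salary column before aggregating keeps the result limited to the values of interest.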
6.2 Applying Functions
# Increase every salary by 10%
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)
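For simple arithmetic, the same result is available without apply, because operators on a Series work element by element. The sketch below compares the two approaches and then uses apply for a case with per-value logic. The 'high' and 'standard' labels and the 60000 threshold are made up for illustration.

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 70000])

with_apply = salaries.apply(lambda x: x * 1.1)  # calls the function per value
vectorized = salaries * 1.1                     # element-wise arithmetic

# apply with custom per-value logic (hypothetical salary bands).
bands = salaries.apply(lambda x: 'high' if x >= 60000 else 'standard')

print(vectorized)
print(bands)
```

The vectorized form is usually shorter and faster on large data; apply remains useful when the logic does not map onto a built-in operation.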
7. Visualization with Pandas
The plot method on a DataFrame uses Matplotlib as its default plotting backend, so Matplotlib must be installed.
import matplotlib.pyplot as plt
# The Name column was renamed to Employee_Name in section 5.2
df.plot(kind='bar', x='Employee_Name', y='Salary')
plt.show()
Conclusion
Pandas is an essential tool for working with structured data in Python. It allows easy manipulation, cleaning, and visualization of data. Mastering pandas helps in efficient data analysis and reporting.