Pandas:

Pandas is an open-source data analysis and manipulation library for Python. It provides the data structures and functions needed to manipulate large datasets efficiently. It is built on top of NumPy and is widely used in data science, machine learning, and analytics.

1. Installation and Setup

To install pandas, use:

pip install pandas

Or with Anaconda:

conda install pandas

To confirm that pandas is installed and check its version:

import pandas as pd
print(pd.__version__)

2. Pandas Data Structures

Pandas provides two primary data structures:

  • Series – A one-dimensional labeled array.
  • DataFrame – A two-dimensional table with labeled axes (rows and columns).
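The relationship between the two structures can be sketched as follows: each column of a DataFrame is itself a Series that shares the DataFrame's row index. The sample values here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'Age': [25, 30], 'Salary': [50000, 60000]})

# Selecting a single column returns a Series.
age = df['Age']
print(type(age).__name__)            # Series

# The Series carries the same row index as the DataFrame it came from.
print(age.index.equals(df.index))    # True
```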

2.1 Series

A Series is similar to a column in a spreadsheet. It consists of data and an index.

import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

Output:

0    10
1    20
2    30
3    40
dtype: int64

You can also specify custom index labels:

series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
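With custom labels in place, values can be looked up either by label or by position. A minimal sketch, reusing the same data and labels as above:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(series['b'])       # access by label -> 20
print(series.iloc[1])    # access by integer position -> 20

# Label-based slices include both endpoints, unlike positional slices.
print(series['b':'c'])
```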

2.2 DataFrame

A DataFrame is a table-like structure containing multiple columns.

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
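After building a DataFrame, it often helps to inspect its size, column names, and column types. The sketch below uses the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

print(df.shape)            # (rows, columns) -> (3, 3)
print(list(df.columns))    # ['Name', 'Age', 'Salary']
print(df.dtypes)           # data type of each column
print(df.describe())       # summary statistics for the numeric columns
```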

3. Reading and Writing Data

3.1 Reading CSV

df = pd.read_csv('sampledata.csv')
print(df.head())
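To try read_csv without a file on disk, the CSV text can be wrapped in io.StringIO, which read_csv accepts in place of a path. The CSV content below is a made-up stand-in, not the actual contents of sampledata.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Name,Age,Salary\nAlice,25,50000\nBob,30,60000\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows (up to 5)
print(df.dtypes)    # column types are inferred from the data
```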

3.2 Writing CSV

df.to_csv('output.csv', index=False)
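The effect of index=False can be seen by comparing the two outputs side by side. In this sketch, to_csv is called without a path so that it returns the CSV text as a string instead of writing a file; the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()                 # includes the row index as a first column
without_index = df.to_csv(index=False)   # omits the row index

print(with_index)
print(without_index)
```

Leaving the index out is usually the right choice when the index is just the default 0, 1, 2, ... numbering.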

4. Data Selection and Filtering

4.1 Selecting Columns

print(df['Name'])
print(df[['Name', 'Salary']])

4.2 Selecting Rows

print(df.iloc[0])  # first row, selected by integer position
print(df.loc[1])   # row whose index label is 1
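The difference between iloc and loc only becomes visible when the index labels are not the default 0, 1, 2, ... sequence. A small sketch, assuming hypothetical labels 10, 20, and 30:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],   # hypothetical non-default row labels
)

print(df.iloc[0]['Name'])   # first row by position -> Alice
print(df.loc[20]['Name'])   # row whose label is 20 -> Bob
print(df.loc[20, 'Age'])    # row label and column label together -> 30
```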

4.3 Filtering Data

filtered_df = df[df['Age'] > 25]
print(filtered_df)
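Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch using the same sample data as earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Rows where both conditions hold.
both = df[(df['Age'] > 25) & (df['Salary'] < 70000)]
print(both['Name'].tolist())     # ['Bob']

# Rows where at least one condition holds.
either = df[(df['Age'] >= 35) | (df['Salary'] >= 60000)]
print(either['Name'].tolist())   # ['Bob', 'Charlie']
```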

5. Data Cleaning

5.1 Handling Missing Values

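# fillna and dropna return new DataFrames; assign the result to keep it
df_filled = df.fillna(0)    # replace missing values with 0
df_dropped = df.dropna()    # drop rows that contain any missing value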
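The sample data used so far has no missing values, so the sketch below introduces a few with NumPy's NaN to show what each call does. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    'Age': [25, np.nan, 35],
    'Salary': [50000, 60000, np.nan],
})

print(df.isna().sum())     # number of missing values per column

filled = df.fillna(0)      # new DataFrame with NaN replaced by 0
dropped = df.dropna()      # new DataFrame without rows that contain NaN

print(filled)
print(dropped)             # only the first row has no missing values
```

Both methods leave df itself unchanged, which is why the results are assigned to new names.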

5.2 Renaming Columns

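# rename returns a new DataFrame; df keeps the 'Name' column used in later sections
renamed_df = df.rename(columns={'Name': 'Employee_Name'})
print(renamed_df.columns)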

6. Data Aggregation

6.1 Grouping Data

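# Select the numeric column before summing; adding up the text column 'Name' is not meaningful
grouped_df = df.groupby('Age')['Salary'].sum()
print(grouped_df)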
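Grouping is more informative when several rows share a key. The sketch below assumes a hypothetical Department column, which is not part of the sample data above, and computes a total and a few summary statistics per group.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT'],   # hypothetical grouping column
    'Salary': [50000, 60000, 70000],
})

# Total salary per department.
totals = df.groupby('Department')['Salary'].sum()
print(totals)

# Several aggregations at once with agg.
summary = df.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])
print(summary)
```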

6.2 Applying Functions

# Apply a 10% raise to every salary; the column becomes float
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)
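For simple arithmetic like this, the same result can be obtained by operating on the whole column at once, which is typically faster than calling a Python function per element. A sketch comparing the two approaches on made-up values:

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 70000])

# Element by element: the lambda is called once per value.
raised_apply = salaries.apply(lambda x: x * 1.1)

# Vectorized: the multiplication is applied to the whole Series at once.
raised_vectorized = salaries * 1.1

# Largest difference between the two results.
print((raised_apply - raised_vectorized).abs().max())
print(raised_vectorized.dtype)   # float64
```

apply remains useful when the transformation cannot be written as a column-level operation.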

7. Visualization with Pandas

DataFrame.plot uses Matplotlib as its default plotting backend, so Matplotlib must be installed for the example below to run.

import matplotlib.pyplot as plt
df.plot(kind='bar', x='Name', y='Salary')
plt.show()

Conclusion

Pandas is an essential tool for working with structured data in Python. It allows easy manipulation, cleaning, and visualization of data. Mastering pandas helps in efficient data analysis and reporting.