OCS353 Data Science Fundamentals Syllabus – Anna University Regulation 2021
COURSE OBJECTIVES:
● Familiarize students with the data science process.
● Understand the data manipulation functions in NumPy and Pandas.
● Explore different types of machine learning approaches.
● Understand and practice visualization techniques using tools.
● Learn to handle large volumes of data with case studies.
UNIT I INTRODUCTION
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Building the model – Presenting findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions of data
UNIT II DATA MANIPULATION
Python Shell – Jupyter Notebook – IPython Magic Commands – NumPy Arrays – Universal Functions – Aggregations – Computation on Arrays – Fancy Indexing – Sorting arrays – Structured data – Data manipulation with Pandas – Data Indexing and Selection – Handling missing data – Hierarchical indexing – Combining datasets – Aggregation and Grouping – String operations – Working with time series – High performance
UNIT III MACHINE LEARNING
The modeling process – Types of machine learning – Supervised learning – Unsupervised learning – Semi-supervised learning – Classification, regression – Clustering – Outliers and Outlier Analysis
UNIT IV DATA VISUALIZATION
Importing Matplotlib – Simple line plots – Simple scatter plots – Visualizing errors – Density and contour plots – Histograms – Legends – Colors – Subplots – Text and annotation – Customization – Three-dimensional plotting – Geographic Data with Basemap – Visualization with Seaborn
UNIT V HANDLING LARGE DATA
Problems – Techniques for handling large volumes of data – Programming tips for dealing with large data sets – Case studies: Predicting malicious URLs, Building a recommender system – Tools and techniques needed – Research question – Data preparation – Model building – Presentation and automation.
THEORY: 30 PERIODS
PRACTICAL EXERCISES: 30 PERIODS
LAB EXERCISES
1. Download, install and explore the features of Python for data analytics.
2. Working with NumPy arrays.
3. Working with Pandas data frames.
4. Basic plots using Matplotlib.
5. Statistical and probability measures:
a) Frequency distributions
b) Mean, Mode, Standard Deviation
c) Variability
d) Normal curves
e) Correlation and scatter plots
f) Correlation coefficient
g) Regression
6. Use the standard benchmark data set for performing the following:
a) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
b) Bivariate Analysis: Linear and logistic regression modelling.
7. Apply supervised learning algorithms and unsupervised learning algorithms on any data set.
8. Apply and explore various plotting functions on any data set.
Note: Example data sets include those from the UCI Machine Learning Repository, Iris, Pima Indians Diabetes, etc.
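A minimal Python sketch of what exercises 5(b), 5(e) to 5(g) and part of 6(a) might look like is given here. It assumes only the standard library and a small made-up data set; the names hours, scores, describe, pearson_r and fit_line are illustrative and do not come from the syllabus. In the lab itself, NumPy or Pandas would typically be used in place of these hand-written helpers.

```python
import math
import statistics
from collections import Counter

# Hypothetical sample data, invented for illustration only.
hours = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 76.0]


def describe(values):
    """Return basic univariate measures for a list of numbers."""
    return {
        "frequency": dict(Counter(values)),       # value -> count
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


summary = describe(hours)
r = pearson_r(hours, scores)
slope, intercept = fit_line(hours, scores)

print(summary)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Skewness and kurtosis from exercise 6(a) are left out of this sketch because the standard library has no direct function for them.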
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Gain knowledge of the data science process.
CO2: Perform data manipulation functions using NumPy and Pandas.
CO3: Understand different types of machine learning approaches.
CO4: Perform data visualization using tools.
CO5: Handle large volumes of data in practical scenarios.
TOTAL: 60 PERIODS
TEXT BOOKS
1. Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.
2. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
2. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
