Python/machine_learning/linear_regression.py

"""
Linear regression is the most basic type of regression commonly used for
predictive analysis. The idea is simple: we have a dataset and a set of
features associated with it. Features should be chosen carefully, as they
determine how well the model can make future predictions. Over many
iterations of gradient descent, we adjust the weights of these features so
that they best fit the dataset. This code uses a CSGO dataset (ADR vs
Rating) and fits a line through it to estimate the parameters.
"""
import numpy as np
import requests


def collect_dataset():
    """Collect the CSGO dataset.
    The dataset contains the ADR vs Rating of players.
    :return : dataset obtained from the link, as a matrix
    """
    # A timeout keeps the script from hanging if the server never responds.
    response = requests.get(
        "https://raw.githubusercontent.com/yashLadha/"
        "The_Math_of_Intelligence/master/Week1/ADRvs"
        "Rating.csv",
        timeout=10,
    )
    lines = response.text.splitlines()
    data = []
    for item in lines:
        item = item.split(",")
        data.append(item)
    data.pop(0)  # Drop the header row (column labels)
    dataset = np.matrix(data)
    return dataset


def run_steep_gradient_descent(data_x, data_y, len_data, alpha, theta):
    """Run steep gradient descent and update the feature vector accordingly
    :param data_x   : contains the dataset
    :param data_y   : contains the output associated with each data entry
    :param len_data : length of the data
    :param alpha    : learning rate of the model
    :param theta    : feature vector (weights for our model)
    :return         : updated features, using
                      curr_features - alpha * gradient (w.r.t. feature)
    """
    n = len_data

    prod = np.dot(theta, data_x.transpose())
    prod -= data_y.transpose()
    sum_grad = np.dot(prod, data_x)
    theta = theta - (alpha / n) * sum_grad
    return theta
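

# Illustrative sketch (not part of the original module): the same batch update,
#     theta := theta - (alpha / n) * sum((theta . x_i - y_i) * x_i),
# written out in plain Python so that each term is visible. The helper name
# and the list-based inputs are assumptions made for this example only.
def _gradient_step_sketch(rows, targets, alpha, theta):
    """Return theta after one batch gradient descent step (pure Python).
    :param rows    : list of feature lists, each starting with 1.0 for the bias
    :param targets : list of expected outputs, one per row
    :param alpha   : learning rate
    :param theta   : current weights, one per feature
    """
    n = len(rows)
    grad = [0.0] * len(theta)
    for row, target in zip(rows, targets):
        # Residual of the current prediction for this sample
        residual = sum(t * x for t, x in zip(theta, row)) - target
        for j, x in enumerate(row):
            grad[j] += residual * x
    return [t - (alpha / n) * g for t, g in zip(theta, grad)]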


def sum_of_square_error(data_x, data_y, len_data, theta):
    """Return the sum of squared errors, scaled by 1 / (2 * len_data)
    :param data_x    : contains our dataset
    :param data_y    : contains the output (result vector)
    :param len_data  : length of the dataset
    :param theta     : contains the feature vector
    :return          : scaled sum of squared errors for the given features
    """
    prod = np.dot(theta, data_x.transpose())
    prod -= data_y.transpose()
    sum_elem = np.sum(np.square(prod))
    error = sum_elem / (2 * len_data)
    return error
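

# Illustrative sketch (not part of the original module): the cost above is the
# sum of squared residuals divided by (2 * len_data). The plain-Python version
# below assumes len_data equals the number of rows; the helper name and the
# list-based inputs are assumptions made for this example only.
def _half_mse_sketch(rows, targets, theta):
    """Return sum((theta . x_i - y_i) ** 2) / (2 * n) for the given rows."""
    n = len(rows)
    total = 0.0
    for row, target in zip(rows, targets):
        residual = sum(t * x for t, x in zip(theta, row)) - target
        total += residual * residual
    return total / (2 * n)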


def run_linear_regression(data_x, data_y):
    """Implement linear regression over the dataset
    :param data_x  : contains our dataset
    :param data_y  : contains the output (result vector)
    :return        : feature vector for the line of best fit
    """
    iterations = 100000
    alpha = 0.0001550

    no_features = data_x.shape[1]
    len_data = data_x.shape[0]  # number of samples (rows)

    theta = np.zeros((1, no_features))

    for i in range(0, iterations):
        theta = run_steep_gradient_descent(data_x, data_y, len_data, alpha, theta)
        error = sum_of_square_error(data_x, data_y, len_data, theta)
        print("At Iteration %d - Error is %.5f " % (i + 1, error))

    return theta
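

# Illustrative sketch (not part of the original module): for a single feature
# plus a bias term, ordinary least squares has a closed-form solution. This is
# a different technique from the gradient descent used above; it is shown only
# as a reference value that the iterative result could be compared against.
# The helper name and the list-based inputs are assumptions for this example.
def _closed_form_line_sketch(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope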


def main():
    """Driver function"""
    data = collect_dataset()

    len_data = data.shape[0]
    data_x = np.c_[np.ones(len_data), data[:, :-1]].astype(float)
    data_y = data[:, -1].astype(float)

    theta = run_linear_regression(data_x, data_y)
    len_result = theta.shape[1]
    print("Resultant Feature vector : ")
    for i in range(0, len_result):
        print(f"{theta[0, i]:.5f}")


if __name__ == "__main__":
    main()