Our approach to this model is to discover key variables that can help accurately determine the sale price of a home in King County. Our first task is to identify which business questions can be answered with the raw data at hand. Second, we will clean the data, eliminating features that do not help answer those business questions and adding features that give better insight into what drives price. Once we have an initial model, we will tweak features and fit the model to our training split, then evaluate it on the held-out test data for more accurate predictions. Our end goal is to minimize error as much as possible, provide insights on what affects price, and develop a strategy for answering the business questions posed.
# import modules needed for data analysis and get them ready for use in the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from geopy import distance
import math
from sklearn.linear_model import LassoCV
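# Instantiate the scaler, linear regression, and cross-validated lasso reused throughout the notebook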
ss = StandardScaler()
lr = LinearRegression()
lscv = LassoCV(max_iter=150000)
raw_data = pd.read_csv("kc_house_data.csv")
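# Drop columns we will not use, remove the 33-bedroom outlier record, and fill remaining missing values with 0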
drop_raw = raw_data.drop(['id', 'date', 'condition',
'sqft_above', 'sqft_basement', 'yr_renovated',
'sqft_living15', 'sqft_lot15'], axis=1)
clean_data = drop_raw[drop_raw['bedrooms'] != 33].copy()
clean_data = clean_data.fillna(0)
# Make a new column of (latitude, longitude) pairs to help compute distances to points of interest
clean_data['lat_and_long'] = list(zip(clean_data['lat'], clean_data['long']))
finder = clean_data[clean_data['zipcode']==98005]
# geopy expects coordinates as (lat, long), i.e. (y, x)
avg_lat_98005 = np.mean(finder['lat'])
avg_long_98005 = np.mean(finder['long'])
(avg_lat_98005, avg_long_98005)
#center of Bellevue based on zipcodes
#Denote points of interest based on lat and long.
mid_of_bellevue = (avg_lat_98005, avg_long_98005)
seattle_lat_long = (47.6062, -122.3321)
airport_lat_long = (47.4502, -122.3088)
snoq_falls_lat_long = (47.5417, -121.8377)
vancouver_lat_long = (49.2827, -123.1207)
mt_rain_lat_long = (46.8523, -121.7603)
oly_lat_long = (47.8021, -123.6044)
tacoma_lat_long = (47.2529, -122.4443)
stevens_lat_long = (47.7448, -121.0890)
# Compute the distance of each sold house from each point of interest
distances_col = [distance.distance(elem, mid_of_bellevue).miles for elem in clean_data['lat_and_long']]
seattle_distances_col = [distance.distance(elem, seattle_lat_long).miles for elem in clean_data['lat_and_long']]
seatac_distances_col = [distance.distance(elem, airport_lat_long).miles for elem in clean_data['lat_and_long']]
snoq_falls_dist_col = [distance.distance(elem, snoq_falls_lat_long).miles for elem in clean_data['lat_and_long']]
vanc_dist_col = [distance.distance(elem, vancouver_lat_long).miles for elem in clean_data['lat_and_long']]
mt_rain_dist_col = [distance.distance(elem, mt_rain_lat_long).miles for elem in clean_data['lat_and_long']]
oly_dist_col = [distance.distance(elem, oly_lat_long).miles for elem in clean_data['lat_and_long']]
tacoma_dist_col = [distance.distance(elem, tacoma_lat_long).miles for elem in clean_data['lat_and_long']]
stevens_dist_col = [distance.distance(elem, stevens_lat_long).miles for elem in clean_data['lat_and_long']]
#Create new columns based on the distances developed above
clean_data['dist_from_bellevue'] = distances_col
clean_data['dist_from_seattle'] = seattle_distances_col
clean_data['dist_from_seatac'] = seatac_distances_col
clean_data['dist_from_snoq_falls'] = snoq_falls_dist_col
clean_data['dist_from_vancouver'] = vanc_dist_col
clean_data['dist_from_mt_rain'] = mt_rain_dist_col
clean_data['dist_from_oly'] = oly_dist_col
clean_data['dist_from_tacoma'] = tacoma_dist_col
clean_data['dist_from_stevens'] = stevens_dist_col
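# Engineer additional features; note that sqft_living is overwritten with its natural log below,
# so the derived features that follow are built from that log value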
clean_data['sqft_living'] = np.log(clean_data['sqft_living'])
clean_data['beds_and_baths'] = np.log(clean_data['bedrooms'] * clean_data['bathrooms'])
clean_data['sqft_living_div_floors_div_sqft_lot'] = (clean_data['sqft_living']/clean_data['floors'])/clean_data['sqft_lot']
clean_data['squared_living'] = np.log(np.square(clean_data['sqft_living']))
clean_data['squared_lot'] = np.square(clean_data['sqft_lot'])
clean_data['sqft_divby_bedroom'] = np.log(clean_data['sqft_living']/clean_data['bedrooms'])
clean_data['dist_seatac_seattle'] = (clean_data['dist_from_seatac'] + clean_data['dist_from_seattle'])/2
clean_data['dist_seatac_bellevue'] = (clean_data['dist_from_seatac'] + clean_data['dist_from_bellevue'])/2
clean_data['square_dist_seatac'] = np.square(clean_data['dist_from_seatac'])
clean_data['square_seatac_bellevue'] = np.square(clean_data['dist_seatac_bellevue'])
clean_data['sqft_times_grade'] = np.log(clean_data['sqft_living'] * clean_data['grade'])
clean_data['age'] = 2019 - clean_data['yr_built']
clean_data['sq_age'] = np.square(clean_data['age'])
clean_data['water_weight'] = np.log((1+clean_data['waterfront']) * clean_data['sqft_living'])
clean_data['view_weight'] = np.log((1+clean_data['view']) * clean_data['sqft_living'])
clean_data_dist = clean_data.drop(['lat', 'long', 'lat_and_long'], axis=1).copy()
clean_data_dist.drop('zipcode', axis=1).corr().loc[['price']].T
# See the correlation of price with every column
ohe = OneHotEncoder(drop='first', categories='auto')
price_zip_trans = ohe.fit_transform(clean_data_dist['zipcode'].values.reshape(-1,1))
zip_sparse = pd.DataFrame(price_zip_trans.todense(), columns=ohe.get_feature_names())
# One-hot encode zipcode so each individual zipcode becomes its own indicator column
clean_data_dist_no_zip = clean_data_dist.drop(['zipcode'], axis=1).copy()
# Remove the original zipcode column now that each zipcode is represented by its own indicator column
clean_data_dist_no_zip['log_price'] = np.log(clean_data_dist_no_zip['price'])
clean_data_dist_no_zip = clean_data_dist_no_zip.drop(['price', 'waterfront', 'sqft_living_div_floors_div_sqft_lot','squared_lot', 'age', 'sq_age'], axis=1)
# Reset the index so rows line up with zip_sparse (one record was dropped above, leaving a gap in the index)
model_data = zip_sparse.join(clean_data_dist_no_zip.reset_index(drop=True), how='inner')
#model_data.head()
# Define the data we will model on by joining the cleaned features with the zipcode indicator matrix,
# so every variable is in one place.
X = model_data.drop('log_price', axis=1)
y = model_data['log_price']
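# Fit an OLS model with statsmodels (adding a constant term) to inspect the full coefficient summary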
predictors = sm.add_constant(X)
model_stats = sm.OLS(y, predictors).fit()
model_stats.summary()
X_train, X_test, y_train, y_test = train_test_split(X,
y,
random_state=10)
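# Scale features using statistics from the training split only, then fit and score the linear regression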
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)
lr.fit(X_train_sc, y_train)
lr.score(X_test_sc, y_test)
#list(zip(lr.coef_, X_train.columns))
price_predict = lr.predict(X_test_sc)
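# Convert the log-price predictions and targets back to dollars, then compute MSE and RMSE in dollars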
y_test_non_log = np.exp(y_test)
price_predict_non_log = np.exp(price_predict)
metrics.mean_squared_error(y_test_non_log, price_predict_non_log)
math.sqrt(metrics.mean_squared_error(y_test_non_log, price_predict_non_log))
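# Fit a cross-validated lasso on the same scaled training data and compare its score, RMSE, and chosen alpha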
lscv.fit(X_train_sc, y_train)
lscv.score(X_test_sc, y_test)
price_predict2 = lscv.predict(X_test_sc)
price_predict2_non_log = np.exp(price_predict2)
math.sqrt(metrics.mean_squared_error(y_test_non_log, price_predict2_non_log))
lscv.alpha_
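# Recover the non-log prices and build the list of columns (zipcode dummies plus the price columns) to exclude from the scatter plots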
non_log_prices = np.exp(model_data['log_price'])
non_log_prices
compiled_data = model_data.copy()
compiled_data['nonlog_prices'] = non_log_prices
colus=ohe.get_feature_names()
to_exclude = list(colus)
to_exclude.append('log_price')
to_exclude.append('nonlog_prices')
#to_exclude
for feature in compiled_data.drop(to_exclude, axis=1).columns:
    compiled_data.plot(x=feature, y='log_price', kind='scatter', title=f"{feature}", figsize=(12,10));
    plt.savefig(f"{feature} with log(price).png")
#multiple plots to visualize our log and actual prices versus our various variables
for feature in compiled_data.drop(to_exclude, axis=1).columns:
    compiled_data.plot(x=feature, y='nonlog_prices', kind='scatter', title=f"{feature}", figsize=(12,10));
    plt.savefig(f"{feature} with price.png")
lscv.coef_
list(zip(lscv.coef_, X_train.columns))
The distances from both SeaTac Airport and Vancouver have high coefficients. This could imply that people need to be close to those locations for work or pleasure; alternatively, it could simply reflect a higher density of houses near them. We would need to explore the correlation between those distance features further. Unsurprisingly, the square footage of a home is a major factor in its price, and coupling it with other features measured on a scale, such as grade, helped the model learn.
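As a first step toward that follow-up, here is a minimal sketch of how we might inspect the relationship between the distance features, using the clean_data_dist frame built above:
# Correlation matrix of selected distance features to check for multicollinearity
dist_cols = ['dist_from_seatac', 'dist_from_vancouver', 'dist_from_seattle', 'dist_from_bellevue']
clean_data_dist[dist_cols].corr()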
Our model, though not perfect, helps show the various factors that can affect house price in King County. The root mean squared error on price is just over $130k for both the linear regression and the lasso model, meaning we can expect predictions of a home's sale price to be off by roughly that amount on average. Using data that spans more years, including more records from the rural zipcodes of King County, and/or grouping zipcodes into bins so that price reflects a broader area rather than arbitrary zipcode boundaries may all improve the model. Our presentation mentions further next steps we would take given the time and tools, but we feel that, even in its current iteration, the model can support some business decisions based on our findings.
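To illustrate the zipcode-binning idea, a minimal sketch is below; the region_map grouping is hypothetical and would need to come from local knowledge or clustering rather than from this notebook:
# Hypothetical example: map individual zipcodes to coarser regions before one-hot encoding
# (this mapping is illustrative only, not a vetted grouping)
region_map = {98004: 'bellevue', 98005: 'bellevue', 98103: 'seattle_core', 98115: 'seattle_core'}
clean_data['region'] = clean_data['zipcode'].map(region_map).fillna('other')
clean_data.groupby('region')['price'].median()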