In this notebook will be performed the analysis of the craigslist dataset, a well-known American portal that hosts ads dedicated to jobs, events, shopping, dating and other services.
The dataset was created via web scraping by the author himself.
### Keeping safe from any disaster :) ###
%autosave 30
For these analyses we used some libraries covered during the course and partly others based on their documentation. Below is the list:
plotly
seaborn
### Imports ###
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
import seaborn as sns
import matplotlib.pyplot as plt
### Reading .csv dataset ###
df = pd.read_csv("resources/dataset/vehicles.csv")
df.head(1)
df.columns
df.info()
Attempting to launch the following cell we immediately notice that the computation time is very high.
# plt.figure(figsize=(27,12))
# plt.title("Missing values for each column")
# dfT=df.isnull().transpose()
# fig=sns.heatmap(dfT)
# plt.savefig('density.png', dpi=300)
The considered dataset needs rather onerous computational times for certain operations, therefore, we tried to circumvent the problem by initially considering only the observations recorded between 2016 and 2019.
This was not a very efficient approach: on the one hand the computational time had been reduced, but on the other hand we lost important data, reducing the value of the collection.
### No more used ###
### Filtering more recent occurrences ###
# mask=(df["year"]>2015.0) & (df["year"]<2020)
### Applying the mask ###
# dfByYear=df[mask]
### Convert dataframe object to .csv file ###
# dfByYear.to_csv("resources/dataset/filteredByYear.csv", index=False)
### Reading from filesystem .csv dataset ###
#dfByYear = pd.read_csv("resources/dataset/filteredByYear.csv")
We therefore decided to consider 80'000 samples for the longer calculations.
### Dataframe sampling ###
dfs=df.sample(n=80000)
plt.figure(figsize=(27,12))
plt.title("Missing values for each column")
dfT=dfs.isnull().transpose()
fig=sns.heatmap(dfT)
plt.savefig('density.png', dpi=300)
### Dropping for semplicity useless columns ###
df = df.drop(columns=["url","region_url","image_url","county"])
The first obvious question we asked ourselves is what is the best selling brand on Craiglist.
We expected as an answer neither a luxury brand nor a low quality one, otherwise no one would buy it.
### Create a new grouped by manufacturer dataframe object###
occurrenciesByManufacturer = dfs.groupby(["manufacturer"]).size().reset_index(name="counts")
### Getting relative frequencies ###
occurrenciesByManufacturer["relative_freq"] = occurrenciesByManufacturer["counts"] /\
occurrenciesByManufacturer["counts"].sum()
### Sorting dataframe by number of occurrences for each manufacturer ###
occurrenciesByManufacturer = occurrenciesByManufacturer.sort_values(
"counts", ascending=True)
### Giving our char a name ###
title = "Number of traded cars for each manufacturer"
### Creating the bar chart ###
fig = px.bar(occurrenciesByManufacturer, y="manufacturer",
x="counts", text="counts", title=title, height=1000)
### Formatting values labels ###
fig.update_traces(texttemplate="%{text:.2s}", textposition="outside")
### Show ###
fig.show()
# save as html
#plotly.offline.plot(fig, filename=f"{figTitle}.html")
The causes are mainly as follows:
The answer, in our not-so-predictable opinion, is the pickup.
In fact, this type of vehicle is much bought and sold in the American territory. Even in the movies it is seen as the typical car used by the good family man.
Some of the most relevant reasons:
### Considering only for vehicles ###
result=df.loc[df["manufacturer"]=="ford", 'model']
### Create a new dataframe instance based on the previous mask ###
resultDf=pd.DataFrame(result)
### Creating the histogram ###
px.histogram(resultDf, barmode="overlay", y="model", height=1000)
### Taking some model labels and values
modelLabels = df[df["manufacturer"]=="ford"].model.value_counts().head(10).index
modelValues = df[df["manufacturer"]=="ford"].model.value_counts().head(10).values
### Making place for other traces ###
fig = make_subplots(rows=1, cols=1, specs=[[{"type":"domain"}]])
### Adding the new trace###
fig.add_trace(go.Pie(labels=modelLabels, values=modelValues, title="Ford models"),
1, 1)
### Show ###
fig.show()
### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")
Initially, we thought that pickups were present more in arid/off-road areas and less in urban areas.
However, the following chart shows, on the contrary, how the distribution of pickups is dense especially in large cities.
### Cleaning from non-specified values ###
df=df.dropna(subset=["manufacturer"])
### Creating first trace ###
fig=px.scatter_mapbox(df[df["type"]=="pickup"],
lat="lat", lon="long", hover_name="manufacturer",
hover_data=["odometer", "price", "state"], color_discrete_sequence=["blue"],
zoom=4, height=600)
### Customizing map ###
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
### Show ###
fig.show()
### Optionally save it as .html/.png file ###
#title="Geographical position of pickups"
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")
While comparing the location of all pickups with that of sedans, thus vehicles that tend to be used in urban areas, we did not find that there is a possible logical pattern that would meet our expectations.
### Cleaning from non-specified values ###
df=df.dropna(subset=["manufacturer"])
### Creating first trace ###
fig=px.scatter_mapbox(df[df["type"]=="hatchback"],
lat="lat", lon="long", hover_name="manufacturer",
hover_data=["odometer", "price", "state"], color_discrete_sequence=["red"],
zoom=4, height=600)
### Customizing map ###
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
### Show ###
fig.show()
### Optionally save it as .html/.png file ###
#title="Geographical position of hatchbacks"
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")
Up to 3 years of age there is a very noticeable growth.
We notice a maximum peak after 3 years, and from the fifth year onwards the number of cars sold tends to decrease.
Arriving at 12 years we have a local maximum point, probably due to the economic crisis of 2008.
People needed liquidity, so they may have started selling more machines than before.
### Cleaning dataframe from null-year fields ###
df = df[df.year.notnull()]
### Computing ages for each vehicles ###
### For simplicity we will consider 2020 ###
df["age"] = df.year.apply(lambda x: int(2020-x))
### Considering only vehicles with age [0,30] ###
df = df[(df.age >= 0) & (df.age <= 30)]
### Creating a new instance grouping the datagrame by age ###
groupedByAge = df.groupby(["age"]).size().reset_index(name="counts").set_index("age")
### Chart title ###
title="Number of vehicles for each age"
### Creating the line chart ###
fig = px.line(groupedByAge, y="counts", title=title)
### Updating yaxes label ###
fig.update_yaxes(title_text="number of vehicles")
### Show ###
fig.show()
### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")
An overwhelming victory of gasoline vehicles can be inferred.
From the data collected by Craigslist, we don't see much growth in electric as per our expectations.
Probably not enough are sold on Craigslist yet to compare to alternatives.
### Cleaning from non-specified values ###
df=df.dropna(subset=["year"])
df=df.dropna(subset=["fuel"])
### Chart title ###
title="Number of cars by fuel"
### Creating the histogram ###
fig=px.histogram(df, x="year", facet_col="fuel", width=800, title=title, facet_col_wrap=2)
### Customizing order shown ###
# fig.update_yaxes(title_text="number of vehicles")
### Show ###
fig.show()
### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")
From the graph, we can see that as time progresses, the price, in general, tends to rise, which is certainly due to technological evolution.
As far as diesel is concerned, the median price has been over 20k since 2011, and has continued to rise steadily over the years.
Electric on the other hand is being introduced for the first time in used since 2014 on craiglist. In the last few years as you can tell this is making inroads into the automotive market. So on the price is not yet possible to extrapolate meaningful and concrete analysis.
The median of gasoline generally tends to increase slightly without sharp fluctuations in price.
title="Boxplot price by fuel"
df["price"]=df["price"].dropna()
df=df.sort_values(by=["year"])
fig=px.box(df,y="fuel",orientation="h", x="price", animation_frame="year", range_x=[0, 100000])
fig.update_layout(title=title)
It is intuitive to see how, the average mileage increases proportionally as age increases.
title="Average of odometer by car manufacturer"
fig=px.histogram(df,orientation="h",animation_frame="year", y="manufacturer",
x="odometer", histfunc="avg", height=1000)
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(title=title)
SUVs are the most frequent vehicle type on the craiglist site. In general, it can be inferred that black tends to be more frequent in most vehicles. You can immediately see that the most frequent colors for SUVs and sedans are:
x = df.type
y = df.paint_color
fig = go.Figure(go.Histogram2d(
x=x,
y=y
))
title = "Colors of vehicles by type"
xaxis_title= "Type of vehicle"
yaxis_title= "Color"
fig.update_layout(
title=title,
xaxis_title=xaxis_title,
yaxis_title=yaxis_title)
fig.show()
In general we can say that the condition is based on the honesty of the advertisers. Leaving this aside we can definitely say that a good part of electric vehicles are described "as new", this is because the electric has been recently introduced in the automotive market.
df=df.dropna(subset=["condition"])
gasLabels = df[df["fuel"]=="gas"].condition.value_counts().head(10).index
gasValues = df[df["fuel"]=="gas"].condition.value_counts().head(10).values
dieselLabels = df[df["fuel"]=="diesel"].condition.value_counts().head(10).index
dieselValues = df[df["fuel"]=="diesel"].condition.value_counts().head(10).values
electricLabels = df[df["fuel"]=="electric"].condition.value_counts().head(10).index
electricValues = df[df["fuel"]=="electric"].condition.value_counts().head(10).values
hybridLabels = df[df["fuel"]=="hybrid"].condition.value_counts().head(10).index
hybridValues = df[df["fuel"]=="hybrid"].condition.value_counts().head(10).values
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=gasLabels, values=gasValues, name="Gas Car", title="Gas Cars"), 1, 1)
fig.add_trace(go.Pie(labels=dieselLabels, values=dieselValues, name="Diesel Car",title="Diesel Cars"), 1, 2)
fig.add_trace(go.Pie(labels=electricLabels, values=electricValues, name="Electric Car",title="Electric Cars"), 2, 1)
fig.add_trace(go.Pie(labels=hybridLabels, values=hybridValues, name="Hybrid Car",title="Hybrid Cars"), 2, 2)
# Use hole to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.show()
Tendianzalmente la curva della media del prezzo tende a scendere in modo lineare con l'aumentare degli anni. Ciò è dovuto principalmente all'usura del veicolo e all'obsolescenza e senescenza che subisce nel tempo.
def display_price(df, age = (0,12), price = (100,100000), vehicle_type = "all", state = "all"):
# Display the median price of vehicles depending on its type and its state.
if state != "all":
df = df[df["state"] == state]
if vehicle_type != "all":
df = df[df["type"] == vehicle_type]
df = df[(df["age"] <= age[1]) & (df["age"] >= age[0])]
df = df[(df["price"] >= price[0]) & (df["price"] <= price[1])]
price_age = pd.pivot_table(df, values = "price", index = "age", aggfunc= np.median)
price_age.columns = ["Median Price"]
fig = plt.figure(figsize=(12,6))
ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([0.6,0.47,.35,.35])
ax.plot(price_age["Median Price"], lw = 5)
ax2.set_title(f"Vehicle type: {vehicle_type}\nNumber of vehicles: {df.shape[0]}\nCountry: USA\nUS-State: {state}", fontsize = 15)
ax2.axis('off')
ax.set_title(f"Median price by age of the vehicles",fontsize=25)
ax.set_ylim(0,price_age["Median Price"].max()+1000)
ax.set_xlabel("Age", fontsize = 15)
ax.set_ylabel("Median price in $", fontsize = 15)
ax.tick_params(axis='both', which='major', labelsize=15)
plt.show()
df = df.dropna(subset=["type"])
df = df[df.year.notnull()]
df["age"] = df.year.apply(lambda x: int(2020-x))
df = df[(df.age >= 0) & (df.age <= 30)]
for t in df.type.unique()[:5]:
display_price(df, vehicle_type=t)