Group M (Mohamed Boutaleb, Francesco Moro)

Data Science Project - Craigslist Analysis

Introduction

In this notebook will be performed the analysis of the craigslist dataset, a well-known American portal that hosts ads dedicated to jobs, events, shopping, dating and other services.

The dataset was created via web scraping by the author himself.

In [1]:
### Keeping safe from any disaster :) ###
%autosave 30
Autosaving every 30 seconds

Libraries used

For these analyses we used some libraries covered during the course and partly others based on their documentation. Below is the list:

  • numpy
  • pandas
  • plotly

  • seaborn

  • matplotlib
In [2]:
### Imports ###
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly

import seaborn as sns
import matplotlib.pyplot as plt

Reading the dataset

In order to properly import the dataset you need to:

  • Download the zipper from here
  • Extract it
  • Move it into the directory /resources/dataset/.
In [3]:
### Reading .csv dataset ###
df = pd.read_csv("resources/dataset/vehicles.csv")

Preliminary analysis

What are the columns?

In [4]:
df.head(1)
Out[4]:
id url region region_url price year manufacturer model condition cylinders ... drive size type paint_color image_url description county state lat long
0 7184791621 https://duluth.craigslist.org/ctd/d/duluth-200... duluth / superior https://duluth.craigslist.org 6995 2000.0 gmc new sierra 1500 excellent 8 cylinders ... 4wd NaN NaN red https://images.craigslist.org/00n0n_f06ykBMcdh... 2000 *** GMC New Sierra 1500 Ext Cab 157.5 WB... NaN mn 46.8433 -92.255

1 rows × 25 columns

In [5]:
df.columns
Out[5]:
Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'vin', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long'],
      dtype='object')
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423857 entries, 0 to 423856
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            423857 non-null  int64  
 1   url           423857 non-null  object 
 2   region        423857 non-null  object 
 3   region_url    423857 non-null  object 
 4   price         423857 non-null  int64  
 5   year          328743 non-null  float64
 6   manufacturer  313242 non-null  object 
 7   model         325384 non-null  object 
 8   condition     176719 non-null  object 
 9   cylinders     197679 non-null  object 
 10  fuel          327214 non-null  object 
 11  odometer      270585 non-null  float64
 12  title_status  327759 non-null  object 
 13  transmission  328065 non-null  object 
 14  vin           184420 non-null  object 
 15  drive         231119 non-null  object 
 16  size          102627 non-null  object 
 17  type          241157 non-null  object 
 18  paint_color   222203 non-null  object 
 19  image_url     329661 non-null  object 
 20  description   329654 non-null  object 
 21  county        0 non-null       float64
 22  state         423857 non-null  object 
 23  lat           324404 non-null  float64
 24  long          324404 non-null  float64
dtypes: float64(5), int64(2), object(18)
memory usage: 80.8+ MB

A little more concrete vision

Attempting to launch the following cell we immediately notice that the computation time is very high.

In [7]:
# plt.figure(figsize=(27,12))
# plt.title("Missing values for each column")
# dfT=df.isnull().transpose()
# fig=sns.heatmap(dfT)

# plt.savefig('density.png', dpi=300)

A first approach

The considered dataset needs rather onerous computational times for certain operations, therefore, we tried to circumvent the problem by initially considering only the observations recorded between 2016 and 2019.

This was not a very efficient approach: on the one hand the computational time had been reduced, but on the other hand we lost important data, reducing the value of the collection.

In [8]:
### No more used ###

### Filtering more recent occurrences ###
# mask=(df["year"]>2015.0) & (df["year"]<2020)

### Applying the mask ###
# dfByYear=df[mask]

### Convert dataframe object to .csv file ###
# dfByYear.to_csv("resources/dataset/filteredByYear.csv", index=False)

### Reading from filesystem .csv dataset ###
#dfByYear = pd.read_csv("resources/dataset/filteredByYear.csv")

Sampling

We therefore decided to consider 80'000 samples for the longer calculations.

In [9]:
### Dataframe sampling ###
dfs=df.sample(n=80000)
In [10]:
plt.figure(figsize=(27,12))
plt.title("Missing values for each column")
dfT=dfs.isnull().transpose()
fig=sns.heatmap(dfT)

plt.savefig('density.png', dpi=300)
In [11]:
### Dropping for semplicity useless columns ###
df = df.drop(columns=["url","region_url","image_url","county"])

What are the best selling brands?

The first obvious question we asked ourselves is what is the best selling brand on Craiglist.

We expected as an answer neither a luxury brand nor a low quality one, otherwise no one would buy it.

In [12]:
### Create a new grouped by manufacturer dataframe object###
occurrenciesByManufacturer = dfs.groupby(["manufacturer"]).size().reset_index(name="counts")

### Getting relative frequencies ###
occurrenciesByManufacturer["relative_freq"] = occurrenciesByManufacturer["counts"] /\
    occurrenciesByManufacturer["counts"].sum()

### Sorting dataframe by number of occurrences for each manufacturer ###   
occurrenciesByManufacturer = occurrenciesByManufacturer.sort_values(
    "counts", ascending=True)

### Giving our char a name ###
title = "Number of traded cars for each manufacturer"

### Creating the bar chart ###
fig = px.bar(occurrenciesByManufacturer, y="manufacturer",
             x="counts", text="counts", title=title, height=1000)

### Formatting values labels ###
fig.update_traces(texttemplate="%{text:.2s}", textposition="outside")

### Show ###
fig.show()

# save as html
#plotly.offline.plot(fig, filename=f"{figTitle}.html")

Why Ford?

The causes are mainly as follows:

  • Founded in the USA in 1903
  • Industrial Revolution -> with the advent of the assembly line, mass production was born of Ford as well.
  • American nationalism
  • Famous for rather large vehicles, loved by Americans.

What is the best-selling model?

The answer, in our not-so-predictable opinion, is the pickup.

In fact, this type of vehicle is much bought and sold in the American territory. Even in the movies it is seen as the typical car used by the good family man.

Some of the most relevant reasons:

  • Enough power to tow or transport just about anything.
  • More than enough interior space
  • Fair amount of horsepower to transport important loads
  • Remarkable fuel economy
  • Studies showing that pickups are safer than other cars: cheaper insurance!
In [13]:
### Considering only for vehicles ###
result=df.loc[df["manufacturer"]=="ford", 'model']

### Create a new dataframe instance based on the previous mask ###
resultDf=pd.DataFrame(result)

### Creating the histogram ###
px.histogram(resultDf, barmode="overlay", y="model", height=1000)

### Taking some model labels and values 
modelLabels = df[df["manufacturer"]=="ford"].model.value_counts().head(10).index
modelValues = df[df["manufacturer"]=="ford"].model.value_counts().head(10).values

### Making place for other traces ###
fig = make_subplots(rows=1, cols=1, specs=[[{"type":"domain"}]])

### Adding the new trace###
fig.add_trace(go.Pie(labels=modelLabels, values=modelValues, title="Ford models"),
              1, 1)

### Show ###
fig.show()

### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")

Where are pickups used?

Initially, we thought that pickups were present more in arid/off-road areas and less in urban areas.

However, the following chart shows, on the contrary, how the distribution of pickups is dense especially in large cities.

In [14]:
### Cleaning from non-specified values ###
df=df.dropna(subset=["manufacturer"])

### Creating first trace ###
fig=px.scatter_mapbox(df[df["type"]=="pickup"], 
                             lat="lat", lon="long", hover_name="manufacturer", 
                        hover_data=["odometer", "price", "state"], color_discrete_sequence=["blue"], 
                             zoom=4, height=600)

### Customizing map ###
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

### Show ###
fig.show()

### Optionally save it as .html/.png file ###
#title="Geographical position of pickups"
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")

Pickup-berline comparison

While comparing the location of all pickups with that of sedans, thus vehicles that tend to be used in urban areas, we did not find that there is a possible logical pattern that would meet our expectations.

In [15]:
### Cleaning from non-specified values ###
df=df.dropna(subset=["manufacturer"])

### Creating first trace ###
fig=px.scatter_mapbox(df[df["type"]=="hatchback"], 
                             lat="lat", lon="long", hover_name="manufacturer", 
                        hover_data=["odometer", "price", "state"], color_discrete_sequence=["red"], 
                             zoom=4, height=600)

### Customizing map ###
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

### Show ###
fig.show()

### Optionally save it as .html/.png file ###
#title="Geographical position of hatchbacks"
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")

After how many years is a vehicle resold?

Up to 3 years of age there is a very noticeable growth. We notice a maximum peak after 3 years, and from the fifth year onwards the number of cars sold tends to decrease.

Arriving at 12 years we have a local maximum point, probably due to the economic crisis of 2008. People needed liquidity, so they may have started selling more machines than before.

In [16]:
### Cleaning dataframe from null-year fields ###
df = df[df.year.notnull()]

### Computing ages for each vehicles ###
### For simplicity we will consider 2020 ###
df["age"] = df.year.apply(lambda x: int(2020-x))

### Considering only vehicles with age [0,30] ###
df = df[(df.age >= 0) & (df.age <= 30)]
In [17]:
### Creating a new instance grouping the datagrame by age ###
groupedByAge = df.groupby(["age"]).size().reset_index(name="counts").set_index("age")

### Chart title ###
title="Number of vehicles for each age"

### Creating the line chart ###
fig = px.line(groupedByAge, y="counts", title=title)

### Updating yaxes label ###
fig.update_yaxes(title_text="number of vehicles")

### Show ###
fig.show()

### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")

As time progresses which fuel is preferred?

An overwhelming victory of gasoline vehicles can be inferred.
From the data collected by Craigslist, we don't see much growth in electric as per our expectations.
Probably not enough are sold on Craigslist yet to compare to alternatives.

In [18]:
### Cleaning from non-specified values ###
df=df.dropna(subset=["year"])
df=df.dropna(subset=["fuel"])

### Chart title ###
title="Number of cars by fuel"

### Creating the histogram ###
fig=px.histogram(df, x="year", facet_col="fuel", width=800, title=title, facet_col_wrap=2)

### Customizing order shown ###
# fig.update_yaxes(title_text="number of vehicles")

### Show ###
fig.show()

### Optionally save it as .html/.png file ###
#plotly.offline.plot(fig, filename=f"{title}.html")
#fig.write_image(f"{title}.png")

How do prices vary with time?

From the graph, we can see that as time progresses, the price, in general, tends to rise, which is certainly due to technological evolution.

As far as diesel is concerned, the median price has been over 20k since 2011, and has continued to rise steadily over the years.
Electric on the other hand is being introduced for the first time in used since 2014 on craiglist. In the last few years as you can tell this is making inroads into the automotive market. So on the price is not yet possible to extrapolate meaningful and concrete analysis.
The median of gasoline generally tends to increase slightly without sharp fluctuations in price.

In [19]:
title="Boxplot price by fuel"

df["price"]=df["price"].dropna()
df=df.sort_values(by=["year"])
fig=px.box(df,y="fuel",orientation="h", x="price", animation_frame="year", range_x=[0, 100000])

fig.update_layout(title=title)

What is the average mileage for each brand as the years change?

It is intuitive to see how, the average mileage increases proportionally as age increases.

In [20]:
title="Average of odometer by car manufacturer"

fig=px.histogram(df,orientation="h",animation_frame="year",  y="manufacturer", 
                 x="odometer", histfunc="avg", height=1000)
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(title=title)

How is color distributed by vehicle type?

SUVs are the most frequent vehicle type on the craiglist site. In general, it can be inferred that black tends to be more frequent in most vehicles. You can immediately see that the most frequent colors for SUVs and sedans are:

  • black
  • silver
  • white
  • grey
  • blue
  • red
In [21]:
x = df.type
y = df.paint_color

fig = go.Figure(go.Histogram2d(
        x=x,
        y=y
    ))

title = "Colors of vehicles by type"
xaxis_title= "Type of vehicle"
yaxis_title= "Color"

fig.update_layout(
title=title,
    xaxis_title=xaxis_title,
    yaxis_title=yaxis_title)


fig.show()

Do conditions tend to vary based on the type of feed?

In general we can say that the condition is based on the honesty of the advertisers. Leaving this aside we can definitely say that a good part of electric vehicles are described "as new", this is because the electric has been recently introduced in the automotive market.

In [22]:
df=df.dropna(subset=["condition"])

gasLabels = df[df["fuel"]=="gas"].condition.value_counts().head(10).index
gasValues = df[df["fuel"]=="gas"].condition.value_counts().head(10).values

dieselLabels = df[df["fuel"]=="diesel"].condition.value_counts().head(10).index
dieselValues = df[df["fuel"]=="diesel"].condition.value_counts().head(10).values

electricLabels = df[df["fuel"]=="electric"].condition.value_counts().head(10).index
electricValues = df[df["fuel"]=="electric"].condition.value_counts().head(10).values

hybridLabels = df[df["fuel"]=="hybrid"].condition.value_counts().head(10).index
hybridValues = df[df["fuel"]=="hybrid"].condition.value_counts().head(10).values

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(go.Pie(labels=gasLabels, values=gasValues, name="Gas Car", title="Gas Cars"), 1, 1)
fig.add_trace(go.Pie(labels=dieselLabels, values=dieselValues, name="Diesel Car",title="Diesel Cars"), 1, 2)
fig.add_trace(go.Pie(labels=electricLabels, values=electricValues, name="Electric Car",title="Electric Cars"), 2, 1)
fig.add_trace(go.Pie(labels=hybridLabels, values=hybridValues, name="Hybrid Car",title="Hybrid Cars"), 2, 2)



# Use hole to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.show()

Come varia negli anni il prezzo dei veicoli in base alla tipologia?

Tendianzalmente la curva della media del prezzo tende a scendere in modo lineare con l'aumentare degli anni. Ciò è dovuto principalmente all'usura del veicolo e all'obsolescenza e senescenza che subisce nel tempo.

In [23]:
def display_price(df, age = (0,12), price = (100,100000), vehicle_type = "all", state = "all"):
    # Display the median price of vehicles depending on its type and its state.

    if state != "all":
        df = df[df["state"] == state]

    if vehicle_type != "all":
        df = df[df["type"] == vehicle_type]

    df = df[(df["age"] <= age[1]) & (df["age"] >= age[0])]

    df = df[(df["price"] >= price[0]) & (df["price"] <= price[1])]

    price_age = pd.pivot_table(df, values = "price", index = "age", aggfunc= np.median)
    price_age.columns = ["Median Price"]

    fig = plt.figure(figsize=(12,6))
    ax = fig.add_axes([0,0,1,1])
    ax2 = fig.add_axes([0.6,0.47,.35,.35])

    ax.plot(price_age["Median Price"], lw = 5)

    ax2.set_title(f"Vehicle type: {vehicle_type}\nNumber of vehicles: {df.shape[0]}\nCountry: USA\nUS-State: {state}", fontsize = 15)
    ax2.axis('off')

    ax.set_title(f"Median price by age of the vehicles",fontsize=25)
    ax.set_ylim(0,price_age["Median Price"].max()+1000)
    ax.set_xlabel("Age", fontsize = 15)
    ax.set_ylabel("Median price in $", fontsize = 15)

    ax.tick_params(axis='both', which='major', labelsize=15) 

    plt.show()
    
    
df = df.dropna(subset=["type"])
    
df = df[df.year.notnull()]
df["age"] = df.year.apply(lambda x: int(2020-x))
df = df[(df.age >= 0) & (df.age <= 30)]
    

for t in df.type.unique()[:5]:
    display_price(df, vehicle_type=t)