
Kaggle

Python

list

  • Iterating over a list
nums = ["01", "02", "03", "04"]  # avoid calling the variable `list`, which shadows the built-in
# 1. Iterate over the elements directly
for num in nums:
    print(num)

# 2. Iterate by index
for i in range(len(nums)):
    print(nums[i])

# 3. Iterate via an explicit iterator
for num in iter(nums):
    print(num)

# 4. enumerate yields (index, value) pairs
for obj in enumerate(nums):
    print(obj)
for index, value in enumerate(nums):
    print("index =", index, "value =", value)
  • Compare each list element against a value, producing e.g. [False, True, True, False]
a = [1, 3, 5, 2]  # example input (assumed; any list of numbers works)

# Using a list comprehension
print([x > 2 for x in a])

# Using the map function
print(list(map(lambda x: x > 2, a)))

dict

def multi_word_search(doc_list, keywords):
    """
    Takes a list of documents (each document is a string) and a list of keywords.
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword.

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    indices = {}
    for x in keywords:
        indices[x] = []
    for i, doc in enumerate(doc_list):
        tokens = doc.split()
        normalized = [token.rstrip('.,').lower() for token in tokens]
        for x in keywords:
            if x.lower() in normalized:
                indices[x].append(i)
    return indices
  • Check whether a key is in a dict
d = {'name': '', 'age': '', 'sex': ''}  # avoid calling the variable `dict`, which shadows the built-in
# 1 (Python 2 only; dict.has_key() was removed in Python 3)
print(d.has_key('name'))
# 2
print('name' in d.keys())
# 3 (preferred)
print('name' in d)  # returns True

Pandas

Creating, Reading and Writing

import pandas as pd

Creating data: there are two core objects in pandas: the DataFrame and the Series.

DataFrame
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
   Yes   No
0   50  131
1   21    2
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
             Bob           Sue
0    I liked it.  Pretty good.
1  It was awful.        Bland.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.
Series
pd.Series([1, 2, 3, 4, 5])
0    1
1    2
2    3
3    4
4    5
dtype: int64
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

Indexing, Selecting & Assigning

reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
129969 France A dry style of Pinot Gris, this is crisp with … NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit
  • Select a column
reviews.country # reviews['country']
0            Italy
1         Portugal
            ...
129969      France
129970      France
Name: country, Length: 129971, dtype: object
  • Select one row of a column
reviews['country'][0]
'Italy'
iloc

Both loc and iloc take the row first and the column second. iloc selects by integer position; loc selects by label.
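
One difference worth a quick example: iloc follows the usual Python convention and excludes the end of a range, while loc includes it (something to keep in mind when the index labels are numbers):

reviews.iloc[0:3]  # rows at positions 0, 1, 2 (end position excluded)
reviews.loc[0:3]   # rows labeled 0, 1, 2 AND 3 (end label included)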

  • Select the first row
reviews.iloc[0] # select one row of the whole DataFrame
country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                              ...
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object
  • Select the first column
reviews.iloc[:, 0]
0            Italy
1         Portugal
            ...
129969      France
129970      France
Name: country, Length: 129971, dtype: object
  • First three rows of the first column
reviews.iloc[:3, 0] # reviews.iloc[[0, 1, 2], 0]
0       Italy
1    Portugal
2          US
Name: country, dtype: object
  • Last five rows
reviews.iloc[-5:]
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
129966 Germany Notes of honeysuckle and cantaloupe sweeten th… Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 … Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl… NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c… Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als… Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with … NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit
loc
reviews.loc[0, 'country'] # first row of the country column
'Italy'
  • All rows of the 'taster_name', 'taster_twitter_handle', and 'points' columns
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
          taster_name taster_twitter_handle  points
0       Kerin O’Keefe          @kerinokeefe      87
1          Roger Voss            @vossroger      87
...
129969     Roger Voss            @vossroger      90
129970     Roger Voss            @vossroger      90
Conditional selection
reviews.country == 'Italy'
0          True
1         False
          ...
129969    False
129970    False
Name: country, Length: 129971, dtype: bool
  • Select the rows whose country column is Italy
reviews.loc[reviews.country == 'Italy']
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here’s a bright, informal red that opens with … Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
129961 Italy Intense aromas of wild cherry, baking spice, t… NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a… Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d… Nero d’Avola Cusumano
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]
Assignment
reviews['critic'] = 'everyone'
  • Set index_backwards to a descending range of values
reviews['index_backwards'] = range(len(reviews), 0, -1) 
reviews['index_backwards']
0         129971
1         129970
           ...
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Summary Functions and Maps

reviews.points.describe()
reviews.points.mean()
  • Return all distinct values in a column
reviews.taster_name.unique()
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)
map - Series
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
0        -1.447138
1        -1.447138
           ...
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64
apply - DataFrame
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco -1.447138 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth… Avidagos -1.447138 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
129969 France A dry style of Pinot Gris, this is crisp with … NaN 1.552862 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 1.552862 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit
Other operations
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean
0        -1.447138
1        -1.447138
           ...
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64
reviews.country + " - " + reviews.region_1
0            Italy - Etna
1                     NaN
               ...
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object

Grouping and Sorting

Groupwise analysis
reviews.groupby('points').points.count()
points
80     397
81     692
      ...
99      33
100     19
Name: points, Length: 21, dtype: int64
reviews.groupby('points').price.min()
points
80      5.0
81      5.0
      ...
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64
# Select the title of the first wine reviewed from each winery:
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
winery
1+1=3 1+1=3 NV Rosé Sparkling (Cava)
10 Knots 10 Knots 2010 Viognier (Paso Robles)
...
àMaurice àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
# Pick the best-rated wine for each country and province:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza Province Argentina If the color doesn’t tell the full story, the … Nicasia Vineyard 97 120.0 Mendoza Province Mendoza NaN Michael Schachner @wineschach Bodega Catena Zapata 2006 Nicasia Vineyard Mal… Malbec Bodega Catena Zapata
Other Argentina Take note, this could be the best wine Colomé … Reserva 95 90.0 Other Salta NaN Michael Schachner @wineschach Colomé 2010 Reserva Malbec (Salta) Malbec Colomé
Uruguay San Jose Uruguay Baked, sweet, heavy aromas turn earthy with ti… El Preciado Gran Reserva 87 50.0 San Jose NaN NaN Michael Schachner @wineschach Castillo Viejo 2005 El Preciado Gran Reserva R… Red Blend Castillo Viejo
Uruguay Uruguay Cherry and berry aromas are ripe, healthy and … Blend 002 Limited Edition 91 22.0 Uruguay NaN NaN Michael Schachner @wineschach Narbona NV Blend 002 Limited Edition Tannat-Ca… Tannat-Cabernet Franc Narbona
reviews.groupby(['country']).price.agg([len, min, max])
            len   min    max
country
Argentina  3800   4.0  230.0
Armenia       2  14.0   15.0
...
Ukraine      14   6.0   13.0
Uruguay     109  10.0  130.0
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
                             len
country   province
Argentina Mendoza Province  3264
          Other              536
...
Uruguay   San Jose             3
          Uruguay             24
  • The result has a MultiIndex, so at this point we need reset_index()
countries_reviewed.reset_index()
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
...
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24
Sorting
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len') # sort_values()
    country               province    len
179  Greece  Muscat of Kefallonian      1
192  Greece          Sterea Ellada      1
...
415      US             Washington   8639
392      US             California  36247
countries_reviewed.sort_values(by='len', ascending=False) # sort_values()
    country    province    len
392      US  California  36247
415      US  Washington   8639
...
63    Chile     Coelemu      1
149  Greece      Beotia      1
countries_reviewed.sort_index() # sort_index()
       country          province   len
0    Argentina  Mendoza Province  3264
1    Argentina             Other   536
...
423    Uruguay          San Jose     3
424    Uruguay           Uruguay    24
countries_reviewed.sort_values(by=['country', 'len'])
       country          province   len
1    Argentina             Other   536
0    Argentina  Mendoza Province  3264
...
424    Uruguay           Uruguay    24
419    Uruguay         Canelones    43

Dtypes and Missing Values

Dtypes

The data type for a column in a DataFrame or a Series is known as the dtype.

reviews.price.dtype
dtype('float64')

Alternatively, the dtypes property returns the dtype of every column in the DataFrame:

reviews.dtypes
country        object
description    object
                ...
variety        object
winery         object
Length: 13, dtype: object
reviews.points.astype('float64')
0         87.0
1         87.0
           ...
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

A DataFrame or Series index has its own dtype, too:

reviews.index.dtype
dtype('int64')
Missing data
reviews[pd.isnull(reviews.country)]
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
913 NaN Amber in color, this wine has aromas of peach … Asureti Valley 87 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Gotsa Family Wines 2014 Asureti Valley Chinuri Chinuri Gotsa Family Wines
3131 NaN Soft, fruity and juicy, this is a pleasant, si… Partager 83 NaN NaN NaN NaN Roger Voss @vossroger Barton & Guestier NV Partager Red Red Blend Barton & Guestier
129590 NaN A blend of 60% Syrah, 30% Cabernet Sauvignon a… Shah 90 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Büyülübağ 2012 Shah Red Red Blend Büyülübağ
129900 NaN This wine offers a delightful bouquet of black… NaN 91 32.0 NaN NaN NaN Mike DeSimone @worldwineguys Psagot 2014 Merlot Merlot Psagot
reviews.region_2.fillna("Unknown") # replace every NaN in the region_2 column with "Unknown"
0         Unknown
1         Unknown
           ...
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object
# Example: change Twitter handle from @kerinokeefe to @kerino
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
0            @kerino
1         @vossroger
             ...
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

Renaming and Combining

Renaming
reviews.rename(columns={'points': 'score'}) # rename the points column to score
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}) # relabel the first two rows as firstEntry and secondEntry
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Combining

We sometimes need to combine different DataFrames and/or Series. Pandas has three core methods for doing this. In increasing order of complexity, they are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we omit it here and focus on the first two functions.

concat()
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|”Walk”|”On”|”Water”|”Aftermath/Shady/In… 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem’s new track Walk on Water ft. Beyoncé i…
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|”bad unboxing”|”unboxing”|”fan mail”|”id… 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las…
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First… Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|”2018”|”Lisbon”|”Cypru… 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first…
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A… SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|”SuperDuperKyle”|”Ikuyo”|”2 Chainz”|”Soph… 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album ‘Light of Mine’ out now: http://ky
join()

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects that share a common index.

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

# Pull out videos that happened to be trending in both Canada and the UK on the same day
left.join(right, lsuffix='_CAN', rsuffix='_UK')

The lsuffix and rsuffix parameters are necessary here because the British and Canadian datasets have the same column names. If that weren't true (say, because we had renamed them beforehand), we wouldn't need them.

video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
!! THIS VIDEO IS NOTHING BUT PAIN !! 18.04.01 PNn8sECd7io Markiplier 20 2018-01-03T19:33:53.000Z getting over it|”markiplier”|”funny moments”|”… 835930 47058 1023 8250 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 23 Solo Wins! 18.09.03 DvPW66IFhMI AlexRamiGaming 20 2018-03-09T07:15:52.000Z PS4 Battle Royale|”PS4 Pro Battle Royale”|”Bat… 212838 5199 542 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 18.07.05 Wt9Gkpmbt44 TheBigJackpot 24 2018-05-07T06:58:59.000Z Slot Machine|”win”|”Gambling”|”Big Win”|”raja”… 28973 2167 175 10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨Active Shooter at YouTube Headquarters 18.04.04 Az72jrKbANA Right Side Broadcasting Network 25 2018-04-03T23:12:37.000Z YouTube shooter|”YouTube active shooter”|”acti… 103513 1722 181 76 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Intro to ML

DecisionTreeRegressor

# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE: {:,.0f}".format(val_mae))
Tuning DecisionTreeRegressor
  • Vary the maximum number of leaf nodes of the decision tree
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Find the leaf count with the lowest validation MAE
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)

# Refit on all of the data using the best tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)

RandomForest

from sklearn.ensemble import RandomForestRegressor

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)

# fit your model
rf_model.fit(train_X, train_y)

# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_mae = mean_absolute_error(rf_model.predict(val_X), val_y)

print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))

Intermediate ML

Handling Missing Values

Three approaches to dealing with missing values:

  1. A Simple Option: Drop Columns with Missing Values


  2. A Better Option: Imputation

Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column.


  3. An Extension to Imputation


Example
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
y = data.Price # Select target

# Drop the Price column from the dataset `data`
melb_predictors = data.drop(['Price'], axis=1)
# Keep only the columns that are not of object dtype (i.e. the numeric columns);
# exclude=['object'] filters out the object-dtype columns
X = melb_predictors.select_dtypes(exclude=['object'])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Score of Approach 1 (drop)
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
# MAE from Approach 1 (Drop columns with missing values): 183550.22137772635
Score of Approach 2 (Imputation)
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# fit_transform imputes the missing values in the training set X_train and stores the result in a new DataFrame, imputed_X_train.
# transform imputes the missing values in the validation set X_valid and stores the result in a new DataFrame, imputed_X_valid.
# Imputation drops the column names, so we restore them from the original X_train and X_valid.
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
# MAE from Approach 2 (Imputation): 178166.46269899711
Score of Approach 3 (Extension to Imputation)
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
# MAE from Approach 3 (An Extension to Imputation): 178927.503183954
Why imputation beat dropping here
print(X_train.shape)

missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
# X_train.isnull() marks every missing value in X_train as True and every present value as False, returning a boolean DataFrame of the same shape.
# sum() then adds up the True values in each column, giving the number of missing values per column. Printing the entries greater than 0 shows every column in X_train that contains missing values.

Categorical Variables

A lot of data is non-numeric. This section shows how to put it to use for machine learning (by encoding the categories).

Three ways to handle categorical variables:

  1. Drop columns
  2. Ordinal encoding


  3. One-hot encoding


Example
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column 选择基数 < 10 的列
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10
                        and X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
# X_train.dtypes returns a Series containing the dtype of each column in X_train.
s = (X_train.dtypes == 'object')
# From the Series s, take the index entries (column names) whose value is True and collect them in a list
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols) # Categorical variables: ['Type', 'Method', 'Regionname']
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Score of Approach 1 (drop)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
# MAE from Approach 1 (Drop categorical variables): 175703.48185157913
Score of Approach 2 (Ordinal Encoding)
from sklearn.preprocessing import OrdinalEncoder

label_X_train = X_train.copy() # Make copy to avoid changing original data
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
# MAE from Approach 2 (Ordinal Encoding): 165936.40548390493
Score of Approach 3 (One-Hot Encoding)
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # in scikit-learn >= 1.2, use sparse_output=False instead
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
# MAE from Approach 3 (One-Hot Encoding): 166089.4893009678

Pipelines

In machine learning, a pipeline bundles multiple data-processing and model-training steps into one workflow. Pipelines automate and standardize preprocessing, feature extraction, model training, and evaluation, which improves both developer productivity and model reproducibility. Some of their benefits:

  1. Simpler development: a pipeline combines many steps into a single object and automates and standardizes them, greatly reducing the work involved and simplifying the overall workflow.
  2. Uniform data processing: a pipeline applies the same preprocessing steps (such as imputation and scaling) to all of the data, guaranteeing that everything is trained under the same conditions.
  3. More efficient models: a pipeline streamlines feature extraction so the model can quickly pull useful information out of the raw data, improving efficiency and accuracy.
  4. Better reproducibility: a pipeline ensures that training and evaluation run identically in different environments.
  5. Simpler deployment: a pipeline packages training and deployment together, making models easier and faster to ship.
Example
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

y = data.Price
X = data.drop(['Price'], axis=1)

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
S1 Define the pipeline's preprocessing steps
  1. Impute missing values 2. Handle categorical variables
# we use the ColumnTransformer class to bundle together different preprocessing steps
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
  1. Create a numerical_transformer for the numeric data, which uses SimpleImputer to replace missing values with a constant.
  2. Create a categorical_transformer for the categorical data, which uses SimpleImputer to replace missing values with the most frequent value and then OneHotEncoder to encode the categories as binary indicator variables.
  3. Use ColumnTransformer to build the preprocessor object, which applies numerical_transformer to the numeric columns in numerical_cols and categorical_transformer to the categorical columns in categorical_cols.
S2 Define Model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
S3 Create and Evaluate the Pipeline
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score) # MAE: 160679.18917034855

Cross-Validation

Concept

(figure: 5-fold cross-validation — the data is divided into 5 folds, and each fold takes one turn as the holdout/validation set while the model is trained on the remaining 4 folds)

Example
import pandas as pd

data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']

X = data[cols_to_use]
y = data.Price
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])

We obtain the cross-validation scores with cross_val_score() from scikit-learn, setting the number of folds with the cv parameter.

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
# print(scores.mean())

Result:

MAE scores:
[301628.7893587 303164.4782723 287298.331666 236061.84754543
260383.45111427]

XGBoost

XGBoost optimizes a model with gradient boosting (gradient descent on the loss). The method dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.

We refer to the random forest method as an “ensemble method”. By definition, ensemble methods combine the predictions of several models (e.g., several trees, in the case of random forests).

Next, we’ll learn about another ensemble method called gradient boosting.
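
The core idea, as a minimal sketch rather than XGBoost's actual implementation: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add its scaled predictions to the ensemble. For squared error, the residuals are (up to a factor) the negative gradient of the loss. The helper below is illustrative only; X and y stand for any numeric feature matrix and target.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_rounds=100, learning_rate=0.1):
    # Start from a constant model: the mean of the target
    pred = np.full(len(y), float(np.mean(y)))
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                           # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees, pred

XGBoost builds on this loop with regularization, second-order gradient information, and much faster tree construction.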

Example
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

X_train, X_valid, y_train, y_valid = train_test_split(X, y)
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)
Out[]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
# Mean Absolute Error: 241041.5160392121
Parameter Tuning

XGBoost has several parameters that can dramatically affect accuracy and training speed.

  • n_estimators: the number of boosting rounds. Too low underfits; too high overfits. Values of 100-1000 are typical.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
  • early_stopping_rounds offers a way to automatically find the ideal value of n_estimators: it stops the iterations once the validation score stops improving. The smart approach is to set a high value for n_estimators and then let early_stopping_rounds find the optimal number of rounds.

    Because random chance sometimes produces a single round with no improvement in the validation score, you specify how many consecutive rounds of deterioration to allow before stopping. early_stopping_rounds=5 is a reasonable choice: training stops after 5 straight rounds of worsening validation scores.
    When using early_stopping_rounds, you also need to set aside some data for computing validation scores; this is done with the eval_set parameter.

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)
# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor
# constructor rather than in fit().
  • learning_rate: shrinks each tree's contribution; the default is 0.3 (as the model repr above shows). A smaller rate with more rounds usually generalizes better.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)
  • n_jobs: on large datasets where runtime matters, parallelism builds the model faster; n_jobs is usually set to the number of cores on the machine. On small datasets it doesn't help and the resulting model won't improve, so micro-optimizing fit time there is usually just a distraction. On large datasets, though, it is very useful; otherwise you can wait a long time during the fit call.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

Data Leakage

Data leakage causes a model to look accurate until you start making decisions with it, at which point the model becomes very inaccurate.

There are two main types of leakage: target leakage and train-test contamination.

target leakage

Target leakage occurs when information related to the target variable is mistakenly included in the training data during feature selection or feature engineering, making the model's test performance unreliable. For example, when predicting customer churn, if the training set includes information about whether the customer churned later, the model may look excellent but will fail to deliver the same results in production.

import pandas as pd

data = pd.read_csv('../input/aer-credit-card-data/AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])

y = data.card
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0]) # Number of rows in the dataset: 1319
X.head()
   reports       age  income     share  expenditure  owner  selfemp  dependents  months  majorcards  active
0        0  37.66667  4.5200  0.033270   124.983300   True    False           3      54           1      12
1        0  33.25000  2.4200  0.005217     9.854167  False    False           3      34           1      13
2        0  33.66667  4.5000  0.004156    15.000000   True    False           4      58           1       5
3        0  30.50000  2.5400  0.065214   137.869200  False    False           0      25           1       7
4        0  32.16667  9.7867  0.067051   546.503300   True    False           2      64           1       5
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
# Cross-validation accuracy: 0.981052

98% accuracy should make us suspicious: accuracy that high is very rare, and the data deserves a closer look for target leakage. In particular, does expenditure mean expenditure on this card, or expenditure on cards used before applying?

expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f' \
      % ((expenditures_cardholders == 0).mean()))

# Fraction of those who did not receive a card and had no expenditures: 1.00
# Fraction of those who received a card and had no expenditures: 0.02

Everyone who did not receive a card had no expenditures, while only 2% of those who did had none. So expenditure evidently means expenditure on the card applied for: this is target leakage. share is partly determined by expenditure, and active and majorcards are also suspect, so all of them should be dropped.

# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())
train-test contamination

Train-test contamination occurs when information from the test set leaks into training during model evaluation; the model then looks good on the test set but fails to deliver the same results in production. For example, if feature scaling is fit on the whole dataset, including the test set, test-set information leaks into the training set and the model's test performance looks better than it really is.
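
A minimal sketch of the fix, assuming scikit-learn (the synthetic X and y below are illustrative stand-ins, not a dataset defined above): keep all preprocessing inside a pipeline, so that during cross-validation the scaler is fit only on each training fold and never sees the held-out fold.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # stand-in feature matrix
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100) # stand-in target

# Wrong: StandardScaler().fit_transform(X) before splitting leaks test-fold statistics.
# Right: inside the pipeline, the scaler is refit on the training fold of every split.
pipeline = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipeline, X, y, cv=5)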

Time Series

Linear Regression With Time Series

Use two features unique to time series: lags and time steps.

import pandas as pd

df = pd.read_csv(
    "../input/ts-course-data/book_sales.csv",
    index_col='Date',
    parse_dates=['Date'],
).drop('Paperback', axis=1)

df.head()
            Hardcover
Date
2000-04-01        139
2000-04-02        128
2000-04-03        172
2000-04-04        139
2000-04-05        191

A linear regression algorithm learns how to make a weighted sum of its input features. With two features, we would have:

target = weight_1 * feature_1 + weight_2 * feature_2 + bias
Time-step features
import numpy as np
df['Time'] = np.arange(len(df.index))
df.head()
            Hardcover  Time
Date
2000-04-01        139     0
2000-04-02        128     1
2000-04-03        172     2
2000-04-04        139     3
2000-04-05        191     4

Linear regression with the time dummy produces the model:

target = weight * time + bias

The time dummy then lets us fit curves to time series in a time plot, where Time forms the x-axis.

(figure: time plot of Hardcover sales with the fitted trend line, Time on the x-axis)
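
A minimal sketch of fitting that trend line with scikit-learn, using the df built above:

from sklearn.linear_model import LinearRegression

X = df.loc[:, ['Time']]      # feature: the time dummy
y = df.loc[:, 'Hardcover']   # target: daily hardcover sales

model = LinearRegression()
model.fit(X, y)
trend = pd.Series(model.predict(X), index=X.index)  # fitted values along the trend line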

Lag features
df['Lag_1'] = df['Hardcover'].shift(1)
df = df.reindex(columns=['Hardcover', 'Lag_1'])

df.head()
            Hardcover  Lag_1
Date
2000-04-01        139    NaN
2000-04-02        128  139.0
2000-04-03        172  128.0
2000-04-04        139  172.0
2000-04-05        191  139.0

Linear regression with the lag feature produces the model:

target = weight * lag + bias

Lag features let us fit curves to lag plots, where each observation in a series is plotted against the previous observation.

(figure: lag plot of Hardcover sales against Lag_1 with the fitted line)
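
A minimal sketch of fitting on the lag feature, using the df built above (the first row must be dropped, since shift(1) leaves a NaN there):

from sklearn.linear_model import LinearRegression

df_lag = df.dropna()              # drop the row where Lag_1 is NaN
X = df_lag.loc[:, ['Lag_1']]
y = df_lag.loc[:, 'Hardcover']

model = LinearRegression()
model.fit(X, y)
pred = pd.Series(model.predict(X), index=X.index)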

Example - Tunnel Traffic