Rでレコードリンケージデータ紐付け　データ融合

とりあえずの例 RecordLinkage パッケージ
チュートリアル
 https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf
マニュアル
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf

わかりやすい例

> library(RecordLinkage)
> namae<-c("ドラえもん","ピカチュウ","しんのすけ")
> jyuusyo<-c("東京","大阪","東京")
> d1<-data.frame(namae,jyuusyo)
> namae2<-c("ドラえもね","ピカチュウ","しんのすけ")
> jyuusyo2<-c("東京","大阪","大阪")
> d2<-data.frame(namae2,jyuusyo2)
>
>
> rpairs<-compare.linkage(d1,d2,strcmp=TRUE)
> rr<-(epiWeights(rpairs))
> results<-epiClassify(rr,0.55)
> getPairs(rr)
id namae jyuusyo Weight
1 2 ピカチュウ大阪
2 2 ピカチュウ大阪 1.0000000
3
4 1 ドラえもん東京
5 1 ドラえもね東京 0.9732683
6
7 3 しんのすけ東京
8 1 ドラえもね東京 0.6881297
9
10 3 しんのすけ東京
11 3 しんのすけ大阪 0.6682935
12
13 1 ドラえもん東京
14 3 しんのすけ大阪 0.3564232
15
16 1 ドラえもん東京
17 2 ピカチュウ大阪 0.3475126
18
19 2 ピカチュウ大阪
20 1 ドラえもね東京 0.3475126
21
22 2 ピカチュウ大阪
23 3 しんのすけ大阪 0.3317065
24
25 3 しんのすけ東京
26 2 ピカチュウ大阪 0.00000

>library(RecordLinkage)
> kudamono<-c("apple","orange","apple")
> no<-c(1,2,1)
> d1<-data.frame(kudamono,no)
> kudamono2<-c("apple","orange","applo")
> no2<-c(1,2,1)
> d2<-data.frame(kudamono2,no2)
> rpairs<-compare.linkage(d1,d2,strcmp=TRUE)
> rr<-(epiWeights(rpairs))
> summary(rr)
Linkage Data Set
3 records in data set 1
3 records in data set 2
9 record pairs
0 matches
0 non-matches
9 pairs with unknown status

Weight distribution:
[0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
4 0 0 5
> results<-epiClassify(rr,0.55)
> summary(results)
Linkage Data Set
3 records in data set 1
3 records in data set 2
9 record pairs
0 matches
0 non-matches
9 pairs with unknown status

Weight distribution:
[0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
4 0 0 5
5 links detected
0 possible links detected
4 non-links detected
Classification table:
classification
true status N P L
4 0 5
>
> getPairs(rr)
id kudamono no Weight
1 1 apple 1
2 1 apple 1 1.0000000
3
4 2 orange 2
5 2 orange 2 1.0000000
6
7 3 apple 1
8 1 apple 1 1.0000000
9
10 1 apple 1
11 3 applo 1 0.9508455
12
13 3 apple 1
14 3 applo 1 0.9508455
15
16 1 apple 1
17 2 orange 2 0.3550045
18
19 2 orange 2
20 1 apple 1 0.3550045
21
22 3 apple 1
23 2 orange 2 0.3550045
24
25 2 orange 2
26 3 applo 1 0.2799074
27
> getPairs(rr,max.weight=1,min.weight=0.9)
id kudamono no Weight
1 1 apple 1
2 1 apple 1 1.0000000
3
4 2 orange 2
5 2 orange 2 1.0000000
6
7 3 apple 1
8 1 apple 1 1.0000000
9
10 1 apple 1
11 3 applo 1 0.9508455
12
13 3 apple 1
14 3 applo 1 0.9508455
15

library(RecordLinkage)
data(RLdata500)
RLdata500[1:5,]
rpairs<-compare.dedup(RLdata500,identity=identity.RLdata500)
rpairs$pairs[1:5,]

ブロッキング：それらの列が一致しているものしか比較しない。
1列目と5列目（苗字）から7列目（誕生日）

rpairs<-compare.dedup(RLdata500,blockfld=list(1,5:7), identity=identity.RLdata500)
rpairs$pairs[c(1:3,1203:1204),]
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1 1 174 1 NA 0 NA 0 0 0 0
2 1 204 1 NA 0 NA 0 0 0 0
3 2 7 1 NA 0 NA 0 0 0 0
1203 448 497 1 NA 0 NA 0 0 0 0
1204 450 477 1 NA 0 NA 0 0 0 0

チェック。名前だけの一致しか調べてないような。。

> (RLdata500)[c(1,2,7,174,204,448,450,477,497),]
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
1 CARSTEN MEIER 1949 7 22
2 GERD BAUER 1968 7 27
7 GERD SCHAEFER 1967 8 1
174 CARSTEN SCHMITT 2001 6 27
204 CARSTEN BRANDT 1938 1 14
448 SABINE OTTO 1940 7 23
450 SABINE HARTMANN 1943 6 16
477 SABINE ENGEL 1956 8 25
497 SABINE SCHNEIDER 1953 5 20

確率的レコードリンケージ

> rr<-epiWeights(rpairs)
> summary(rr)
Deduplication Data Set
500 records
1221 record pairs
49 matches
1172 non-matches
0 pairs with unknown status

Weight distribution:
[0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5]
1011 0 89 30 29 8 7
(0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8]
1 14 19 10 2 1

マッチングをとりだす

l(getPairs(rpairs,0.6,0.5))
id fname_c1 fname_c2 lname_c1 lname_c2 by bm bd Weight
3658 480 SUSANNE RICHTER 1955 11 27
3659 490 SUSANNE KLEIN 1960 3 15
3660
3661 481 SUSANNE KLEIN 1969 3 15
3662 490 SUSANNE KLEIN 1960 3 15
3663
> tail(getPairs(rr,0.6,0.5))
id fname_c1 fname_c2 lname_c1 lname_c2 by bm bd Weight
40 402 CHRISTA SCHWARZ 1965 7 13
41 462 CHRISTAH SCHWARZ 1965 7 13 0.5924569
42
43 388 ANDREA WEBER 1945 5 20
44 408 ANDREA SCHMIDT 1945 2 20 0.5067013
45

まとめ

results<-(epiClassify(rr,0.55))
> summary(results)
Deduplication Data Set
500 records
1221 record pairs
49 matches
1172 non-matches
0 pairs with unknown status

Weight distribution:
[0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5]
1011 0 89 30 29 8 7
(0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8]
1 14 19 10 2 1
46 links detected
0 possible links detected
1175 non-links detected
alpha error: 0.061224
beta error: 0.000000
accuracy: 0.997543