[Open Source] gossip_gensim: Word Segmentation Analysis of PTT Gossiping Board Users

The open-source code is available at the link: gossip_gensim on GitHub

The approach follows the article 以 gensim 訓練中文詞向量 ("Training Chinese Word Vectors with gensim").

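The training data consists of posts and push comments (推文) that I previously crawled from the PTT Gossiping board. It covers roughly the text of posts and comments published in late April 2017.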
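As a rough illustration of the training step, here is a minimal sketch using gensim's Word2Vec. It is not the code from the repository. It assumes the crawled text has already been segmented into space-separated tokens, one sentence per line; the function names and all hyperparameter values (`vector_size=250`, `window=5`, `min_count=5`) are hypothetical choices made for this example.

```python
def tokenize_lines(lines):
    """Yield one token list per non-empty line of pre-segmented text."""
    for line in lines:
        tokens = line.split()  # tokens are assumed to be whitespace-separated
        if tokens:
            yield tokens


def train_word_vectors(sentences, vector_size=250, window=5, min_count=5):
    """Train a Word2Vec model on an iterable of token lists."""
    # Imported lazily so tokenize_lines works even without gensim installed.
    from gensim.models import Word2Vec

    # gensim >= 4.0 names the dimension `vector_size`;
    # releases before 4.0 call the same parameter `size`.
    return Word2Vec(
        sentences=list(sentences),
        vector_size=vector_size,
        window=window,
        min_count=min_count,
    )


def top_similar(model, word, topn=10):
    """Return the `topn` (word, similarity) pairs closest to `word`."""
    return model.wv.most_similar(word, topn=topn)


# Hypothetical usage, with "gossip_seg.txt" standing in for the segmented corpus:
#   with open("gossip_seg.txt", encoding="utf-8") as fh:
#       model = train_word_vectors(tokenize_lines(fh))
#   for word, score in top_similar(model, "道德感"):
#       print(f"{word},{score}")
```

Querying a word that was dropped by `min_count` raises a `KeyError`, so a real script would probably check that the word is in the vocabulary first.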



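The comments were originally crawled to build a ChatBot that could talk to users in the tone of PTT users (鄉民). I used the Python library ChatterBot. As the saying goes, the devil is in the details: once all the sentences had been fed in for training, responses became very slow because of the amount of data (which I honestly don't think is that much), and a single sentence took several minutes to get a reply. I tried to improve the performance of the ChatterBot project but got nowhere, so for now I'll wait and see whether a better-performing version comes out.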

Coming back to the topic, let's take a look at the results of the Gossiping board word segmentation analysis.

The examples here are deliberately different from the ones on GitHub, so let's see what interesting things turn up.

道德感
Top 10 most similar words
鄉願,0.9327921271324158
出於,0.9301384091377258
半斤八兩,0.9297078251838684
不同於,0.9280074834823608
大開眼界,0.9261846542358398
不一,0.9260203242301941
情意,0.923358678817749
人們,0.9223132133483887
身處,0.9210435748100281
字詞,0.9183312058448792
----------------------------
情意
Top 10 most similar words
特質,0.9544895887374878
出於,0.9533513784408569
抨擊,0.9482460618019104
世俗,0.9448075294494629
無論是,0.9444169402122498
這裡面,0.9436346888542175
Bl,0.943479061126709
得以,0.9434195160865784
一代宗師,0.9433241486549377
邏輯性,0.940061092376709
----------------------------
流動
Top 10 most similar words
情慾,0.8976155519485474
憧憬,0.8838305473327637
大逆轉,0.8748572468757629
自信心,0.8727014660835266
詩書,0.867963969707489
高學歷,0.867770254611969
退化,0.8648242354393005
比理組,0.859582781791687
國一,0.8591532707214355
應用,0.8580455183982849
----------------------------
情慾
Top 10 most similar words
威逼,0.9056156277656555
流動,0.8976155519485474
跟性,0.8798336386680603
平權,0.866770327091217
權勢,0.8662711977958679
未成年人,0.8649849891662598
有性,0.8637351393699646
侵是,0.8636893630027771
師長,0.861670970916748
性,0.8614344000816345
----------------------------
威逼
Top 10 most similar words
師長,0.9264066815376282
哄騙,0.9144564867019653
權勢,0.9088498950004578
跟性,0.907332181930542
師生,0.9060599207878113
情慾,0.9056156277656555
同性,0.8882886171340942
侵害,0.8819126486778259
生理,0.8775524497032166
倫理,0.8775419592857361
----------------------------
師長
Top 10 most similar words
哄騙,0.9552128314971924
著重,0.9285260438919067
威逼,0.9264066815376282
師生,0.923327624797821
教導,0.9221091866493225
男師,0.9139125943183899
卸下,0.9113106727600098
權威,0.9109035730361938
遭受,0.9081582427024841
主治醫生,0.9067290425300598
----------------------------
哄騙
Top 10 most similar words
誘騙,0.9663118720054626
師長,0.9552128314971924
情竇初開,0.9493485689163208
男師,0.9475616812705994
遭受,0.944742739200592
誘惑,0.9439117312431335
憧憬,0.9431259632110596
侵是,0.9425802826881409
感到痛苦,0.9420366287231445
脅迫,0.9414324760437012
----------------------------
誘騙
Top 10 most similar words
哄騙,0.9663118720054626
始亂終棄,0.9537345170974731
情竇初開,0.9524070620536804
誘惑,0.9486990571022034
失身,0.9469537734985352
半推半就,0.9466378092765808
虛榮,0.943824291229248
愛慕,0.9427258372306824
感到痛苦,0.9420251250267029
暗指,0.9390961527824402
----------------------------
始亂終棄
Top 10 most similar words
誘騙,0.9537343978881836
誘惑,0.9408020377159119
哄騙,0.9372276067733765
當小三,0.9346690773963928
相愛,0.9337821006774902
家室,0.9321751594543457
失身,0.9307398200035095
情竇初開,0.9272768497467041
甜言蜜語,0.9265762567520142
姦,0.925370991230011
----------------------------
情竇初開
Top 10 most similar words
四五十歲,0.9527103304862976
誘騙,0.9524070620536804
哄騙,0.9493485689163208
虛榮,0.9485225081443787
一個願打,0.9455652236938477
另當別論,0.9417014718055725
失身,0.940875232219696
十三歲,0.9346972703933716
愛慕,0.9345802664756775
一個願挨,0.9334532022476196
----------------------------
小三
Top 10 most similar words
承擔責任,0.943902850151062
交罪,0.9436488151550293
戀情,0.9421593546867371
遺願,0.9381139278411865
可告,0.9370490312576294
輿論壓力,0.9349611401557922
無證據,0.9329593777656555
形同,0.9323279857635498
師母,0.9299685955047607
本案,0.9296196699142456
----------------------------
認罪
Top 10 most similar words
處死,0.9280140399932861
知悉,0.9244415163993835
怎說,0.9198821187019348
豈不是,0.919385552406311
洪家,0.919087827205658
誘奸,0.9166374206542969
指向,0.9160752892494202
移送,0.9148188233375549
縱放,0.9142543077468872
手下,0.9142283201217651
----------------------------
右肩
Top 10 most similar words
誘姦,0.8586050271987915
劈,0.7917353510856628
外遇,0.7862498164176941
強姦,0.7712001204490662
仙人跳,0.7703946232795715
已婚,0.7672145962715149
吉性,0.7592676877975464
腿,0.7587223649024963
幼女,0.7491370439529419
上牀,0.7470999956130981
----------------------------
仙人跳
Top 10 most similar words
女森耶,0.8939048647880554
HIV,0.8853669166564941
一夜情,0.8822399377822876
輪家,0.8797150254249573
下體,0.8775082230567932
小王,0.8765924572944641
還告,0.8756535053253174
去驗,0.8708358407020569
避孕,0.870829164981842
偷吃,0.8680775165557861
----------------------------
偷吃
Top 10 most similar words
下體,0.9269814491271973
輪家,0.9078789353370667
教練,0.902589738368988
精蟲,0.9014880061149597
抽插,0.8994307518005371
完炮,0.8994146585464478
渴,0.8978879451751709
紅衣,0.8959729671478271
倒貼,0.8955304026603699
包夾,0.8954594731330872
----------------------------
HIV
Top 10 most similar words
去驗,0.9524077773094177
要驗,0.9438307285308838
驗就驗,0.9235413670539856
爽完,0.9227591753005981
綠帽,0.9207509756088257
交個,0.9190952777862549
小頭,0.9155542850494385
調情,0.9137581586837769
AA制,0.913293719291687
戴綠帽,0.9132764935493469
----------------------------
adsl
Top 10 most similar words
性器官,0.9044697284698486
媽寶,0.9015978574752808
陽痿,0.8987319469451904
帶壞,0.895245373249054
申裝,0.8952390551567078
鐵人,0.8945028781890869
30cm,0.8940211534500122
本樓,0.8935860395431519
變性,0.8930120468139648
點不,0.8928276896476746
----------------------------
AA
Top 10 most similar words
記恨,0.8681396245956421
AA制,0.8568536639213562
月經,0.8565375804901123
看臉,0.8558970093727112
暖,0.8531177043914795
女森耶,0.8389519453048706
人帥,0.838537871837616
偷吃,0.8385173678398132
男森,0.8362894058227539
滯銷,0.8359618186950684
----------------------------
女森
Top 10 most similar words
男森,0.9081128835678101
男生,0.9070939421653748
女生,0.8853805661201477
嫖妓,0.8630561232566833
正妹,0.8373980522155762
女人,0.8257011771202087
女孩,0.8236839771270752
不正,0.821898341178894
男友,0.8157771229743958
男人,0.8122578263282776
----------------------------
女森耶
Top 10 most similar words
仙人跳,0.8939048647880554
輪家,0.8849769830703735
還告,0.8785653710365295
印男,0.8721482753753662
設局,0.8674094676971436
倫家,0.8641580939292908
男森,0.8601803183555603
完炮,0.855823278427124
月經,0.8547759056091309
衝腦,0.8521656394004822
----------------------------
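
To make the scores in the listings above easier to read: assuming they come from gensim's `most_similar`, each number is the cosine similarity between the query word's vector and the listed word's vector. That would also explain why a pair such as 流動 and 情慾 shows the same score in both directions. The sketch below reproduces the ranking and the listing layout in plain Python on toy two-dimensional vectors; the vectors, the word labels `w1` to `w3`, and the function names are made up for illustration.

```python
import math

SEPARATOR = "----------------------------"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=10):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


def format_listing(query, ranked, header="Top 10 most similar words"):
    """Render one block: query word, header, `word,score` lines, separator."""
    lines = [query, header]
    lines += [f"{word},{score}" for word, score in ranked]
    lines.append(SEPARATOR)
    return "\n".join(lines)


# Toy vectors (hypothetical); real word vectors have hundreds of dimensions.
toy = {"w1": [1.0, 0.0], "w2": [1.0, 0.5], "w3": [0.0, 1.0]}
print(format_listing("w1", most_similar(toy, "w1")))
```

Because cosine similarity only measures the angle between vectors, a high score means two words appear in similar contexts in the corpus, not that they are synonyms.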
