A Python text-processing approach (jieba word segmentation and symbol removal)

Views: 2 | Date: 2022-06-18 11:21:09

Here is the code:

import re
import codecs

import jieba.analyse  # also makes the top-level `jieba` package available
import pandas as pd


def simplification_text(xianbingshi):
    '''Extract the text found between <b> and <e> markers.'''
    xianbingshi_simplification = []
    with codecs.open(xianbingshi, 'r', 'utf8') as f:
        for line in f:
            line = line.strip()
            line_write = re.findall('(?<=<b>).*?(?=<e>)', line)
            for match in line_write:
                xianbingshi_simplification.append(match)
    with codecs.open(r'C:\Users\Administrator.SC-201812211013\PycharmProjects\untitled29\yiwoqu\code\xianbingshi_write.txt', 'w', 'utf8') as f:
        for line in xianbingshi_simplification:
            f.write(line + '\n')


def jieba_text():
    '''Segment the extracted text with jieba and write out the unique words.'''
    word_list = []
    with open(r'C:\Users\Administrator.SC-201812211013\PycharmProjects\untitled29\xianbingshi_write.txt', encoding='utf-8') as f:
        data = f.read()
    seg_list = jieba.cut(data, cut_all=False)  # accurate mode
    for i in seg_list:
        word_list.append(i.strip())
    # Drop duplicate words, keeping the first occurrence of each
    data_quchong = pd.DataFrame({'a': word_list})
    data_quchong.drop_duplicates(subset=['a'], keep='first', inplace=True)
    word_list = data_quchong['a'].tolist()
    with codecs.open('word.txt', 'w', 'utf8') as w:
        for line in word_list:
            w.write(line + '\n')


def word_messy(word):
    '''Refine the word list by stripping numeric and alphanumeric entries.'''
    word_sub_list = []
    with codecs.open(word, 'r', 'utf8') as f:
        for line in f:
            line_sub = re.sub(r'^[1-9]\d*\.\d*|^[A-Za-z0-9]+$|^[0-9]*$|^(-?\d+)(\.\d+)?$|^[A-Za-z0-9]{4,40}.*?', '', line)
            word_sub_list.append(line_sub)
    word_sub_list.sort()
    with codecs.open('word.txt', 'w', 'utf8') as w:
        for line in word_sub_list:
            w.write(line.strip('\n') + '\n')


if __name__ == '__main__':
    xianbingshi = r'C:\Users\Administrator.SC-201812211013\PycharmProjects\untitled29\yiwoqu\xianbingshi_sub_sen_all(1).txt'
    # simplification_text(xianbingshi)
    # word = r'C:\Users\Administrator.SC-201812211013\PycharmProjects\untitled29\word.txt'
    simplification_text(xianbingshi)
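To make the extraction and de-duplication steps easier to try in isolation, here is a minimal sketch that uses only the standard library. The helper names `extract_spans` and `unique_in_order` and the sample string are hypothetical, and `dict.fromkeys` is swapped in for the pandas `drop_duplicates` call used above; it is assumed that keeping the first occurrence of each item is the desired behavior.

```python
import re

# Matches the text between a "<b>" marker and the next "<e>" marker.
SPAN_PATTERN = re.compile(r"(?<=<b>).*?(?=<e>)")


def extract_spans(line):
    """Return every substring of `line` enclosed by <b> ... <e>."""
    return SPAN_PATTERN.findall(line)


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Hypothetical sample line, not taken from the original data file.
sample = "<b>头痛三天<e>其他<b>发热<e><b>头痛三天<e>"
spans = extract_spans(sample)
print(spans)
print(unique_in_order(spans))
```

The lookbehind and lookahead keep the markers themselves out of the result, so the extracted strings can be handed directly to the segmentation step.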

Addendum: jieba word segmentation in Python, with re used to strip symbols

Here is the code:

import re

import jieba

# Load the stop words into a set for fast membership tests
with open('stop_words.txt', 'r', encoding='utf-8', errors='ignore') as fstop:
    stopwords = {each_word.strip() for each_word in fstop}

# Digits, whitespace, and ASCII/CJK punctuation to be replaced with a space
punctuation = re.compile(
    r"[0-9\s+.!/_,$%^*()?;；:\-【】+\"']+|[+——！，;:。？、~@#￥%……&*（）]+"
)

with open('all.txt', 'r', encoding='utf-8', errors='ignore') as f1, \
        open('allutf11.txt', 'w', encoding='utf-8') as f2:
    for line in f1:
        line = line.strip()  # trim leading and trailing whitespace
        line = punctuation.sub(' ', line)  # remove punctuation
        seg_list = jieba.cut(line, cut_all=False)  # jieba segmentation, accurate mode
        out_str = ''
        for word in seg_list:
            if word not in stopwords:  # skip stop words
                out_str += word
                out_str += ' '
        f2.write(out_str)
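The per-line logic above (strip symbols, tokenize, drop stop words) can also be sketched as a small function that is easy to test without any input files. This is only an illustration under stated assumptions: `clean_line`, `PUNCT_PATTERN`, and the sample inputs are hypothetical, the character class is a shortened variant of the one above, and the tokenizer defaults to `str.split` so the sketch runs without jieba installed. Passing `jieba.lcut` as `tokenize` should give the segmentation behavior described in the article.

```python
import re

# Digits, whitespace, and common ASCII/CJK punctuation to blank out.
# The hyphen is escaped so it is a literal, not a character range.
PUNCT_PATTERN = re.compile(
    r"[0-9\s+.!/_,$%^*()?;:\-\"'，。！？、；：【】（）~@#￥%…&—]+"
)


def clean_line(line, stopwords, tokenize=str.split):
    """Strip punctuation, tokenize, and drop stop words from one line."""
    text = PUNCT_PATTERN.sub(" ", line.strip())
    # Skip whitespace-only tokens as well as stop words.
    tokens = [tok for tok in tokenize(text) if tok.strip() and tok not in stopwords]
    return " ".join(tokens)


# Hypothetical stop words and input, for illustration only.
stopwords = {"the", "a", "of"}
print(clean_line("the price of a ticket: $25, (non-refundable)!", stopwords))
```

Keeping the tokenizer as a parameter separates the cleaning rules from the segmentation library, which makes the regular expression easier to check on its own.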

The above is drawn from personal experience; I hope it serves as a useful reference.
