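▌Python Scraper Details
- Define the data range (year / quarter / page): build a URL from the year, quarter, and page number, then send an HTTP request to fetch the page content.
- Use BeautifulSoup to locate the stock ticker (ticker) and portfolio percentage (percent) data in the HTML.
- Organize the data and store it in a dict (year / quarter / ticker / percent).
- Convert the dict to a Pandas DataFrame.
- Convert the percentages to float, and classify stocks below 3% as “Other”.
- Aggregate the data.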
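The last three steps above (DataFrame conversion, float casting with the 3% cutoff, and aggregation) can be sketched on a small hand-made dict. This is only an illustration under assumptions, not the author's code: the sample tickers and values are made up, the `group_small_holdings` helper is hypothetical, and the percentages are assumed to arrive as strings such as "45.5%".

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped dict (values are made up).
sample = {
    "year": [2024, 2024, 2024, 2024],
    "quarter": [1, 1, 1, 1],
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "percent": ["50.0%", "45.5%", "2.5%", "2.0%"],
}


def group_small_holdings(data, threshold=3.0):
    """Cast percent strings to float, relabel small holdings as "Other", and aggregate."""
    df = pd.DataFrame(data)
    # Assumption: percentages are strings like "45.5%"; strip the sign and cast.
    df["percent"] = df["percent"].str.rstrip("%").astype(float)
    # Holdings below the threshold are relabeled "Other".
    df.loc[df["percent"] < threshold, "ticker"] = "Other"
    # Sum the percentages per year, quarter, and ticker.
    return df.groupby(["year", "quarter", "ticker"], as_index=False)["percent"].sum()


result = group_small_holdings(sample)
print(result)
```

Relabeling before grouping means a single `groupby` both merges the small holdings into one "Other" row and leaves the larger holdings untouched.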
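from bs4 import BeautifulSoup
import re
import requests
import pandas as pd
import numpy as np

# Ranges to scrape. range() excludes its stop value, so range(1, 4) yields 1, 2, 3.
page = range(1, 4)
quarter = range(1, 4)  # quarters 1-3; use range(1, 5) to include Q4
year = range(2022, 2025)  # 2022-2024

# One list per output column; values are appended as pages are parsed.
my_dict = {"year": [], "quarter": [], "ticker": [], "percent": []}

for y in year:
    for q in quarter:
        for p in page:
            # Build the portfolio URL for this year, quarter, and page.
            url = 'https://valuesider.com/guru/warren-buffett-berkshire-hathaway/portfolio/{}/{}?sort=-percent_portfolio&sells_page=1&page={}'.format(y, q, p)
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Collect the non-empty div cells that carry these table-column classes.
            ticker_list = soup.find_all('div', class_='guru_table_column scroll-fix text-center', string=re.compile(".+"))…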