Welcome to the world of Hugging Face, the heart of the AI community!
## What Is Hugging Face? ð€
Hugging Face is the central platform in AI and machine learning (ML) where developers and researchers share models, datasets, and tools and collaborate with one another. The company was originally founded in New York in 2016 to build a chatbot app aimed at teenagers. It has since evolved into a platform whose mission is to "democratize" machine learning, aiming to make state-of-the-art AI accessible to and usable by everyone.
Think of it as the GitHub of AI and ML: beyond code repositories, it gathers everything AI development needs, including pretrained models, datasets, and even interactive demos (Spaces). It is especially strong in natural language processing (NLP), and many cutting-edge models are released through Hugging Face.
Thanks to its open-source contributions and community-driven approach, the company has grown rapidly, taking investment from Google, Amazon, NVIDIA, and other major players and becoming a unicorn (its valuation reportedly reached $4.5 billion in 2023).
## Key Components of the Hugging Face Ecosystem ð ïž
Hugging Face is not just a place to park models. It forms a powerful ecosystem that supports AI development end to end. Let's look at its core components.
### 1. Hugging Face Hub: The Central Station for AI Resources ð
The Hugging Face Hub is the platform at the core of the ecosystem. It offers the following main features (a short programmatic example follows the list):
- Models: Hundreds of thousands of pretrained AI models (over 900,000 as of 2024; in August 2024 the platform announced it had surpassed 5 million users) are published and shared. You can find models for a wide range of tasks: NLP, computer vision, audio processing, multimodal, and more. Each model comes with a "Model Card" describing its overview, usage, limitations, biases, and so on.
- Datasets: More than 200,000 datasets are available (as of August 2023). You can access diverse data such as Wikipedia articles, news articles, image data, and audio data, and use it to train and evaluate models.
- Spaces: A feature for easily building and hosting demos of machine learning applications. Using Python libraries such as Gradio or Streamlit, or a Docker container, you can publish an interactive web app. This makes it easy to share research results and models and to gather feedback. Even the free tier can provide adequate resources (e.g., 2 vCPUs, 16 GB RAM).
- Docs: Extensive documentation, tutorials, and guides on using the Hugging Face libraries and platform.
- Community & Collaboration: Git-based repository management, discussion forums, pull-request functionality, and more encourage collaboration and discussion with other community members.
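Everything on the Hub is also reachable from code. Here is a minimal sketch, assuming the `huggingface_hub` package is installed; the search filter and downloaded file name are purely illustrative:

```python
from huggingface_hub import hf_hub_download, list_models

# Search the Hub: the five most downloaded text-classification models
for m in list_models(filter="text-classification", sort="downloads", limit=5):
    print(m.id)

# Download a single file from a model repository (cached locally)
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")
print(config_path)
```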
### 2. Transformers Library: State-of-the-Art Models Made Simple ð€
`transformers` is the most famous and most central library in the Hugging Face ecosystem. It runs on top of the major deep learning frameworks (PyTorch, TensorFlow, JAX) and is designed so that thousands of pretrained models (BERT, the GPT series, T5, ViT, and more) can be downloaded and used for inference or fine-tuning in just a few lines of code.
Main features:
- NLP tasks such as text classification, named entity recognition, question answering, summarization, translation, and text generation
- Computer vision tasks such as image classification, object detection, and segmentation
- Audio tasks such as automatic speech recognition and audio classification
- Multimodal tasks such as table QA, OCR, and video classification
- Easy interoperability between different frameworks
- Utilities for training and evaluating models
A quick usage example (the pipeline API):
```python
from transformers import pipeline

# Load a sentiment-analysis pipeline (the model is downloaded automatically)
classifier = pipeline("sentiment-analysis")

# Run the analysis on some input text
result = classifier("Hugging Face is revolutionizing the AI community!")
print(result)
# Example output: [{'label': 'POSITIVE', 'score': 0.9998...}]

# Text-generation pipeline
generator = pipeline("text-generation", model="gpt2")
result = generator("In a world where AI is becoming increasingly powerful,", max_length=50, num_return_sequences=1)
print(result)
# Example output: [{'generated_text': 'In a world where AI is becoming increasingly powerful, the need for...'}]
```
The `transformers` library frees you from chores like implementing complex model architectures and loading pretrained weights, so you can get straight to applying models to your task.
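Fine-tuning is almost as compact. The following is a minimal sketch (not a tuned recipe) using the library's `Trainer` API on a small slice of the IMDb dataset; the output directory and hyperparameters are arbitrary placeholder values:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A 1,000-example slice keeps the sketch fast; a real run would use the full split
dataset = load_dataset("imdb", split="train[:1000]")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="imdb-out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```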
### 3. Datasets Library: Efficient Data Handling ð
The `datasets` library is built for efficiently loading, processing, and sharing large datasets. It gives you easy access to the many datasets on the Hugging Face Hub and can also handle local custom datasets (CSV, JSON, and so on).
Key characteristics:
- Load a dataset from the Hub with one line of code: `load_dataset("dataset_name")`
- Memory-efficient processing: Apache Arrow is used internally, so data is processed directly from disk instead of being loaded entirely into memory, which makes huge datasets tractable.
- Powerful preprocessing: supports mapping, filtering, sharding, and more.
- Export to a variety of formats (Pandas DataFrame, NumPy arrays, etc.)
Usage example:
```python
from datasets import load_dataset

# Load the IMDb movie-review dataset
imdb_dataset = load_dataset("imdb")

# Inspect the dataset structure
print(imdb_dataset)
# Example output:
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
# })

# Look at one training example
print(imdb_dataset["train"][0])
# Example output: {'text': "...", 'label': 1}  # label 0: negative, 1: positive
```
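To illustrate the preprocessing and export features listed above, here is a small sketch that continues from the `imdb_dataset` just loaded (the CSV file name in the last line is purely hypothetical):

```python
# Filter: keep only short reviews
short_reviews = imdb_dataset["train"].filter(lambda ex: len(ex["text"]) < 500)

# Map: derive a new column from an existing one
with_len = short_reviews.map(lambda ex: {"n_chars": len(ex["text"])})

# Export to a Pandas DataFrame
df = with_len.to_pandas()
print(df.head())

# Local files work the same way (hypothetical file name):
# local_ds = load_dataset("csv", data_files="my_data.csv")
```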
### 4. Tokenizers Library: Turning Text into Numbers ð¢
The `tokenizers` library is a high-performance library specialized in tokenization, the process of converting text into a form models can understand (a sequence of token IDs). It is implemented in Rust and runs extremely fast.
Main features:
- Supports the major tokenization algorithms, including BPE (Byte-Pair Encoding), WordPiece, and Unigram.
- Text normalization (lowercasing, accent stripping, etc.).
- Automatic insertion of special tokens ([CLS], [SEP], [PAD], etc.).
- Preserves alignment information between the original text and the tokens.
- Fast vocabulary training on large datasets.
Many of the tokenizers in the `transformers` library (especially those labeled "Fast") use this `tokenizers` library internally.
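As a small sketch of what that looks like in practice (assuming `bert-base-uncased`, whose fast tokenizer is backed by this library), note the automatically inserted [CLS]/[SEP] tokens and the token-to-word alignment:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("Hugging Face tokenizers are fast!")
print(enc.tokens())
# e.g. ['[CLS]', 'hugging', 'face', 'token', '##izer', '##s', 'are', 'fast', '!', '[SEP]']
print(enc.word_ids())
# e.g. [None, 0, 1, 2, 2, 2, 3, 4, 5, None]  # each token's source-word index
```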
### 5. Accelerate Library: Distributed Training Made Easy ð
The `accelerate` library lets PyTorch code target a single GPU, multiple GPUs, TPUs, and even advanced distributed setups such as mixed-precision training or DeepSpeed, with minimal changes (often just a few lines).
It abstracts away complex distributed-environment configuration so that developers can focus on what matters: model development and experimentation.
Usage example (outline):
```python
from accelerate import Accelerator
import torch

# ... (define the model, dataloaders, and optimizer) ...

# Initialize the Accelerator
accelerator = Accelerator()

# Prepare the model, optimizer, and dataloaders
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# A normal PyTorch training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        # Use accelerator.backward() for the backward pass
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    # ... (evaluation loop, etc.) ...
```
This code automatically does the right thing for whatever distributed configuration it is launched under.
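Where does that configuration come from? It is supplied outside the script: you typically describe your hardware once with the interactive `accelerate config` command, then start training with `accelerate launch your_script.py` (the script name here is a placeholder), which spawns the right processes for the configured setup.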
### 6. Evaluate Library: Standardized Evaluation ð
The `evaluate` library is a tool for evaluating machine learning models and datasets in a simple, standardized way. Dozens of evaluation metrics from NLP, computer vision, reinforcement learning, and other domains, including accuracy, precision, recall, F1, BLEU, and ROUGE, are available through a unified interface.
Main features:
- Load and compute a metric with ease: `evaluate.load("metric_name")`
- Supports comparisons between models (Comparisons) and evaluation of dataset properties (Measurements).
- Hub integration: new metrics can be created and shared as Spaces on the Hub.
- Each metric ships with a "Metric Card" describing its usage and caveats.
Usage example:
```python
import evaluate

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Predictions and references (ground-truth labels)
predictions = [0, 1, 0, 1, 1]
references = [0, 1, 1, 1, 0]

# Compute the metric
results = accuracy.compute(predictions=predictions, references=references)
print(results)
# Example output: {'accuracy': 0.6}

# BLEU score (machine-translation evaluation)
bleu = evaluate.load("bleu")
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]
results = bleu.compute(predictions=predictions, references=references)
print(results)
# Example output: {'bleu': 1.0, ...}
```
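When you need several metrics at once, `evaluate.combine` bundles them behind the same `compute` interface; a quick sketch:

```python
import evaluate

# Bundle common classification metrics into one object
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
print(clf_metrics.compute(predictions=[0, 1, 0, 1], references=[0, 1, 1, 1]))
# Example output: {'accuracy': 0.75, 'f1': ..., 'precision': ..., 'recall': ...}
```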
## Let's Try Hugging Face! ð
Getting started with the Hugging Face ecosystem is easy.
### 1. Installation

First, install the main libraries. Typically you install `transformers`, plus `datasets`, `evaluate`, and others as needed.

```bash
pip install transformers datasets evaluate accelerate torch  # or tensorflow / jax
```

(Note: one of the PyTorch, TensorFlow, or JAX backends is required.)
### 2. Running a Simple Pipeline

As shown earlier, the `pipeline` API in `transformers` makes specific tasks remarkably easy to run.
```python
from transformers import pipeline

# Zero-shot classification pipeline (classifies without task-specific training labels)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sequence_to_classify = "I spent my whole day programming."
candidate_labels = ['travel', 'cooking', 'dancing', 'programming']

result = classifier(sequence_to_classify, candidate_labels)
print(result)
# Example output:
# {'sequence': 'I spent my whole day programming.',
#  'labels': ['programming', 'cooking', 'travel', 'dancing'],
#  'scores': [0.98..., 0.00..., 0.00..., 0.00...]}
```
### 3. Using Models and Tokenizers Directly

When you need finer control, load the model and tokenizer directly.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Model name (you can find these on the Hub)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Input text
text = "This movie was great!"

# Tokenize the text into the model's input format
inputs = tokenizer(text, return_tensors="pt")  # pt: PyTorch tensors

# Run inference
with torch.no_grad():  # disable gradient computation
    outputs = model(**inputs)

# Turn the output (logits) into probabilities and predict the label
probabilities = torch.softmax(outputs.logits, dim=-1)
predicted_class_id = torch.argmax(probabilities).item()

# Get the ID-to-label mapping from the model config
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")
# Example output: Predicted label: POSITIVE
```
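The same pattern extends naturally to batches. Here is a short sketch that reuses the model and tokenizer above, with padding so the inputs share one length:

```python
# Batched inference, continuing from the example above
texts = ["This movie was great!", "This movie was terrible..."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

for text, class_id in zip(texts, logits.argmax(dim=-1).tolist()):
    print(text, "->", model.config.id2label[class_id])
```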
### 4. Exploring the Hugging Face Hub

Do visit the Hugging Face website. A huge range of models, datasets, and Spaces are published there, and the search and filtering features help you find the resources you need.
- Models page: narrow models down by task, library, language, and more.
- Datasets page: explore datasets the same way.
- Spaces page: try interesting demos built by other users, or create your own.
Creating an account lets you upload models and datasets, create private repositories, join discussions, and more.
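Uploading is itself only a few lines of code. A sketch, assuming you have an account and an access token, and reusing the model and tokenizer loaded in step 3 (the repository name is a hypothetical placeholder):

```python
from huggingface_hub import login

login()  # prompts for your access token

# Any transformers model/tokenizer can then be pushed to your namespace
model.push_to_hub("my-username/my-finetuned-model")
tokenizer.push_to_hub("my-username/my-finetuned-model")
```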
## Benefits of Using Hugging Face âš
- Development efficiency: leveraging pretrained models and tools dramatically cuts the time and cost of building from scratch.
- Access to the state of the art: cutting-edge models and datasets published by researchers and developers worldwide are within easy reach.
- Reproducibility and easy sharing: sharing models, code, and environments through the Hub improves the reproducibility of research and development and fosters collaboration.
- An active community: through the forums and GitHub repositories, you can ask questions, get feedback, and contribute yourself.
- Abundant learning resources: official documentation, tutorials, blog posts, courses, and more offer plenty of opportunities to learn, from beginner to expert.
## Ethical AI and a Commitment to Responsible Development
While Hugging Face pushes forward the democratization of AI technology, it also recognizes the importance of its ethical dimensions and of responsible development. For example, it provides information about a model's potential biases and limitations through Model Cards, and works to promote discussion around AI ethics.
When using Hugging Face's tools and platform, developers are likewise expected to consider the societal impact of the AI they build and use, and to stay mindful of fairness, transparency, and accountability.
## Conclusion
Hugging Face is a powerful ally for exploring and building in the world of AI and machine learning. Through its rich resources, easy-to-use libraries, and active community, it dramatically lowers the barrier to AI development and offers an environment where anyone can take part in innovation.
I hope this introductory article serves as your first step into Hugging Face. Go try it for yourself and experience what it can do! ð
To the official documentation